java - How can I get all the XML out of a raw text file?

Question

I have a log file, and I need to write a program that extracts all the XML from it. The file looks like this:

text
text
xml
text
xml
text 
etc

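Can you advise whether it is better to use regular expressions or something else? Maybe it is possible to do this with dom4j?
When I try to use regular expressions, I run into the problem that the text parts also contain <> tags.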

Update 1: XML example

  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc
  SOAP message:
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
 here is body part of valid xml
</soapenv:Body>
</soapenv:Envelope>
text,text,text,text
symbols etc

Thanks.

score 1 · Accepted Answer

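If the XML is always on a single line, you can iterate over the lines and check whether each one starts with <. If it does, try to parse the whole line as a DOM document.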

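import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlLineExtractor {
    public static void main(String[] args) throws Exception {
        String xml = "hello\n" + //
                "this is some text\n" + //
                "<foo>I am XML</foo>\n" + //
                "<bar>me too!</bar>\n" + //
                "foo is bar\n" + //
                "<this is not valid XML\n" + //
                "<foo><bar>so am I</bar></foo>\n";
        List<Document> docs = new ArrayList<Document>(); // the documents we can find
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        for (String line : xml.split("\n")) {
            if (line.startsWith("<")) {
                // Candidate line: keep it only if it parses as a complete document.
                try {
                    ByteArrayInputStream bis = new ByteArrayInputStream(line.getBytes(StandardCharsets.UTF_8));
                    Document doc = docBuilder.parse(bis);
                    docs.add(doc);
                } catch (Exception e) {
                    System.out.println("Problem parsing line: `" + line + "` as XML");
                }
            } else {
                System.out.println("Discarding line: `" + line + "`");
            }
        }
        System.out.println("\nFound " + docs.size() + " XML documents.");
        // Serialize each parsed document back to text, without the <?xml ...?> header.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        for (Document doc : docs) {
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(sw));
            System.out.println(sw.toString());
        }
    }
}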

Output:

Discarding line: `hello`
Discarding line: `this is some text`
Discarding line: `foo is bar`
Problem parsing line: `<this is not valid XML` as XML

Found 3 XML documents.
<foo>I am XML</foo>
<bar>me too!</bar>
<foo><bar>so am I</bar></foo>
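
The example above assumes each XML document fits on one line, while the log in the question spreads each SOAP envelope over several lines. Here is a minimal sketch for that case, assuming every message starts with <soapenv:Envelope and ends with </soapenv:Envelope> as in the sample log. The class name EnvelopeExtractor and its extract method are hypothetical, not part of any library. It scans for the two markers and keeps only the blocks that the JDK's DOM parser accepts as well-formed, so stray < > characters in the surrounding text are ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: pulls multi-line SOAP envelopes out of a log string.
final class EnvelopeExtractor {
    // Assumed start/end markers, copied from the sample log in the question.
    private static final String OPEN = "<soapenv:Envelope";
    private static final String CLOSE = "</soapenv:Envelope>";

    static List<String> extract(String log) {
        DocumentBuilder builder;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            builder = factory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser", e);
        }
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());

        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = log.indexOf(OPEN, from);
            if (start < 0) break;
            int end = log.indexOf(CLOSE, start);
            if (end < 0) break;
            end += CLOSE.length();
            String candidate = log.substring(start, end);
            try {
                builder.parse(new InputSource(new StringReader(candidate)));
                found.add(candidate);
                from = end;                   // continue after this envelope
            } catch (SAXException | IOException e) {
                from = start + OPEN.length(); // not well-formed: retry from the next start marker
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String envelope = "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
                + "<soapenv:Body>\n here is body part of valid xml\n</soapenv:Body>\n"
                + "</soapenv:Envelope>";
        String log = "  SOAP message:\n" + envelope + "\ntext,text <not xml> text\n"
                + "  SOAP message:\n" + envelope + "\nsymbols etc\n";
        for (String xml : extract(log)) {
            System.out.println(xml);
            System.out.println("----");
        }
    }
}
```

For a large log file, the same scan could be done while reading line by line instead of loading the whole file into one string. The marker strings would need adjusting if the log uses a different namespace prefix.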

score 1 · Accepted Answer

If each such part is on its own line, this should be fairly simple:

s = s.replaceAll("(?m)^(?![ \\t]*<).*\\n?", "");
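
The line-filtering idea can be tried on a small input. The sketch below assumes \n line endings and uses a negative lookahead, (?![ \t]*<), to decide whether a line should go, so indented XML lines and XML lines that follow a blank line are kept. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: drops every line whose first non-blank character is not '<'.
final class NonXmlLineFilter {
    // (?m) makes ^ match at each line start. The lookahead uses [ \t]* rather
    // than \s* so that it cannot run across a line break into the next line.
    private static final Pattern NON_XML_LINE =
            Pattern.compile("(?m)^(?![ \\t]*<).*\\n?");

    static String keepXmlLines(String s) {
        return NON_XML_LINE.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "hello\n"
                + "<foo>I am XML</foo>\n"
                + "\n"
                + "  <bar>me too!</bar>\n"
                + "foo is bar\n";
        // Only the <foo> line and the indented <bar> line remain.
        System.out.print(keepXmlLines(s));
    }
}
```

This only filters by the first character of each line. It does not check that what remains is well-formed, and it would drop text lines inside a multi-line element, such as the body line in the question's sample.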
