java - How to delete all lines in a text before the line that contains a specific word

Question

I have a large HTML string that contains lines of empty HTML before the actual HTML code, and I don't need those lines.

messageContent contains something like this:

        <td width="35"><br /> </td> 
        <td width="1"><br /> </td> 
        <td width="18"><br /> </td> 
        <td width="101"><br /> </td> 
        <td width="7"><br /> </td> 
        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I want to delete/replace everything before the line that contains "Geachte", "heer", "mevrouw".

As output, I want to keep only the following:

        <td rowspan="21" colspan="16" width="689">Geachte&nbsp;heer/mevrouw,<br /> &nbsp;<br /> Wij&nbsp;hebben&nbsp;uw&nbsp;inzending&nbsp;ontvangen&nbsp;en&nbsp;gecontroleerd.&nbsp;Hierbij&nbsp;het&nbsp;verslag&nbsp;van&nbsp;de&nbsp;controle.<br /> &nbsp;<br />

I thought of using a BufferedReader to loop over the text line by line.

try {
    reader = new BufferedReader(
            new StringReader(messageContent));
} catch (Exception failed) { }

try {
    while ((string = reader.readLine()) != null) {

        if ((string.length() > 0) && (string.contains("Geachte"))) {
            //remove all lines before this string
        }
    }
} catch (IOException e) { }

How can I achieve this?

score 2 · Accepted Answer

This code does it.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public String cutText(String messageContent) {
    boolean matchFound = false;
    StringBuilder output = new StringBuilder();
    // try-with-resources closes the reader when the loop is done.
    try (BufferedReader reader = new BufferedReader(new StringReader(messageContent))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // Start keeping lines once the marker word has been seen.
            if (!matchFound && line.contains("Geachte")) {
                matchFound = true;
            }
            if (matchFound) {
                output.append(line).append("\n");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return output.toString();
}
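Since the content is already a string in memory, the same cut can also be made without a reader. The following is only a sketch under a few assumptions: lines are separated by \n, and when the marker is missing the input is returned unchanged (the loop above returns an empty string in that case). The names CutBeforeMarker and cutBefore are made up for illustration.

```java
public class CutBeforeMarker {

    /**
     * Returns the text starting at the line that contains the marker.
     * Assumption: if the marker does not occur, the input is returned unchanged.
     */
    public static String cutBefore(String text, String marker) {
        int hit = text.indexOf(marker);
        if (hit < 0) {
            return text;
        }
        // lastIndexOf returns -1 when no newline precedes the hit, so +1 yields 0.
        int lineStart = text.lastIndexOf('\n', hit) + 1;
        return text.substring(lineStart);
    }

    public static void main(String[] args) {
        String html = "<td width=\"35\"><br /> </td>\n"
                + "<td width=\"1\"><br /> </td>\n"
                + "<td rowspan=\"21\">Geachte&nbsp;heer/mevrouw,<br />\n"
                + "&nbsp;<br />";
        System.out.println(cutBefore(html, "Geachte"));
    }
}
```

Unlike the reader loop, this keeps the original line endings and does not append a trailing newline.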

score 1 · Accepted Answer

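The easiest way is to use XPath. First, you need to know the correct path to the tr element you want to remove. To find it, open the Elements tab of the Chrome Developer Tools (F12 on Linux/Windows, Cmd+Alt+I on Mac), select the element you are after (using the magnifying glass), then right-click it and choose Copy XPath.

Since your content is a string (not a file), you can copy it once (while debugging, for example), paste it into an html file, and open that file in Chrome. It is safer to give the parent of the faulty block a unique id, because the XPath then becomes shorter and is less likely to change.

This gives you something like: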

//*[@id="answers-header"]/div/h2

First, you need to convert the string into a document:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new InputSource(new StringReader("your string")));

Then apply the XPath to the document:

XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(<xpath_expression>);
NodeList nl = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
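For the table in the question, one expression that could stand in for the placeholder selects every cell that comes before the greeting cell. This is a sketch, assuming the markup has first been made well-formed XML (the JDK parser rejects undeclared entities such as &nbsp;); the class name PrecedingCells and the sample row are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PrecedingCells {

    // All td siblings that precede the td whose text contains "Geachte".
    static final String EXPR = "//td[contains(., 'Geachte')]/preceding-sibling::td";

    public static NodeList select(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.compile(EXPR).evaluate(doc, XPathConstants.NODESET);
        } catch (Exception e) {
            // Parsing fails on input that is not well-formed XML.
            throw new IllegalArgumentException("could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, cleaned-up row: no &nbsp; entities, all tags closed.
        String xml = "<tr><td width=\"35\"><br /></td><td width=\"1\"><br /></td>"
                + "<td rowspan=\"21\">Geachte heer/mevrouw,<br /></td></tr>";
        System.out.println(select(xml).getLength());
    }
}
```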

Remove the unwanted nodes:

for (int i = 0; i < nl.getLength(); i++) {
      Element node = (Element) nl.item(i);
      // Detach each matched node from its parent.
      node.getParentNode().removeChild(node);
}

Then you need to convert the document back to a string.
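The answer does not show this last step. One common way to do it with the JDK alone is an identity Transformer that writes to a StringWriter. This is a sketch and not necessarily what the answerer had in mind; the class name DomToString, its methods, and the sample markup are illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomToString {

    /** Serializes a DOM document to a string without the XML declaration. */
    public static String serialize(Document doc) {
        try {
            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize the document", e);
        }
    }

    /** Parses a well-formed XML string and serializes it again. */
    public static String roundTrip(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return serialize(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<tr><td>Geachte heer/mevrouw,</td></tr>"));
    }
}
```

In the flow above, serialize would be called on doc after the removal loop has run.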
