java - How do I preserve line breaks when using jsoup to convert HTML to plain text?

Question

I have the following code:

 import org.jsoup.Jsoup;

 public class NewClass {
     public String noTags(String str) {
         return Jsoup.parse(str).text();
     }

     public static void main(String[] args) {
         String strings = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
         "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";

         NewClass text = new NewClass();
         System.out.println(text.noTags(strings));
     }
 }

And I get this result:

hello world yo googlez

But I want the line break preserved:

hello world
yo googlez

I've looked at jsoup's TextNode#getWholeText(), but I can't figure out how to use it.

If the markup I parse contains a <br>, how can I get a line break in the resulting output?

score 109 · Accepted Answer

The real solution that preserves line breaks looks like this:

public static String br2nl(String html) {
    if(html==null)
        return html;
    Document document = Jsoup.parse(html);
    document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
    document.select("br").append("\\n");
    document.select("p").prepend("\\n\\n");
    String s = document.html().replaceAll("\\\\n", "\n");
    return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}

It satisfies the following requirements:

If the original HTML contains a newline (\n), it is preserved.
If the original HTML contains br or p tags, they are converted to a newline (\n).

score 46 · Accepted Answer

With

Jsoup.parse("A\nB").text();

you get the output

"A B"

and not

A

B

For this I'm using:

descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
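
To see what the `(?i)<br[^>]*>` pattern in the snippet above actually matches, here is a minimal sketch that uses only the JDK's regex classes, without jsoup. The class name `BrPatternDemo` and the helper `markBreaks` are hypothetical names chosen for illustration. Only the pattern and the `br2n` placeholder come from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrPatternDemo {
    // Same pattern as in the answer: case-insensitive "<br", then anything up to the next ">".
    private static final Pattern BR = Pattern.compile("(?i)<br[^>]*>");

    // Hypothetical helper: replaces every matched tag with the placeholder.
    // quoteReplacement keeps "$" or "\" in the placeholder from being interpreted.
    public static String markBreaks(String html, String placeholder) {
        return BR.matcher(html).replaceAll(Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        // Covers <br>, <BR/> and a <br> with attributes.
        System.out.println(markBreaks("a<br>b<BR/>c<br class=\"x\">d", "br2n"));
        // prints "abr2nbbr2ncbr2nd"
    }
}
```

Note that the pattern also matches any tag whose name merely starts with `br`, and that a fixed placeholder such as `br2n` produces wrong output if that string already occurs in the text.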

score 44 · Accepted Answer

Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));

We're using this method here:

public static String clean(String bodyHtml,
                       String baseUri,
                       Whitelist whitelist,
                       Document.OutputSettings outputSettings)

By passing it Whitelist.none() we make sure that all HTML is removed.

By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.

score 7 · Accepted Answer

You can traverse a specific element:

public String convertNodeToText(Element element)
{
    final StringBuilder buffer = new StringBuilder();

    new NodeTraversor(new NodeVisitor() {
        boolean isNewline = true;

        @Override
        public void head(Node node, int depth) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                String text = textNode.text().replace('\u00A0', ' ').trim();                    
                if(!text.isEmpty())
                {                        
                    buffer.append(text);
                    isNewline = false;
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (!isNewline)
                {
                    if((element.isBlock() || element.tagName().equals("br")))
                    {
                        buffer.append("\n");
                        isNewline = true;
                    }
                }
            }                
        }

        @Override
        public void tail(Node node, int depth) {                
        }                        
    }).traverse(element);        

    return buffer.toString();               
}

And for your code:

String result = convertNodeToText(Jsoup.parse(html));

score 3 · Accepted Answer

Try this:

public String noTags(String str){
    Document d = Jsoup.parse(str);
    TextNode tn = new TextNode(d.body().html(), "");
    return tn.getWholeText();
}

score 3 · Accepted Answer

Try this using jsoup:

    doc.outputSettings(new OutputSettings().prettyPrint(false));

    //select all <br> tags and append \n after that
    doc.select("br").after("\\n");

    //select all <p> tags and prepend \n before that
    doc.select("p").before("\\n");

    //get the HTML from the document, and retaining original new lines
    String str = doc.html().replaceAll("\\\\n", "\n");

score 1 · Accepted Answer

Based on user121196's and Green Beret's answers with the selects and <pre>s, the only solution that works for me is:

org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();

score 1 · Accepted Answer

/**
 * Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
 * @param html
 * @param linebreakerString
 * @return the html as String with proper java newlines instead of br
 */
public static String replaceBrWithNewLine(String html, String linebreakerString){
    String result = "";
    if(html.contains(linebreakerString)){
        result = replaceBrWithNewLine(html, linebreakerString+"1");
    } else {
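        // Replace all html line breaks with the placeholder; quoteReplacement keeps "$" and "\" in it literal.
        result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", java.util.regex.Matcher.quoteReplacement(linebreakerString))).text();
        result = result.replace(linebreakerString, "\n"); // literal replace, so regex metacharacters in the placeholder are harmless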
    }
    return result;
}

It is used by calling it with the HTML in question, containing the br, along with whatever string you want to use as the temporary newline placeholder. For example:

replaceBrWithNewLine(element.html(), "br2n")

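The recursion ensures that the string you use as the newline placeholder never actually occurs in the source HTML, because it keeps appending "1" until the placeholder string is no longer found in the HTML. It also avoids the formatting issues that the Jsoup.clean methods seem to run into with special characters.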
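
The placeholder selection described above can also be written as a loop. The following is a minimal sketch, assuming the only goal is to find a placeholder that does not occur in the input. `PlaceholderDemo` and `uniquePlaceholder` are hypothetical names that do not appear in the answer, and the jsoup call is left out so the snippet runs with the JDK alone.

```java
public class PlaceholderDemo {
    // Appends "1" to the candidate until it no longer occurs in the html,
    // mirroring the recursion in the answer with a loop.
    public static String uniquePlaceholder(String html, String candidate) {
        if (candidate.isEmpty()) {
            // Every string contains "", so the loop below would never see a miss.
            throw new IllegalArgumentException("placeholder must not be empty");
        }
        String placeholder = candidate;
        while (html.contains(placeholder)) {
            placeholder = placeholder + "1";
        }
        return placeholder;
    }

    public static void main(String[] args) {
        // "br2n" and "br2n1" both occur in the input, so two suffixes are needed.
        System.out.println(uniquePlaceholder("see br2n and br2n1 here", "br2n"));
        // prints "br2n11"
    }
}
```

The result could then be passed as the `linebreakerString` argument of `replaceBrWithNewLine`, which would skip the recursive branch because the placeholder is already known to be absent.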
