html - Extracting thread heads and thread replies from a forum

Question

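I want to extract only the users' views and replies, plus the title of the thread head, from a forum. With the code below, when I give it a URL, it returns everything. I need only the thread heading, which is defined in the title tag, and the user replies, which sit between the div content tags. How can I extract just these? Please also explain how to write the result to a txt file.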

package extract;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
    public void SimpleParse()
    {
        try
        {
            Document doc = Jsoup.connect("url").get();

            doc.body().wrap("<div></div>");
            doc.body().wrap("<pre></pre>");

            String text = doc.text();

            // Converting nbsp entities
            text = text.replaceAll("\u00A0", " ");

            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String args[])
    {
        TestJsoup tjs = new TestJsoup();
        tjs.SimpleParse();
    }
}

score 1 · Accepted Answer

I used different code to collect the data from this specific tag.

Elements content = doc.getElementsByTag("blockquote");

Elements k = doc.select("[postcontent restore]");

content.select("blockquote").remove();

content.select("br").remove();

content.select("div").remove();

content.select("a").remove();

content.select("b").remove();
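
The snippet above can be read as: pick the post bodies, then strip nested markup before taking the text. A minimal sketch of that idea follows. It assumes a page in which each reply sits in a `blockquote` carrying the classes `postcontent` and `restore`; that is only inferred from the selector string in the answer and may not match the real forum. The class `ReplyCleaner`, the method `extractReplies`, and the sample HTML are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ReplyCleaner {

    // Returns the plain text of each reply, with nested divs, links and bold runs removed.
    static List<String> extractReplies(String html) {
        Document doc = Jsoup.parse(html);
        List<String> replies = new ArrayList<>();
        // Assumption: reply bodies are <blockquote class="postcontent restore">.
        for (Element post : doc.select("blockquote.postcontent.restore")) {
            // Drop nested elements that are not part of the reply text itself.
            post.select("div, a, b").remove();
            replies.add(post.text());
        }
        return replies;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup; a real page would come from Jsoup.connect(url).get().
        String html = "<html><head><title>Sample thread</title></head><body>"
                + "<blockquote class=\"postcontent restore\">First reply"
                + "<div class=\"quote\">quoted text</div></blockquote>"
                + "<blockquote class=\"postcontent restore\">Second reply "
                + "<a href=\"#\">link</a></blockquote>"
                + "</body></html>";
        extractReplies(html).forEach(System.out::println);
    }
}
```

Selecting by class (`blockquote.postcontent.restore`) rather than by tag alone keeps quoted posts and unrelated blockquotes out of the result, provided the forum really uses those class names.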

score 1 · Accepted Answer

Why do you wrap the body element in div and pre tags?

The title element can be selected like this:

Document doc = Jsoup.connect("url").get();

Element titleElement = doc.select("title").first();
String titleText = titleElement.text();

// Or shorter ...

String titleText = doc.select("title").first().text();

The div tags:

// Document 'doc' as above

Elements divTags = doc.select("div");


for( Element element : divTags )
{
    // Do something there ... eg. print each element
    System.out.println(element);

    // Or get the Text of it
    String text = element.text();
}
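
The question also asks how to write the result to a txt file. The two steps above (title, then div elements) can be combined with a file write roughly as sketched here. This assumes the replies live in `div` elements with the class `content`, which is a guess based on the wording of the question. The class `ThreadExporter`, its methods, the sample HTML, and the file name `thread.txt` are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ThreadExporter {

    // Collects the thread title followed by the text of each reply.
    static List<String> extractLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        // Document.title() returns an empty string when the page has no <title>.
        lines.add(doc.title());
        // Assumption: each reply is wrapped in <div class="content">.
        for (Element reply : doc.select("div.content")) {
            lines.add(reply.text());
        }
        return lines;
    }

    // Writes one entry per line to a UTF-8 text file.
    static void writeThread(String html, Path target) throws IOException {
        Files.write(target, extractLines(html), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample markup standing in for a downloaded forum page.
        String html = "<html><head><title>Sample thread</title></head><body>"
                + "<div class=\"content\">First reply</div>"
                + "<div class=\"content\">Second reply</div>"
                + "</body></html>";
        writeThread(html, Paths.get("thread.txt"));
    }
}
```

For a live page, the `Document` would come from `Jsoup.connect(url).get()` instead of `Jsoup.parse(html)`; the selection and the file write stay the same.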

The Jsoup Selector API overview covers the full selector syntax and will help you find any kind of element you need.
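
As a rough illustration of the selector syntax, the sketch below runs a handful of common CSS queries against a small document and counts the matches. The class `SelectorSamples`, the method `countMatches`, and the sample markup (including the `content` and `signature` class names and the `post-1` id) are hypothetical and not taken from any real forum.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSamples {

    // Counts how many elements each CSS query matches in the given HTML.
    static Map<String, Integer> countMatches(String html, String... queries) {
        Document doc = Jsoup.parse(html);
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String query : queries) {
            counts.put(query, doc.select(query).size());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one reply with a link, one signature block.
        String html = "<div id=\"post-1\" class=\"content\">"
                + "<p>Hi <a href=\"/u/1\">user</a></p></div>"
                + "<div class=\"signature\"><p>sig</p></div>";
        countMatches(html,
                "div",              // by tag name
                "div.content",      // by tag and class
                "#post-1",          // by id
                "a[href]",          // by attribute presence
                "div.content > p")  // direct child of a matching parent
            .forEach((query, n) -> System.out.println(query + " -> " + n));
    }
}
```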
