html-parsing - Parsing HTML and extracting text

Question

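There are many resources for parsing HTML pages and extracting their text content; Jsoup is one example. In my case, I want to extract the text content with each sentence tagged by the HTML tags it occurs in. For example, consider this page: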

<html>
<head><title>Test Page</title>
<body>
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
</body>
</html>

I expect the output to look like this:

<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.

In other words, I want to keep certain HTML tags in the text content extracted from the page.

score 0 · Accepted Answer

You can do this in two steps. First, build the DOM tree with Jsoup, as you described. Then process it with an XSL filter, which lets you extract only the tags you are interested in.
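The second step could look roughly like the sketch below. It uses only the JDK's `javax.xml.transform` API and assumes the DOM built in the first step has already been serialized as well-formed XHTML (closed `<head>` and `<p>` tags). The class name `XslTagFilter`, the `SAMPLE` input, and the stylesheet itself are hypothetical illustrations, not something given in the answer.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XslTagFilter {
    // Assumed input: the question's page, already normalized to well-formed XHTML.
    static final String SAMPLE =
          "<html><head><title>Test Page</title></head><body>"
        + "<h1>This is a test page</h1>"
        + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.</p>"
        + "</body></html>";

    // Hypothetical stylesheet: copies every h1 and p element, inline markup included,
    // and drops everything else (for example the title).
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<xsl:copy-of select='//h1 | //p'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XHTML string and returns the filtered markup.
    static String filter(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transform failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(filter(SAMPLE));
    }
}
```

To keep a different set of tags, only the `select` expression in the stylesheet would need to change.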

score 0 · Accepted Answer

To get that result, you can use this:

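import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

final String html = "<html>"
        + "<head><title>Test Page</title>"
        + "<body>"
        + "<h1>This is a test page</h1>"
        + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
        + "</body>"
        + "</html>";

// Parse the String into a Jsoup Document
Document doc = Jsoup.parse(html);
// Collect the direct children of <body> (here the <h1> and <p> elements)
Elements body = doc.body().children();

// Do further things here ...
System.out.println(body); // prints the elements' outer HTML, tags included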

Instead of the String html, you can also load a file or a website; jsoup supports all of these.

In this example, body contains the HTML you posted as the expected result.
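As a rough illustration of the other input sources mentioned above, the sketch below reads the same page from a file and shows the call for fetching a URL. `Jsoup.parse(File, String)` and `Jsoup.connect(String).get()` are jsoup calls; the class name `LoadHtmlSources`, the helper names, the temp file, and the UTF-8 charset are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LoadHtmlSources {
    // Parses a local HTML file; UTF-8 is an assumption about the file's encoding.
    static Document fromFile(Path path) {
        try {
            return Jsoup.parse(path.toFile(), "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fetches and parses a page over HTTP; needs network access, so it is not called below.
    static Document fromUrl(String url) {
        try {
            return Jsoup.connect(url).get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the question's page to a temp file, parses it, and returns
    // the outer HTML of the children of <body>.
    static String bodyHtmlOfSampleFile() {
        String html = "<html><head><title>Test Page</title><body>"
                + "<h1>This is a test page</h1>"
                + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
                + "</body></html>";
        try {
            Path tmp = Files.createTempFile("test-page", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return fromFile(tmp).body().children().outerHtml();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bodyHtmlOfSampleFile());
    }
}
```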

Or do you need to select something like "a p tag that follows an h1"?

Either way, take a look at the Jsoup Selector API.
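A minimal sketch of what such a selection might look like, assuming the same test page: `h1 + p` uses the adjacent-sibling combinator from jsoup's CSS selector syntax, and `h1, p` selects both kinds of element. The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectorExample {
    static final String HTML = "<html>"
            + "<head><title>Test Page</title>"
            + "<body>"
            + "<h1>This is a test page</h1>"
            + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
            + "</body>"
            + "</html>";

    // Selects every <p> whose immediately preceding sibling element is an <h1>.
    static Elements paragraphsAfterHeading(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("h1 + p");
    }

    // Selects all <h1> and <p> elements in document order.
    static Elements headingsAndParagraphs(String html) {
        return Jsoup.parse(html).select("h1, p");
    }

    public static void main(String[] args) {
        // Printing Elements writes each element's outer HTML, so inline tags are kept.
        System.out.println(paragraphsAfterHeading(HTML));
        System.out.println(headingsAndParagraphs(HTML));
    }
}
```

Because the selected elements keep their child nodes, the nested `<b>` and `<em>` tags stay in the printed output.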
