ruby - Nokogiri grab text with formatting and link tags, etc.

Question

score 2 · Accepted Answer

I'd use two tactics: Nokogiri to extract the content you want, and then a blacklist/whitelist program to either strip the tags you don't want or keep the ones you do.

require 'nokogiri'
require 'sanitize'

html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'

# Parse the document, then serialize the target <div> back to an HTML string.
doc = Nokogiri::HTML(html)
html_fragment = doc.at('div#1').to_html

This captures the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

The trailing </strong></strong> is the result of the two opening <strong> tags. That might be intentional, but when closing tags are missing, Nokogiri does some fix-ups to make the HTML correct.

Pass html_fragment to the Sanitize gem:

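# Keep only the whitelisted elements and attributes; everything else is stripped.
# Note: Sanitize 3.0 and later name this method Sanitize.fragment.
doc = Sanitize.clean(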
  html_fragment,
  :elements   => %w[ a b em strong ],
  :attributes => {
    'a'    => %w[ href ],
  },
)

The returned text looks like:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags 

      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em> 

</strong></strong>

Again, because the HTML is malformed and is missing its closing </strong> tags, two closing tags appear at the end.
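To make the whitelist idea concrete, here is a toy filter in plain Ruby. It is only an illustration of the concept, not how Sanitize works internally: the `keep_whitelisted` helper and `ALLOWED_TAGS` constant are hypothetical names, the regex cannot handle real-world HTML reliably (for example a `>` inside an attribute value), and unlike the Sanitize call above it does not filter attributes.

```ruby
# Toy whitelist filter, for illustration only. Prefer Sanitize for real input.
ALLOWED_TAGS = %w[a b em strong].freeze

def keep_whitelisted(html, allowed = ALLOWED_TAGS)
  # Match any opening or closing tag and capture its element name.
  html.gsub(%r{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>}) do |tag|
    # Keep the tag verbatim if its name is whitelisted, otherwise drop it.
    allowed.include?(Regexp.last_match(1).downcase) ? tag : ''
  end
end

sample = '<p>This is a child node. with <b> bold </b> tags</p>'
puts keep_whitelisted(sample)
# prints: This is a child node. with <b> bold </b> tags
```

The <p> wrapper is dropped while <b> survives, which is the same effect the :elements list has in the Sanitize call.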
