ruby -
Extract text between tags

Question

To extract URLs, I use the following:

require 'open-uri'
html = URI.open('http://lab/links.html').read  # read the body so URI.extract receives a String
urls = URI.extract(html)

This works fine.

Now I need to extract a list of URLs that have no http or https prefix and sit between <br> tags. Because they have no http or https prefix, URI.extract does not work.

domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each prefix-less URL sits between <br> tags.
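As a rough illustration of why URI.extract finds nothing here, the sketch below compares a prefix-less string with a prefixed one. The helper name http_urls and both sample strings are made up for this example; the only library call is URI.extract, restricted to the http and https schemes.

```ruby
require 'uri'

# Hypothetical helper: return only the http/https URLs found in text.
def http_urls(text)
  URI.extract(text, %w[http https])
end

bare     = "domain1.com/index.html<br >domain2.com/home/~john/index.html"
prefixed = "http://domain1.com/index.html<br >https://domain2.com/home/~john/index.html"

p http_urls(bare)      # no scheme in the text, so nothing is matched
p http_urls(prefixed)  # both URLs carry a scheme and are picked up
```

URI.extract looks for a scheme followed by a colon, so text without one gives it nothing to anchor on.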

~~I looked at a Nokogiri XPath approach for retrieving the text that follows a tag inside a <TD>, but it did not work.~~

Output

domain1.com/index.html
domain2.com/home/~john/index.html
domain3.com/a/b/c/d/index.php

~~Intermediate solution~~

~~doc = Nokogiri::HTML(open("http://lab/noprefix_domains.html")) doc.search('br').each do |n| n.replace("\n") end puts doc~~

~~I still need to remove the remaining HTML tags (!DOCTYPE, html, body, p)...~~

Solution

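# doc is assumed to be an already parsed Nokogiri::HTML document
str = +""
# keep only text nodes and <br> elements, in document order
doc.traverse { |n| str << n.to_s if n.name == "text" || n.name == "br" }
puts str.split(/\s*<\s*br\s*>\s*/)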

Thanks.

score 2 · Accepted Answer

Assuming you already have a way to extract the sample string shown in the question, you can call split on the string like this:

str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
str.split(/\s*<\s*br\s*>\s*/)
#=> ["domain1.com/index.html", 
#    "domain2.com/home/~john/index.html",
#    "domain3.com/a/b/c/d/index.php"]

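This splits the string at every <br> tag. It also strips the whitespace around each tag and allows whitespace inside the <br> tag. If you also need to handle self-closing tags (e.g. <br />), use this regex instead: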

/\s*<\s*br\s*\/?\s*>\s*/
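A minimal sketch of that second regex in use, assuming the input is already a plain string. The constant BR_TAG, the helper split_on_br, and the sample string are names invented for this example, not part of any library.

```ruby
# Hypothetical constant holding the regex above: matches <br>, <br > and <br />,
# together with any whitespace around the tag.
BR_TAG = /\s*<\s*br\s*\/?\s*>\s*/

# Hypothetical helper: split text on every <br> variant.
def split_on_br(text)
  text.split(BR_TAG)
end

sample = "domain1.com/index.html<br />domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
puts split_on_br(sample)
```

The regex is case-sensitive, so uppercase tags such as <BR> would need the i flag. Whitespace at the very start or end of the whole string is not removed by split and would need a separate strip.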
