ruby -
Rails + Nokogiri: getting text between <br> tags

Question

From the following partial HTML, I am trying to get the text "Conducts research ... Please help us find cures!", which sits between two <br> tags, using Nokogiri.

<b>Multiple Sclerosis National Research Institute</b><br>
<!-- <b>CFC Code: 12135</b><br />     ***** This is edited by Anas -->
<a href="http://www.ms-research.org" target="_blank">http://www.ms-research.org</a><br> 
(866)-676-7400<br> 
Conducts research towards understanding, treating and halting the progression of multiple sclerosis and related diseases. Current research progress is promising. Please help us find cures!<br>
<a href="/ntn/charities/view.aspx?record_id=510">Click here for more info</a><br><br>

So far I have been able to get the name and url with this code:

url = "https://www.neighbortonation.org/ntn/charities/home.aspx"    
doc = Nokogiri::HTML(open(url))

doc.css("#site-pagecontent table table td").each do |item|
    name = item.at_css("b").text unless item.at_css("b").blank?
    url = item.at_css("a")[:href] unless item.at_css("a").blank?
end

But I got stuck trying to get the text between specific <br> tags. I tried the suggestion from "Extracting between <br> tags with Nokogiri?", but it did not seem to work. Any ideas? Should I use xpath, search, or regex?

score 3 · Accepted Answer

When talking about "text between elements" in XML, it helps to remember that text in XML is held in Text nodes. In Nokogiri, such a node is a Nokogiri::XML::Text instance.

For example, this HTML:

<p>Hello <b>World</b>!</p>

is most simply represented as:

(Element name:"p" children:[
  (Text content:"Hello ")
  (Element name:"b" children:[
    (Text content:"World")
  ])
  (Text content:"!")
])

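That <p> element has three child nodes. Often you do not need to remember this, because you usually care about the text that is a child or descendant of some element, so you find the element and then call the .text method to get a string back.

In your case, you want the most reliable way to locate a nearby element. Let's assume that <a href="...">Click here for more info</a> is always present, and that the text you want comes immediately before it.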

# Find an <a> element with specific text content
info = doc.at_xpath('//a[.="Click here for more info"]')

# Walk back to the previous element, which we assume is an always-present <br>
br   = info.previous_element

# Find the Text node immediately preceding that, and then get its contents
desc = br.previous.text

With XPath you can do this more efficiently and tersely, although it is harder for a Ruby programmer to read:

p doc.at('//a[.="Click here for more info"]/preceding-sibling::text()[1]').text
#=> " \nConducts research towards understanding, treating and halting the ...

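The above finds the anchor, uses XPath to collect all the text nodes that precede it as siblings, and then selects only the first of them. Because preceding-sibling is a reverse axis, [1] is the text node nearest to the anchor.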

score 2 · Accepted Answer

How about this:

html = '<b>Multiple Sclerosis National Research Institute</b><br> ...'
doc = Nokogiri::HTML(html)
doc.css('br')[2].next.text.strip
#=> "Conducts research towards understanding, treating and halting the progression of multiple sclerosis and related diseases. Current research progress is promising. Please help us find cures!"

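And with the live content:

require 'nokogiri'
require 'open-uri'

url = "https://www.neighbortonation.org/ntn/charities/home.aspx"
# Kernel#open no longer accepts URLs as of Ruby 3.0, so use URI.open from open-uri
doc = Nokogiri::HTML(URI.open(url))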

doc.css("#site-pagecontent table table td").each do |item|
  description = item.css('br')[2].next.text.strip unless item.css('br').empty?
  ...
end
