html -
Extracting text between <br> tags with Nokogiri?

Question

I'm trying to use Nokogiri to extract the phone number and address from this site. Both sit between <br> tags. How can I do this?

In case the site is down, here is part of the HTML I want to extract the phone number and address from.

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Alana's Cafe</strong><br>
<em>Cafe/Desserts </em>
<br>
650 348-0417
<br>
1408 Burlingame Ave
<br>
<a href="http://www.alanascafe.com/burlingame.html" target="_blank">http://www.alanascafe.com/burlingame.html</a>

</td><td align="right">
<a href="index.cfm?vid=44885" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Amber Moon Indian Restaurant and Bar</strong><br>
<em>Indian </em>

<br>
1425 Burlingame Ave


</td><td align="right">
<a href="index.cfm?vid=44872" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

score 2 · Accepted Answer

The simplest approach is something like this:

data = doc.search('em').map{|em| em.search('~ br').map{|br| br.next.text.strip}}
#=> [["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"], etc...

That is, for each em, map each following sibling br element to the text that comes right after it.

Update

To sort that into phone/address, you can do this:

data.map{|row| {:phone => row[0][/^[\d \(\)-]+$/] ? row.shift : nil, :address => row.shift}}
#=> [{:phone=>"650 348-0417", :address=>"1408 Burlingame Ave"}, etc...
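The same split can be tried without any HTML at all. The sketch below assumes rows shaped like the ones the search returns for the two entries in the question; the `PHONE` constant and the `contacts` name are introduced here for illustration only.

```ruby
# Rows in the shape produced for the two entries shown in the question.
data = [
  ["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"],
  ["1425 Burlingame Ave"]
]

# A line counts as a phone number if it holds only digits, spaces, parentheses and hyphens.
PHONE = /^[\d \(\)-]+$/

contacts = data.map do |row|
  row = row.dup                            # work on a copy so the input rows stay intact
  phone = row[0][PHONE] ? row.shift : nil  # consume the first item only if it looks like a phone
  { phone: phone, address: row.shift }     # the next item is then taken as the address
end

p contacts
```

When the first item is not a phone number it is left in place, so the following `shift` picks it up as the address.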

score 1 · Accepted Answer

Code

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
doc = Nokogiri.HTML(URI.open('http://map.burlingamedowntown.org/textdir.cfm?p=1213'))
addresses = doc.xpath('//td[strong][em]/br[3]/following-sibling::text()[1]')
p addresses.map(&:text).map(&:strip)
#=> ["1408 Burlingame Ave", "347 Primrose Rd", "305 California Dr", "1409 Burlingame Avenue", "260 Lorton Ave", "1219 Burlingame Avenue", "1108 Burlingame Avenue", "1212 Donnelly Ave", "1243 Howard Ave", "283 Lorton Avenue", "245 California Drive", "1107 Howard Ave", "1300 Howard Ave", "1216 Burlingame Avenue", "1310 Burlingame Ave", "322 Lorton Avenue", "203 Primrose Dr", "1125 Burlingame Avenue", "327 Lorton Avenue", "1451 Burlingame Ave", "221 Primrose Rd", "1101 Burlingame Ave", "", "1123 Burlingame Avenue", "1407 Burlingame Ave", "1318 Burlingame Avenue", "1213 Burlingame Avenue", "231 Park Road", "246 Lorton Ave", "1453 Burlingame Ave", "1309 Burlingame Avenue", "321 Primrose Road", "", "209 Park Road", "1207 Burlingame Avenue", "1090 Burlingame Avenue", "1223 Donnelly Ave", "243 California Dr", "1080 Howard Ave", "270 Lorton Ave", "1447 Burlingame Ave", "361 California Drive", "1160 Burlingame Avenue", "333 California Drive", "401 Primrose Road", "1100 Burlingame Avenue", "1100Howard Ave #D", "1309 Burlingame Avenue", "220 Lorton Ave", "", "1101 Howard Avenue", "266 Lorton Avenue", "240 Park Rd", "1118 Burlingame Ave", "221 Park Road", "1400 Howard Ave", "225 Primrose Road", "248 Lorton Avenue"]

How it works

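Because the HTML is not semantically marked up, the first challenge is finding only the entries that contain addresses. Viewing the source shows that they live in <td> elements on the page, so we start there.

//td - Find a <td> anywhere in the document...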

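However, this page is packed with poor markup, so we need to restrict the search to just the right table cells. In this case, the <strong> and <em> tags are used consistently in every entry and do not appear in the other cells we don't want.

//td[strong][em] - ...but make sure the <td> has at least one <strong> child element and at least one <em> child element...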

Now, we want the text after the third <br> element, so first we select only the third <br> child of each matching <td>.

//td[strong][em]/br[3] - ...then find the child <br> elements and select only the third one...

And then we get the first text node that follows it.

//td[strong][em]/br[3]/following-sibling::text()[1] - ...then find all the sibling text nodes after the <br> and select only the first one.

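This leaves us with an array of Nokogiri::XML::Text instances, so we map this array to the string text of each node, and finally map that array to one with leading and trailing whitespace stripped. This is not the fastest way to do it, but it is terse, clear, and fast enough.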

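Doing something similar for the phone numbers is left as an exercise for the reader.

Edit: Here is a slightly more robust variation, enough to handle entries that have no phone number.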

# Make all the `<br>` be real "\r\n".
doc.xpath('//td[strong][em]/br').each{ |br| br.replace("\r\n") }

# Get the text inside each entry
entries = doc.xpath('//td[strong][em]').map(&:text)

# Change the multi-line string into an array of lines
entries = entries.map{ |entry| entry.strip.split(/(?:\r\n)+/).map(&:strip) }

# Find the first line in each that has no letters in it
phones = entries.map{ |entry_lines| entry_lines.grep(/^[^a-z]+$/i).first }

# Find the first line in each that has a string of digits followed by a letter
addresses = entries.map{ |entry_lines| entry_lines.grep(/\d+ [a-z]/i).first }

# Zip and iterate them together
phones.zip(addresses).each do |phone,address|
  puts "For %s call %s" % [address,phone || "-"]
end

#=> For 1408 Burlingame Ave call 650 348-0417
#=> For 1425 Burlingame Ave call -
#=> For 347 Primrose Rd call 650-548-0300
#=> For 305 California Dr call 650 340-8642
#=> For 1409 Burlingame Avenue call 650 348-1204
#=> ...
