python - How to use Beautiful Soup to get plaintext and URLs from an HTML document?

Question

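I was using Python and regular expressions to search an HTML document and, contrary to what most people say, it worked perfectly, though I can see how it might cause problems. Anyway, I figured Beautiful Soup would be faster and easier, but I really don't know how to make it do what I did with regex, which was fairly easy but messy.

I am using the HTML of this page: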

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

Edit:

Here is the HTML for the main place:

<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>

An example of the first similar place:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">

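When my program fetches the page and I use Beautiful Soup to make it readable, the HTML comes out a little different from Firefox's "View Source"... I don't know why.

These were my regular expressions: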

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main)

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main)

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main)

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

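The first two are for the main place and its address. The rest are for the information on the remaining places. After writing these, I decided that I only need the first 5 results for cNames, cAddresses, and cURLs, since I don't need 91 or so.

I don't know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is a bit complicated because all the information I want is in tables, and the table tags are kind of a mess...

How do you get that information and limit it to about the first 5 results?

Thanks.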

score 3 · Accepted Answer

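They tell you not to parse HTML with regular expressions for a reason, but here is a simple reason that applies to your regexes: they depend on the exact markup of the page you are trying to parse, and that markup can change at random. When it does, your regexes will no longer match and your code will stop working.

But the task you are trying to do is really simple: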
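To illustrate the point, here is a minimal sketch, assuming Python 3 and the bs4 package (neither is used in the question). It runs the question's cURLs pattern and a parser-based lookup against the same link written two ways. The example.com URL, the reordered attributes, and the helper names urls_by_regex and urls_by_parser are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# One link in the shape used on the question's page (the URL is a placeholder).
original = ('<td class="" nowrap="nowrap">'
            '<a href="http://example.com/a.jsp" target="_blank">Place A</a></td>')
# The same link after a hypothetical markup change: attributes reordered.
changed = ('<td nowrap="nowrap" class="">'
           '<a target="_blank" href="http://example.com/a.jsp">Place A</a></td>')

# The cURLs pattern from the question.
pattern = r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">'

def urls_by_regex(html):
    """Return URLs matched by the literal-markup pattern."""
    return re.findall(pattern, html)

def urls_by_parser(html):
    """Return the href of every anchor, regardless of attribute order."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]

regex_before = urls_by_regex(original)
regex_after = urls_by_regex(changed)
parser_before = urls_by_parser(original)
parser_after = urls_by_parser(changed)

print(regex_before, regex_after)
print(parser_before, parser_after)
```

The regex finds the URL only while the markup matches character for character, whereas the parser-based lookup returns it in both cases.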

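# BeautifulSoup 3 on Python 2; with the newer bs4 package the import is
# `from bs4 import BeautifulSoup`.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

# Calling the soup like a function is shorthand for soup.findAll('a').
for anchor in soup('a'):
    print anchor.contents, anchor.get('href')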

This yields every anchor tag, no matter where it appears in the deeply nested structure of this page. Here are some lines excerpted from the output of that three-line script:

[u'Stack Exchange'] http://stackexchange.com
[u'msw'] /users/282912/msw
[u'faq'] /faq
[u'Stack Overflow'] /
[u'Questions'] /questions
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
[u'python'] /questions/tagged/python
[u'beautifulsoup'] /questions/tagged/beautifulsoup
[u'Marcus Johnson'] /users/1587751/marcus-johnson
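For readers on Python 3, a rough equivalent using the newer bs4 package might look like the sketch below. The inline HTML is a made-up stand-in for the saved page, and the "html.parser" backend is an assumption, chosen only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the saved page; the anchors sit at different depths.
html = """
<div><p><a href="/faq">faq</a></p>
<ul><li><a href="/questions">Questions</a></li></ul></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (link text, href) for every anchor, however deeply it is nested.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```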

It's hard to imagine less code that could do that much work for you.
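Applied to the markup in the question, the same idea could be sketched as follows. This is not part of the original answer: it assumes Python 3 and bs4, reuses the main-place cells quoted in the question, and fills in the similar places with made-up rows (example.com URLs, "Place N" names) shaped like the one the asker quoted. The helper clean and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Main-place cells as quoted in the question.
main_row = (
    '<tr>\n'
    '<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>\n'
    '<td class="Large Bold" width="100%">80 Riverside Drive, New York, '
    'New York, United States</td>\n'
    '</tr>\n'
)
# Made-up "similar place" rows in the same shape as the question's example.
similar_rows = "".join(
    f'<tr>\n'
    f'<td class="" nowrap="nowrap"><a href="http://example.com/place-{i}.jsp" '
    f'target="_blank">Place {i}</a></td>\n'
    f'<td width="100%">&nbsp;{i} Example St, New York</td>\n'
    f'<td nowrap="nowrap" width="55"></td>\n'
    f'</tr>\n'
    for i in range(1, 8)
)
html = "<table>\n" + main_row + similar_rows + "</table>"

def clean(text):
    """Turn non-breaking spaces (from &nbsp;) into spaces and trim."""
    return text.replace("\xa0", " ").strip()

soup = BeautifulSoup(html, "html.parser")

# Main place: the first two cells whose class attribute is "Large Bold".
bold_cells = soup.find_all("td", class_="Large Bold")
place_name = clean(bold_cells[0].get_text())
place_address = clean(bold_cells[1].get_text())

# Similar places: stop after five anchors; the address is in the next cell.
similar = []
for anchor in soup.find_all("a", target="_blank", limit=5):
    name_cell = anchor.find_parent("td")
    address_cell = name_cell.find_next_sibling("td")
    similar.append((clean(anchor.get_text()),
                    clean(address_cell.get_text()),
                    anchor.get("href")))

print(place_name, "|", place_address)
for name, address, url in similar:
    print(name, "|", address, "|", url)
```

The limit argument to find_all stops the search after five matches, which covers the "first 5 or so" requirement without slicing the results afterwards.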
