python - Filtering results with findAll in BeautifulSoup

Question

import urllib2
from BeautifulSoup import BeautifulSoup

result = urllib2.urlopen("http://www.bbc.co.uk/news/uk-scotland-south-scotland-12380537")
html=result.read()
soup= BeautifulSoup(html)
print soup.html.head.title

print soup.findAll('div', attrs={ "class" : "story-body"})

The problem is that the information I want is in the story body, but at the very bottom, so I get a huge amount of junk before I reach it.

print soup.findAll('p', attrs={ 'class' : "introduction"})

In this example, there are 8 more <p> tags that I need to collect.

I would like to collect everything from the start of the introduction to the end of the story body... any ideas?

score 1 · Accepted Answer

In terms of CSS selectors, you need to select all p elements inside .story-body.

print soup.select('.story-body p')

http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html?highlight=select#css-selectors
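
As a rough illustration of the selector above, here is a minimal sketch. It assumes Python 3 and the bs4 package (select is part of the BeautifulSoup 4 API, so the `from BeautifulSoup import BeautifulSoup` import in the question would need to become `from bs4 import BeautifulSoup`). The sample markup, including the byline paragraph, is hypothetical and only stands in for the real page. The second part shows one possible way to start at the introduction paragraph instead, using find_next_siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real news page.
SAMPLE_HTML = """
<html><head><title>Sample story</title></head>
<body>
<div class="story-body">
  <div class="caption">Image caption</div>
  <p class="byline">By a reporter</p>
  <p class="introduction">Intro paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>
<p>Footer text outside the story body.</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Descendant selector: every <p> nested anywhere inside class "story-body".
paragraphs = soup.select(".story-body p")
texts = [p.get_text(strip=True) for p in paragraphs]
print(texts)

# Alternative: begin at the introduction and collect the <p> siblings after it,
# which skips any <p> that precedes the introduction inside the story body.
intro = soup.select_one(".story-body p.introduction")
from_intro = [intro] + intro.find_next_siblings("p")
from_intro_texts = [p.get_text(strip=True) for p in from_intro]
print(from_intro_texts)
```

Note that find_next_siblings only returns tags at the same nesting level as the introduction, so this variant assumes the remaining paragraphs are direct siblings of it.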
