python - Iterating over HTML "class" attributes in Python?

Question

I have an HTML string from a website. Here is part of what it contains:

<p class="news-body">
<a href="/ci/content/player/45568.html" target="new">Paul Harris,</a> the South African spinner, is to retire at the end of the season, bringing to an end a 14-year first-class career.
</p>
<p class="news-body">
 Harris played 37 Tests for South Africa with his slow-left arm but nearly turned his back on international cricket after a stint as a Kolpak with Warwickshire in 2006. The retirement of Nicky Boje prompted Harris' eventual call-up and he went on to take 103 wickets at 37.87.
</p>
<p class="news-body">
His last Test was in Cape Town against India in January 2011 after which he was dropped for legspinner Imran Tahir. As recently as the start of this season he indicated his intention to compete for a Test place once again.
</p>  </div>
   <!-- body area ends here  -->

I want to extract ALL of the text above that sits inside the <p class="news-body"> elements.

I used BeautifulSoup:

from BeautifulSoup import BeautifulSoup
html = #the HTML code you've written above
parsed_html = BeautifulSoup(html)
print parsed_html.body.find('p', attrs={'class':'news-body'}).text

Unfortunately, this returns only the first paragraph:

Paul Harris,the South African spinner, is to retire at the end of the season, bringing to an end a 14-year first-class career.

I want it to return all of the text.

score 1 · Accepted Answer

find returns only the first matching element. You need findAll, which returns a list of all matching elements.

You can then join the text like this:

text = '\n'.join(element.text for element in soup.findAll('p', ...))

I would also recommend upgrading to the latest version of BeautifulSoup.
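
For reference, here is a minimal sketch of how the same extraction might look after upgrading, assuming BeautifulSoup 4 (the bs4 package) is installed. The shortened HTML string is only a stand-in for the markup in the question, and the variable names are illustrative. In BeautifulSoup 4 the method is spelled find_all, and the class_ keyword filters on the class attribute.

```python
from bs4 import BeautifulSoup  # BeautifulSoup 4 (assumed installed: pip install beautifulsoup4)

# Shortened stand-in for the HTML in the question (illustrative sample data).
html = """
<div>
<p class="news-body"><a href="/ci/content/player/45568.html" target="new">Paul Harris,</a> the South African spinner, is to retire.</p>
<p class="news-body">Harris played 37 Tests for South Africa.</p>
<p class="other">Not part of the article body.</p>
</div>
"""

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching element, not just the first one.
paragraphs = soup.find_all("p", class_="news-body")

# get_text flattens nested tags such as <a>; strip=True trims each text piece
# before the pieces are joined with a single space.
text = "\n".join(p.get_text(" ", strip=True) for p in paragraphs)
print(text)
```

The older findAll spelling still works in BeautifulSoup 4 as an alias, so existing code should keep running after the upgrade.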
