python - How do I scrape the text within div(id="BodyContent") of a Wikipedia article? I am using Python's BeautifulSoup and nltk

Question

page=nltk.clean_html(soup.findAll('div',id="bodyContent"))

When I try to run this code, I get the following:

Traceback (most recent call last):
  File "C:\Python27\wiki3.py", line 36, in <module>
    page=nltk.clean_html(soup.findAll('div',id="bodyContent"))
  File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\util.py", line 340, in clean_html
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
AttributeError: 'ResultSet' object has no attribute 'strip'

score 1 · Accepted Answer

You are passing clean_html an iterable of BeautifulSoup objects (which is what findAll returns) rather than a string (which is what clean_html needs).
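
The failure can be reproduced without either library. The sketch below is only an illustration written in Python 3: clean_html_sketch is a hypothetical, simplified stand-in for nltk.clean_html (it reuses the regular expression shown in the traceback and then drops the remaining tags), and a plain list of strings stands in for the ResultSet, which behaves like a list.

```python
import re


def clean_html_sketch(html):
    # Hypothetical stand-in for nltk.clean_html: like the line shown in the
    # traceback, it calls .strip() on its argument, so it needs a string.
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Simplification: replace every remaining tag with a space.
    return re.sub(r"(?s)<.*?>", " ", cleaned).strip()


# Stand-in for the ResultSet returned by findAll: a list of markup fragments.
results = ['<div id="bodyContent"><p>Hello <b>world</b></p></div>']

try:
    clean_html_sketch(results)  # passing the whole list fails
except AttributeError as exc:
    error_message = str(exc)  # the message names the missing 'strip' attribute

# Converting each element to a string first works.
cleaned = [clean_html_sketch(str(r)) for r in results]
```

The list itself has no strip method, which is the same complaint the traceback makes about ResultSet; each element has to be turned into a string before it is cleaned.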

Assuming you want a list of strings, one cleaned-up string per div, do:

page = [nltk.clean_html(str(d)) for d in soup.findAll('div',id="bodyContent")]

Or:

page = map(lambda d: nltk.clean_html(str(d)), soup.findAll('div',id="bodyContent"))
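
For readers on current versions: as far as I know, NLTK 3 and later no longer implement clean_html (calling it raises NotImplementedError and points to BeautifulSoup's get_text instead). A minimal sketch of the same idea with the bs4 package on Python 3 follows; the sample markup is made up for illustration and stands in for a downloaded article.

```python
from bs4 import BeautifulSoup

# Assumed sample markup standing in for a downloaded Wikipedia page.
html = """
<html><body>
<div id="bodyContent"><p>Python is a <b>programming</b> language.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, so get_text is called on each Tag in it,
# not on the ResultSet itself.
page = [
    div.get_text(" ", strip=True)
    for div in soup.find_all("div", id="bodyContent")
]
```

Here page is a list with one plain-text string per matching div. Since an id should be unique within a page, soup.find("div", id="bodyContent") followed by a single get_text call is a reasonable alternative when only one div is expected.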
