python - Parsing the text of an HTML document using Python

Question

I have something like this: <td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>, and I need to get the text out of it using Python.

How can I do that? I'm completely new to this kind of thing.

score 2 · Accepted Answer

I personally love BeautifulSoup.

answered 2012-12-27T15:14:53.557

score 0 · Accepted Answer

Python has a built-in HTML parser module:

http://docs.python.org/2/library/htmlparser.html

However, I'd recommend Beautiful Soup (don't be put off by the prehistoric-looking homepage; it's a really great library).

Alternatively, you could try lxml, which is also very good.

score 0 · Accepted Answer

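Solution using the Python XML parser (foo holds the snippet from the question)

>>> foo = "<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>"
>>> from xml.dom.minidom import parseString
>>> parseString(foo).getElementsByTagName("td")[0].firstChild.nodeValue
u'text I need to get'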
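One caveat with the minidom approach is that it only accepts well-formed XML, and real-world HTML often is not. Below is a minimal Python 3 sketch of the same idea with that failure handled; the helper name `first_td_text` is made up for illustration, and it only covers the simple case where the text is the first child of the cell.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def first_td_text(markup):
    """Return the text of the first <td>, or None if it cannot be read.

    Hypothetical helper: returns None when the markup is not well-formed
    XML, when there is no <td>, or when the <td> has no children.
    """
    try:
        doc = parseString(markup)
    except ExpatError:
        # minidom is a strict XML parser; sloppy HTML ends up here.
        return None
    cells = doc.getElementsByTagName("td")
    if not cells or cells[0].firstChild is None:
        return None
    # nodeValue is the string for a text node, None for an element node.
    return cells[0].firstChild.nodeValue
```

With the snippet from the question this returns the cell text, while something like `<td>unclosed<br></td>` returns None because the `<br>` is never closed.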

Solution using BeautifulSoup

>>> import BeautifulSoup
>>> BeautifulSoup.BeautifulSoup(foo).getText()
u'text I need to get'

Solution using HTMLParser

>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_data(self, data):
...         print data
...
>>> MyHTMLParser().feed(foo)
text I need to get
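The session above is Python 2 (the `HTMLParser` module and the `print` statement). On Python 3 the same class lives in `html.parser`. Here is a rough sketch that also collects the text per `<td>` instead of printing everything; the names `CellTextParser` and `td_texts` are my own, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text found inside each <td> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Only keep text that appears while a <td> is open.
        if self._in_td:
            self.cells[-1] += data


def td_texts(html):
    """Return a list with the text content of every <td> in html."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Calling `td_texts(foo)` on the snippet from the question gives a one-element list containing the cell text. Text inside inline tags within a cell (for example `<b>`) is appended to the same entry.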

Solution using regular expressions

>>> import re
>>> re.findall("<.*?>(.*)<.*?>",foo)[0]
'text I need to get'
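The pattern above works for this single cell, but the greedy `(.*)` swallows everything between the first and the last tag, so it breaks as soon as there are two cells on one line. A slightly more targeted sketch, still fragile compared with a real parser (for instance, a `>` inside an attribute value would confuse it); `TD_PATTERN` and `td_contents` are names I made up:

```python
import re

# Non-greedy match of whatever sits between <td ...> and the next </td>.
# DOTALL lets a cell span several lines; IGNORECASE accepts <TD> as well.
TD_PATTERN = re.compile(r"<td\b[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)


def td_contents(html):
    """Return the raw inner content of every <td> ... </td> pair in html."""
    return TD_PATTERN.findall(html)
```

For `"<td>a</td><td>b</td>"` the original pattern returns `['a</td><td>b']`, whereas `td_contents` returns `['a', 'b']`. Any markup inside a cell is returned as-is, not stripped.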

score 0 · Accepted Answer

Try this:

 >>> html='''<td width='370' style='border-left: 1px solid #fff;'>text I need to get</td>'''
 >>> from BeautifulSoup import BeautifulSoup
 >>> ''.join(BeautifulSoup(html).findAll(text=True))
 u'text I need to get'

This solution uses BeautifulSoup. If BeautifulSoup is not installed on your system, you can install it like this: sudo pip install BeautifulSoup
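Note that the `BeautifulSoup` package imported above is the old version 3 line, which targets Python 2. On Python 3 the maintained package is `beautifulsoup4` (`pip install beautifulsoup4`), imported as `bs4`. A minimal sketch of the equivalent, assuming that package is installed; `cell_texts` is a hypothetical helper name:

```python
def cell_texts(html):
    """Return the stripped text of every <td> in html, using Beautiful Soup 4."""
    # Imported inside the function so that merely defining it does not
    # require bs4; calling it raises ImportError if the package is missing.
    from bs4 import BeautifulSoup

    # "html.parser" selects the parser from the standard library,
    # so no extra parser package (such as lxml) is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]
```

Called on the snippet from the question, this returns a list with the single string `text I need to get`.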
