python - Python strategies for extracting text from malformed HTML pages

Question

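I'm trying to extract text from arbitrary html pages. Some of the pages (which I have no control over) contain malformed html or scripts, which makes this difficult. I'm also on a shared hosting environment, so I can install any python lib, but I can't install whatever I want on the server itself.

pyparsing and html2text.py also don't seem to work on malformed html pages.

An example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html

My current implementation is roughly the following: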

# Try using BeautifulSoup 3.0.7a
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(s)  # s holds the raw html source
# Strip html comments
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
# Strip <script> elements
for script in soup.findAll('script'):
    script.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter the html by finding the 1st instance of "<body" and replacing everything prior to that with "<html><head></head>"
# then try BeautifulSoup again with the new html
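
The fallback described in the last three comment lines could be sketched roughly as follows. This is only an illustration of that step, not the asker's actual code; the function name strip_before_body is hypothetical, and the case-insensitive search for "<body" is an assumption.

```python
import re

def strip_before_body(html):
    """Replace everything before the first "<body" with a minimal head.

    Hypothetical helper: returns the input unchanged when no "<body" is found.
    """
    match = re.search(r"<body", html, re.IGNORECASE)
    if match is None:
        return html  # nothing to anchor on; leave the input as it is
    return "<html><head></head>" + html[match.start():]
```

The result can then be handed to BeautifulSoup for a second attempt.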

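If BeautifulSoup still doesn't work, I fall back on a heuristic: I look at the first and last character of each line (to see whether it looks like a line of code, e.g. # < ;), then take a sample of the line and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.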
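
A minimal sketch of that kind of line heuristic is shown below. Several details are assumptions rather than anything stated in the question: the function name looks_like_code, treating # and < as suspicious first characters and ; as a suspicious last character, the 0.6 threshold, the sample size, and approximating "English word" with a purely alphabetic token instead of a dictionary lookup.

```python
import re

CODE_FIRST_CHARS = "#<"   # assumed: suspicious first characters
CODE_LAST_CHARS = ";"     # assumed: suspicious last characters
# A token counts as "word or number" if it is alphabetic (optionally with an
# apostrophe and trailing punctuation) or a plain number. No dictionary is used.
WORDLIKE_RE = re.compile(
    r"^[A-Za-z]+(?:'[A-Za-z]+)?[.,!?:]?$|^\d+(?:[.,]\d+)*[.,]?$"
)

def looks_like_code(line, min_word_ratio=0.6, sample_size=20):
    """Guess whether a line is code rather than prose (rough heuristic)."""
    stripped = line.strip()
    if not stripped:
        return False
    if stripped[0] in CODE_FIRST_CHARS or stripped[-1] in CODE_LAST_CHARS:
        return True
    # Only inspect a sample of the tokens, as the question describes.
    tokens = stripped.split()[:sample_size]
    wordlike = sum(1 for token in tokens if WORDLIKE_RE.match(token))
    # Too few word-like tokens: treat the line as code.
    return wordlike < min_word_ratio * len(tokens)
```

For instance, a sentence from a news story passes as text, while something like `var x = foo(bar);` is flagged because of its trailing semicolon. The threshold would need tuning against real pages.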

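I could use machine learning to inspect each line, but that seems a bit expensive, and I would probably have to train it (since I don't know much about unsupervised learning) and, of course, write it as well.

Any advice, tools, or strategies would be most welcome. I also realize that the last part is rather messy: if a line is judged to contain code, I currently throw away the entire line, even if it contains a small amount of actual English text.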

score 5 · Accepted Answer

Try not to laugh, but:

from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        # Pipe the UTF-8 encoded html to lynx and read back its plain-text dump
        proc = Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE)
        stdout, _ = proc.communicate(input=unicode_html_source.encode('utf-8'))
        return stdout.decode('utf-8')

I hope you've got lynx!
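
Since lynx may not be available on a shared host, here is a hedged Python 3 sketch that checks for the binary first and returns None when it is missing. It reuses the same lynx flags as the class above; the function name lynx_dump and the None-on-missing behaviour are assumptions made for illustration.

```python
import shutil
import subprocess

def lynx_dump(html, lynx="lynx"):
    """Return lynx's plain-text rendering of html, or None if lynx is absent.

    Hypothetical helper; `lynx` may be a command name or a full path.
    """
    lynx_path = shutil.which(lynx)
    if lynx_path is None:
        return None  # lynx is not installed or not on PATH
    result = subprocess.run(
        [lynx_path,
         "-assume-charset=UTF-8",
         "-display-charset=UTF-8",
         "-dump",
         "-stdin"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

A caller can fall back to another extraction strategy whenever the function returns None.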

score 0 · Accepted Answer

Well, it depends on how good the solution has to be. I had a similar problem: importing hundreds of old html pages into a new website. I basically did

from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = u"<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)

which worked out fine, but of course the documents could be so bad that even BS can't salvage much.

score 0 · Accepted Answer

BeautifulSoup doesn't cope well with malformed HTML. How about some regular expressions?

>>> import re
>>> 
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>> 
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'

You can then build a list of valid tags from which you want to extract information.
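
One way the tag-list idea might look is sketched below. The tag list, the function name extract_tag_text, and the choice to ignore attributes are all assumptions for illustration; like any regex approach, it does not handle nested occurrences of the same tag.

```python
import re

# Hypothetical whitelist of tags whose contents are worth keeping.
VALID_TAGS = ["p", "h1", "h2", "li"]

def extract_tag_text(html, tags=VALID_TAGS):
    """Return the inner text of every whitelisted tag found in html."""
    alternation = "|".join(re.escape(tag) for tag in tags)
    # \b stops "p" from matching "<pre>"; \1 requires the same closing tag.
    pattern = re.compile(
        r"<(%s)\b[^>]*>(.*?)</\1\s*>" % alternation,
        re.DOTALL | re.IGNORECASE,
    )
    return [match.group(2).strip() for match in pattern.finditer(html)]
```

Content inside tags that are not on the list, such as script blocks, is simply never matched.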
