python - Extracting text from an HTML file using Python

Question

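I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I need something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.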


score 197 · Accepted Answer

The best piece of code I found for extracting text without getting JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You just have to install BeautifulSoup first:

pip install beautifulsoup4

score 165 · Accepted Answer

html2text is a Python program that does a pretty good job at this.

Answered 2008-11-30T03:23:58.877

score 103 · Accepted Answer

Note: NLTK no longer supports the clean_html function.

The original answer is below, and an alternative is in the comments section.

Using NLTK

I wasted 4 to 5 hours fixing issues with html2text. Luckily, I came across NLTK.
It works like magic.

# Historical example: Python 2 with an old NLTK release that still shipped clean_html
import nltk
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print(raw)

score 55 · Accepted Answer

I found myself facing exactly the same problem today. I wrote a very simple HTML parser that strips the incoming content of all markup and returns the remaining text with only minimal formatting.

from html.parser import HTMLParser  # Python 3 location (was the HTMLParser module in Python 2)
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
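
The parser above keeps the contents of script and style elements, which the question wanted to drop. Below is a minimal sketch of one way to skip them with the standard library's html.parser on Python 3. The names TextExtractor and extract_text are made up for this illustration, and the whitespace handling is deliberately crude (for example, text that follows an inline tag gets a space before it), so treat it as a starting point rather than a finished converter.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &#39;
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in ("p", "br"):
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            collapsed = " ".join(data.split())  # squeeze runs of whitespace
            if collapsed:
                self._parts.append(collapsed + " ")

    def text(self):
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html_source):
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p { color: red; }</style></head><body>"
    "<p>It&#39;s a <b>test</b></p>"
    "<script>var hidden = 1;</script>"
    "<p>Second paragraph</p></body></html>"
)
print(extract_text(sample))
```

For the sample string, only the two paragraphs should survive, with the apostrophe entity decoded and the CSS and JavaScript dropped.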

score 8 · Accepted Answer

You can also use the html2text method from the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

score 7 · Accepted Answer

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

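from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
# keep maps each tag to preserve to the list of attributes to retain on it
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)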

score 6 · Accepted Answer

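PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of PyParsing use (example link). One reason to invest a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual, which is also inexpensive.

Having said that, I use BeautifulSoup a lot, and the entity issues are not that hard to deal with: you can convert the entities before you run BeautifulSoup.

Good luck.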
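
The entity conversion mentioned above can be sketched with the standard library alone. This is only an illustration: decode_entities is a made-up helper name, and it assumes Python 3.4 or later, where html.unescape is available. One caveat: decoding before parsing turns &lt; into a literal <, which a parser may then read as markup, so applying this to the text after extraction is often the safer order.

```python
import html


def decode_entities(text):
    """Decode HTML entities and turn non-breaking spaces into plain spaces."""
    decoded = html.unescape(text)
    # U+00A0 is what &nbsp; decodes to; a Notepad-style paste shows a space
    return decoded.replace("\xa0", " ")


raw = "It&#39;s 5 &lt; 10 &amp; fine&nbsp;here"
print(decode_entities(raw))
```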

score 5 · Accepted Answer

If you need more speed and can accept less accuracy, you could use raw lxml.

import lxml.html as lh
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

score 4 · Accepted Answer

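Instead of the HTMLParser module, check out htmllib. It has a similar interface but does more of the work for you. (It is pretty ancient, so it is not much help for getting rid of JavaScript and CSS. You could write a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it is hard to do this reliably for malformed HTML.) Anyway, here is something simple that prints the plain text to the console: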

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter
p = HTMLParser(AbstractFormatter(DumbWriter()))
try: p.feed('hello<br>there'); p.close() #calling close is not usually needed, but let's play it safe
except HTMLParseError: print ':(' #the html is badly malformed (or you found a bug)

score 4 · Accepted Answer

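This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine and will convert source to text with the -dump option.

So you could do something like: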

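import subprocess
import tempfile

# Write the HTML to a temporary file that links can read
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name
proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()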

score 3 · Accepted Answer

Beautiful Soup does convert HTML entities. It is probably your best bet, considering that HTML is often buggy and full of Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

import copy
import re
import BeautifulSoup  # BeautifulSoup 3 API
def getsoup(data, to_unicode=False):
    data = data.replace("&nbsp;", " ")
    # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
        (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES 
                    if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

score 3 · Accepted Answer

Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

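The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...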

score 3 · Accepted Answer

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.

score 2 · Accepted Answer

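The answer from @PeYoTIL, which uses BeautifulSoup and removes the style and script content, didn't work for me. I tried using decompose instead of extract, but it still didn't work. So I wrote my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside the text. It is available at this gist with a test document embedded.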

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (    type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip() ):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

score 1 · Accepted Answer

In a simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>' and replaces each match with an empty string.
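
To make the trade-off of this approach concrete, here is a small sketch that runs the same regular expression over an inline sample instead of a file. The strip_tags wrapper and the sample string are made up for illustration. Note that the tags disappear, but the script body and the undecoded entity stay in the output, which is exactly what the question was trying to avoid.

```python
import re


def strip_tags(html_text):
    # Same pattern as above: remove anything between '<' and the next '>'
    return re.sub(r'<(.*?)>', '', html_text)


sample = '<p>Tom &amp; Jerry</p><script>var x = 1;</script>'
print(strip_tags(sample))
```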

score 1 · Accepted Answer

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract() 
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

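Explanation:

Read in the URL data as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and strip leading and trailing spaces from each, then break multi-headlines into one line each. Then, using text = '\n'.join, drop the blank lines and finally return the result encoded as UTF-8.

Notes:

On some systems this will fail for https:// connections because of SSL issues; you can turn off verification to work around that. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave odd encoding behind; you may want to just return str(text) instead.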
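
The whitespace clean-up described in the explanation does not depend on BeautifulSoup, so it can be tried on its own. The following is a sketch under that assumption: normalize_whitespace is a hypothetical helper name, and the sample string only imitates the kind of text that get_text() tends to return.

```python
def normalize_whitespace(text):
    """Strip each line, split on double spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    # Runs of two or more spaces usually separated distinct headlines
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Title  \n\n\n  First item    Second item  \n   \nLast line\n"
print(normalize_whitespace(sample))
```

Splitting on two spaces is a heuristic: it separates headlines that were laid out side by side, but it will also break a sentence that happens to contain a double space.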

score 1 · Accepted Answer

Many people have mentioned using regular expressions to strip HTML tags, but that approach has a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

Should be parsed to:

hello world
I love you

Here is a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

import re
import html
def html2text(htm):
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    ret = re.sub(r"\s", " ", ret, flags = re.MULTILINE)
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags = re.IGNORECASE)  # raw string so \d is not an invalid escape
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    ret = re.sub(r"  +", " ", ret)
    return ret

score -1 · Accepted Answer

I do it something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text
