python - Extracting readable text from HTML using Python?

Question

I know about utilities like html2text and BeautifulSoup, but the problem is that they also extract the JavaScript and add it to the text, which makes the two hard to separate.

htmlDom = BeautifulSoup(webPage)

htmlDom.findAll(text=True)

or

from stripogram import html2text
extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is not what I want.

I just want to extract the readable text that you could copy from the browser.

score 6 · Accepted Answer

If you want to avoid extracting the contents of script tags with BeautifulSoup,

nonscripttags = htmlDom.findAll(lambda t: t.name != 'script', recursive=False)

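will get you the immediate children of the root that are non-script tags (and a separate htmlDom.findAll(recursive=False, text=True) will get the strings that are immediate children of the root). You need to do this recursively, e.g. as a generator: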

def nonScript(tag):
    return tag.name != 'script'

def getStrings(root):
   for s in root.childGenerator():
     if hasattr(s, 'name'):    # then it's a tag
       if s.name == 'script':  # skip it!
         continue
       for x in getStrings(s): yield x
     else:                     # it's a string!
       yield s

I'm using childGenerator (instead of findAll) so that I get all the children in order and can do my own filtering.
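For readers on Python 3 with the bs4 package, the same recursive walk might look like the sketch below. It assumes `.children` in place of `childGenerator()` and uses `isinstance` checks rather than `hasattr(s, 'name')`. It also skips HTML comments, which the generator above does not do. The name `get_strings` and the sample HTML are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag


def get_strings(root):
    """Yield text nodes under root in document order, skipping <script> subtrees."""
    for child in root.children:
        if isinstance(child, Tag):
            if child.name == "script":
                continue  # skip the whole script subtree
            yield from get_strings(child)
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            # Assumption: comments are not "readable text", so they are dropped too.
            yield str(child)


# Illustrative input, not taken from the question.
html = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script><p>Bye</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
strings = list(get_strings(soup))
print(strings)
```

Other non-visible elements such as `style` could be skipped in the same way by widening the name check.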

score 0 · Accepted Answer

Using BeautifulSoup, something along these lines:

def _extract_text(t):
    if not t:
        return ""
    if isinstance(t, (unicode, str)):
        return " ".join(filter(None, t.replace("\n", " ").split(" ")))
    if t.name.lower() == "br": return "\n"
    if t.name.lower() == "script": return "\n"
    return "".join(extract_text(c) for c in t)
def extract_text(t):
    return '\n'.join(x.strip() for x in _extract_text(t).split('\n'))
print extract_text(htmlDom)
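The same idea (drop script bodies, turn `br` into a line break, collapse runs of whitespace) can also be sketched without BeautifulSoup, using only Python 3's standard `html.parser`. This is a different parser from the one in the answer above, not a port of it. The class name `TextExtractor`, the helper `extract_text`, the decision to skip `style` as well, and the sample HTML are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

SKIPPED = {"script", "style"}  # assumption: style bodies are unwanted too


class TextExtractor(HTMLParser):
    """Collect text chunks, ignoring anything inside SKIPPED elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")  # a <br> becomes a real line break

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            # Newlines in the source are just whitespace, not line breaks.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    joined = "".join(parser.parts)
    # Collapse leftover whitespace on each output line.
    return "\n".join(" ".join(line.split()) for line in joined.split("\n"))


sample = "<p>Hello   <b>world</b><br>second\nline<script>var x = 1;</script><br/>Bye</p>"
print(extract_text(sample))
```

Block-level tags such as `p` or `div` do not produce line breaks in this sketch; only `br` does.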

score 0 · Accepted Answer

You can remove the script tags in Beautiful Soup with something like:

for script in soup("script"):
    script.extract()

See "Removing elements" in the Beautiful Soup documentation.
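As an end-to-end sketch of this approach, assuming Python 3 and the bs4 package: parse the page, remove every script element, then ask for the remaining text. The sample HTML and the `separator`/`strip` choices passed to `get_text` are illustrative, not part of the answer.

```python
from bs4 import BeautifulSoup

# Illustrative input, not taken from the question.
html = (
    "<html><head><script>var tracker = 1;</script></head>"
    "<body><p>Readable text.</p><script>alert('hi');</script></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# soup("script") is shorthand for soup.find_all("script").
for script in soup("script"):
    script.extract()  # detach the element from the tree

# Join the remaining strings, one per line, with surrounding whitespace stripped.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

Because `find_all` returns a list built up front, removing elements inside the loop does not disturb the iteration.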

score 0 · Accepted Answer

Give these a try:

http://code.google.com/p/boilerpipe/

http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
