python - A reliable way to avoid run-on words when using text_content() on an HTML element

Question

I am parsing web pages. One goal is to find every word and its frequency. I am using lxml:

from lxml import html

my_string = open(some_file_path).read()

tree = html.fromstring(my_string)

text_no_markup = tree.text_content()

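In the result I see things like a_wordconcatenated_to_another

when I expected a_word concatenated_to_another

On closer inspection, this seems to happen when a_word is followed by some kind of closing tag and then more HTML markup, and concatenated_to_another is itself wrapped in markup, with no space or newline in between.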

The only fix I have found so far is

my_modified_string = open(some_file_path).read().replace('>','> ')

that is, replacing every > character with a > followed by a space.

Is there a more robust way to accomplish this?

score 2 · Accepted Answer

Use itertext():

>>> from lxml import html
>>>
>>> my_string = '''
... <div>
...     <b>hello</b>world
... </div>
... '''
>>>
>>> root = html.fromstring(my_string)
>>> print(root.text_content())

    helloworld

>>> for text in root.itertext():
...     text = text.strip()
...     if text:  # skip empty or whitespace-only strings
...         print(text)
...
hello
world
>>> print(' '.join(root.itertext()))

     hello world
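
Since the stated goal is word frequencies, here is a minimal sketch of how the joined fragments could feed a counter. The helper name word_frequencies and the \w+ tokenizer are my own assumptions, not part of lxml; the function takes any iterable of strings, such as the fragments itertext() yields.

```python
import re
from collections import Counter

# Hypothetical helper, not part of lxml: counts words in text fragments
# such as those yielded by an element's itertext().
WORD_RE = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def word_frequencies(fragments):
    """Count lowercase words across an iterable of text fragments."""
    # Joining with a space keeps adjacent fragments from fusing into
    # one run-on word; the extra whitespace is ignored by the regex.
    text = " ".join(fragments)
    return Counter(word.lower() for word in WORD_RE.findall(text))


# The three fragments itertext() yields for the <div> example above.
counts = word_frequencies(["\n    ", "hello", "world\n"])
print(counts["hello"], counts["world"], counts["helloworld"])  # prints: 1 1 0
```

With a parsed document this would be called as word_frequencies(tree.itertext()). One caveat: joining with a space also splits a word when inline markup sits inside it, so <b>H</b>ello would be counted as "h" and "ello".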
