python - How can I extract specific information from a string in Python?

Question

I am trying to extract specific information from HTML code using Python. For example:

<a href="#tips">Visit the Useful Tips Section</a> 
and I would like to get result : Visit the Useful Tips Section

<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />
and I would like to get Menu HTML CSS

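In other words, I want to get everything between <> and <>. I am trying to write a Python function that takes the HTML code as a string and extracts the information from it. I am stuck at string.split('<').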

score 3 · Accepted Answer

You should use a proper HTML parsing library, such as the HTMLParser module (html.parser in Python 3).
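As a minimal sketch of that suggestion, assuming Python 3 and only the standard library, the parser can be subclassed to collect the text it finds between tags. The names `TextExtractor` and `extract_text` are hypothetical helpers introduced here for illustration; they are not part of the library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside of tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of `html`, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


print(extract_text('<a href="#tips">Visit the Useful Tips Section</a>'))
print(extract_text('<div id="menu"><b>Menu</b><br />HTML<br />CSS<br /></div>'))
```

With the two snippets from the question, this should print `Visit the Useful Tips Section` and `Menu HTML CSS`.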

score 1 · Accepted Answer

import re

string = '<a href="#tips">Visit the Useful Tips Section</a>'
re.findall('<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']
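The pattern above captures the text of a single element. For input with several tags, such as the menu snippet in the question, one possible variation is to capture every run of text that sits between a `>` and the next `<`. This is only a sketch: `text_between_tags` is a hypothetical name, and a regular expression like this will not cope with comments, scripts, or `<` and `>` characters inside attribute values.

```python
import re


def text_between_tags(html):
    """Return the non-empty text runs found between tags (rough sketch)."""
    # [^<>]+ matches one or more characters that are not angle brackets,
    # so each match is the text between one tag's '>' and the next tag's '<'.
    pieces = re.findall(r'>([^<>]+)<', html)
    # Drop runs that are only whitespace (for example, bare newlines).
    return [p.strip() for p in pieces if p.strip()]


menu = '<div id="menu">\n<b>Menu</b><br />\nHTML<br />\nCSS<br />\n</div>'
print(text_between_tags(menu))
```

For the menu snippet this should give the list `['Menu', 'HTML', 'CSS']`, which can be joined with `' '.join(...)` if a single string is wanted.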

score 1 · Accepted Answer

You can use the lxml.html parser.

>>> import lxml.html as lh
>>> st = ''' load your above html content into a string '''
>>> d = lh.fromstring(st)
>>> d.text_content()

'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'

Or you can do:

>>> for content in d.text_content().split("\n"):
...     if content:
...         print(content)
...
Visit the Useful Tips Section
and I would like to get result : Visit the Useful Tips Section
Menu
HTML
CSS
and I would like to get Menu HTML CSS
>>>

score 0 · Accepted Answer

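As I understand it, you are trying to strip the HTML tags and keep only the text.

You can define a regular expression that represents a tag, and then replace every match with an empty string.

Example: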

import re

def remove_html_tags(data):
    # Non-greedy match of everything from '<' up to the next '>'.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

References:

Example

Python regular expression documentation
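To show the tag-stripping idea on the menu snippet from the question, here is a small sketch. It is a variation rather than the exact function above: it replaces each tag with a space instead of an empty string, so that words separated only by a tag do not run together, and then collapses the whitespace. The names `TAG_RE` and `strip_tags` are hypothetical.

```python
import re

# Matches '<', then anything up to the next '>', then '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(html):
    """Remove tags and normalise whitespace (rough sketch)."""
    # Replace each tag with a space so neighbouring words stay separated.
    text = TAG_RE.sub(' ', html)
    # split() with no arguments splits on any whitespace run, including newlines.
    return ' '.join(text.split())


menu = '''<div id="menu" style="background-color:#FFD700;">
<b>Menu</b><br />
HTML<br />
CSS<br />
</div>'''
print(strip_tags(menu))
```

This should print `Menu HTML CSS`. Like any regex-based approach, it can misfire on comments, scripts, or a `>` inside an attribute value.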

score 0 · Accepted Answer

I would use BeautifulSoup; it rarely chokes on malformed HTML.
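A minimal sketch of that approach, assuming the third-party `beautifulsoup4` package is installed and using Python's built-in `html.parser` backend so that no extra parser is needed. The function name `visible_text` is hypothetical.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of `html` with tags removed (rough sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each text fragment and skips empty ones;
    # separator=" " joins the remaining fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


menu = '''<div id="menu" style="background-color:#FFD700;">
<b>Menu</b><br />
HTML<br />
CSS<br />
</div>'''
print(visible_text('<a href="#tips">Visit the Useful Tips Section</a>'))
print(visible_text(menu))
```

For the two snippets from the question this should print `Visit the Useful Tips Section` and `Menu HTML CSS`.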
