python - How do I strip out the text within HTML tags in Python?

Question

Possible duplicate:
Strip HTML from strings in Python

While building a small browser-like application, I ran into the problem of splitting text apart from the various tags. Consider the string

<html> <h1> good morning </h1> welcome </html>

I need the following output: ['good morning', 'welcome']

How can I do that in Python?

score 3 · Accepted Answer

I would use xml.etree.ElementTree:

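import xml.etree.ElementTree as ET

def get_text(etree):
    # Yield the text inside each direct child, then the text that follows it (its tail).
    # Surrounding whitespace is kept; call .strip() on each piece to remove it.
    for child in etree:
        if child.text:
            yield child.text
        if child.tail:
            yield child.tail

root = ET.fromstring('<html> <h1> good morning </h1> welcome </html>')
print(list(get_text(root)))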
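
The generator above only visits the direct children of the root and keeps the surrounding whitespace. Here is a minimal sketch of a variant, assuming the input is well-formed XML, that uses `Element.itertext()` to walk every nested element and strips each piece. The helper name `extract_text` is made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_text(markup):
    """Return the non-empty, whitespace-stripped text pieces of well-formed markup."""
    root = ET.fromstring(markup)
    # itertext() yields the text and tail strings of all descendants in document order
    pieces = (piece.strip() for piece in root.itertext())
    # Drop pieces that were whitespace only
    return [piece for piece in pieces if piece]

print(extract_text('<html> <h1> good morning </h1> welcome </html>'))
# ['good morning', 'welcome']
```

Note that `ET.fromstring` raises `ParseError` on markup that is not well-formed (for example an unclosed `<br>`), so this only suits input that parses as XML.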

score 1 · Accepted Answer

You can use one of Python's HTML/XML parsers.

Beautiful Soup is popular. lxml is also popular.

Both of those are third-party packages. You can also use the standard library:

http://docs.python.org/library/xml.etree.elementtree.html
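
As a rough sketch of the standard-library route: the link above points to ElementTree, but the standard library also ships `html.parser`, which does not require well-formed markup. The example below uses that module instead, so treat it as one possible approach rather than what this answer prescribes. The class name `TextCollector` is hypothetical.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)

collector = TextCollector()
collector.feed('<html> <h1> good morning </h1> welcome </html>')
collector.close()
print(collector.chunks)
# ['good morning', 'welcome']
```

Be aware that `handle_data` also receives the contents of `<script>` and `<style>` elements, so a real page would need extra filtering.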

score 0 · Accepted Answer

Use the Beautiful Soup Python library to achieve your goal. With its help it takes just a few lines:

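from bs4 import BeautifulSoup
# Name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup('<html> <h1> good morning </h1> welcome </html>', 'html.parser')
# stripped_strings yields each text node with surrounding whitespace removed
print(list(soup.stripped_strings))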
