python - Decode HTML entities in a Python string?

Question

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

score 647 · Accepted Answer

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI: html.parser.HTMLParser.unescape is deprecated. It was supposed to be removed in 3.5 but was left in by mistake, and it was finally removed in Python 3.9.
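
As a quick check of what html.unescape() accepts, here is a minimal sketch that feeds it the same character written as a named, a decimal, and a hexadecimal reference. The numeric variants are illustrative inputs, not taken from the question.

```python
import html

# The pound sign written three ways: named, decimal, and hexadecimal reference.
named = html.unescape("&pound;682m")
decimal = html.unescape("&#163;682m")
hexadecimal = html.unescape("&#xa3;682m")

# All three decode to the same string.
print(named, decimal, hexadecimal)
```

So the same call covers entities that a scraper emits in any of these forms, with no lookup table of your own.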

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
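
If one code base has to run on both old and new interpreters, the two approaches can be folded into a single import shim. This is only a sketch: the fallback order is my assumption, and it binds the name unescape to whichever implementation the running interpreter provides.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Older interpreters only offer the method on a parser instance.
    unescape = HTMLParser().unescape

decoded = unescape("&pound;682m")
print(decoded)
```

Trying html.unescape first matters because the HTMLParser method no longer exists on recent Python 3 releases.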

score 71 · Accepted Answer

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the "Entity Conversion" section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>
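
For comparison, the standard library parser in Python 3 also decodes entities while parsing, so the same text can be pulled out without Beautiful Soup. Note this swaps in html.parser for Beautiful Soup; it is a minimal sketch, and the TextCollector class name is hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes; character references arrive already decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text data
        # before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<p>&pound;682m</p>")
collector.close()  # flush any buffered text

text = "".join(collector.chunks)
print(text)
```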

score 6 · Accepted Answer

Beautiful Soup 4 lets you set a formatter on your output.

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples:

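from bs4 import BeautifulSoup

# Assumed input; the <html>/<body> wrapper in the output below implies the lxml parser.
soup = BeautifulSoup("<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>", "lxml")
print(soup.prettify(formatter=None))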
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
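
To see why that output is invalid, it helps to look at the escaping step that formatter=None skips. The sketch below uses html.escape() from the standard library as a rough stand-in for that step. It is not Beautiful Soup's own formatter, just the same idea applied to the two strings above.

```python
import html

# Text content: the angle brackets must be escaped to be valid inside <p>.
text = "Il a dit <<Sacré bleu!>>"
safe_text = html.escape(text, quote=False)
print(safe_text)

# Attribute value: the bare ampersand must become &amp; inside href="...".
url = "http://example.com/?foo=val1&bar=val2"
safe_url = html.escape(url, quote=True)
print(safe_url)
```

So if you decode entities for processing and later write the strings back into markup, re-escape them at that point rather than emitting them raw.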

score -5 · Accepted Answer

This probably isn't relevant here, but to eliminate these HTML entities from an entire document, you can do something like this (assume the document is in page, and please forgive the sloppy code):

import re
import HTMLParser  # Python 2 module name

h = HTMLParser.HTMLParser()  # one parser instance is enough for the whole loop
regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all HTML entities in page
for e in list_of_html:
    unescaped = h.unescape(e)  # finds the unescaped value of the HTML entity
    page = page.replace(e, unescaped)  # replaces HTML entity with unescaped value
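
On Python 3.4+ the loop above isn't needed, because html.unescape() accepts the whole document in one call. A minimal sketch, with a made-up page string standing in for the real document:

```python
import html

# Hypothetical sample document standing in for `page`.
page = "<p>&pound;682m &amp; &euro;10, &amp;lt;b&amp;gt; stays escaped</p>"

# One call decodes every reference in the document, each exactly once:
# "&amp;lt;" becomes "&lt;", not "<".
decoded = html.unescape(page)
print(decoded)
```

Keep in mind that decoding a full document also turns escaped markup such as &lt; in text into a literal <, so it is usually safer to decode only the text you have extracted.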
