python - How can I replace or remove HTML entities like "&nbsp;" using BeautifulSoup 4?

Question

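I am processing HTML with Python and the BeautifulSoup 4 library, and I can't find an obvious way to replace &nbsp; with a space. Instead, it seems to be converted to a Unicode non-breaking space character.

Am I missing something obvious? What is the best way to replace &nbsp; with a normal space using BeautifulSoup?

Edit to add that I am using the latest version, BeautifulSoup 4, so the convertEntities=BeautifulSoup.HTML_ENTITIES option from BeautifulSoup 3 is not available.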

score 29 · Accepted Answer

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div>a&nbsp;b</div>')
>>> soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
u'<html>\n <body>\n  <div>\n   a b\n  </div>\n </body>\n</html>'

score 17 · Accepted Answer

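See Entities in the documentation. BeautifulSoup 4 produces proper Unicode for all entities:

An incoming HTML or XML entity is always converted into the corresponding Unicode character.

So yes, &nbsp; is turned into a non-breaking space character (U+00A0). If you really want those to be ordinary space characters instead, you will have to do a Unicode replacement yourself.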
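As a rough illustration of that conversion, here is a minimal sketch that uses html.unescape from the standard library as a stand-in for the parser's entity handling (that substitution is my assumption, as are the sample strings). It shows that the named and numeric spellings all end up as U+00A0, and that a plain replace then turns them into ordinary spaces.

```python
from html import unescape

# Three spellings of the same entity (sample inputs, not from the question).
variants = ["a&nbsp;b", "a&#160;b", "a&#xA0;b"]

# unescape() is used here only as a stand-in for what the parser does to entities.
decoded = [unescape(v) for v in variants]

# Every spelling decodes to U+00A0, which is not equal to an ASCII space.
assert all(d == "a\xa0b" for d in decoded)
assert "\xa0" != " "

# The "Unicode replacement" the answer refers to.
cleaned = [d.replace("\xa0", " ") for d in decoded]
print(cleaned)
```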

score 13 · Accepted Answer

Just replace the Unicode non-breaking space with a normal space.

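# `soup` here is a plain string (for example the raw HTML or extracted text), not a BeautifulSoup object
nonBreakSpace = u'\xa0'
soup = soup.replace(nonBreakSpace, ' ')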

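The advantage is that, even though you are using BeautifulSoup, you don't need it for this step: the replacement is an ordinary string operation.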
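If the text may contain other non-breaking space variants as well, the same idea can be extended with str.translate. This is only a sketch: the helper name to_plain_spaces and the choice of which characters to include in the table are my assumptions, not something from the question.

```python
# Hypothetical mapping: which characters to treat as "non-breaking spaces" is an assumption.
NBSP_TABLE = str.maketrans({
    "\u00a0": " ",  # NO-BREAK SPACE (what &nbsp; decodes to)
    "\u202f": " ",  # NARROW NO-BREAK SPACE
    "\u2007": " ",  # FIGURE SPACE
})


def to_plain_spaces(text: str) -> str:
    """Return text with the characters listed in NBSP_TABLE replaced by ASCII spaces."""
    return text.translate(NBSP_TABLE)


print(to_plain_spaces("03\u00a0Nov\u202f19"))
```

For the single U+00A0 case in the question, the one-line replace above is enough; the table only pays off when several characters need the same treatment.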

score 3 · Accepted Answer

I had a problem with JSON output that soup.prettify() did not fix, so I used unicodedata.normalize() instead, which worked:

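import json
import unicodedata
from bs4 import BeautifulSoup

# r is assumed to be an HTTP response fetched earlier (e.g. with requests.get)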
soup = BeautifulSoup(r.text, 'html.parser')
dat = soup.find('span', attrs={'class': 'date'})
print(f"date prints fine:'{dat.text}'")
print(f"json:{json.dumps(dat.text)}")
mydate = unicodedata.normalize("NFKD",dat.text)
print(f"json after normalizing:'{json.dumps(mydate)}'")

date prints fine:'03 Nov 19 17:51'
json:"03\u00a0Nov\u00a019\u00a017:51"
json after normalizing:'"03 Nov 19 17:51"'
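The reason the \u00a0 escapes show up only in the JSON line is that json.dumps escapes non-ASCII characters by default (ensure_ascii=True), while print writes the character as is. A small sketch using the sample value from the output above:

```python
import json
import unicodedata

# Sample value taken from the output above: the date with U+00A0 between the parts.
text = "03\u00a0Nov\u00a019\u00a017:51"

# Default ensure_ascii=True turns U+00A0 into the six-character escape \u00a0.
escaped = json.dumps(text)

# With ensure_ascii=False the raw non-breaking space is kept in the output.
raw = json.dumps(text, ensure_ascii=False)

# NFKD maps U+00A0 to an ordinary space, so nothing is left to escape.
normalized = json.dumps(unicodedata.normalize("NFKD", text))

print(escaped)
print(normalized)
```

Either way the non-breaking space is still in the data until it is replaced or normalized; ensure_ascii only changes how it is written.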

score 2 · Accepted Answer

Granted, this doesn't use BeautifulSoup, but depending on your data and exactly what you want to do, a simpler solution today may be some combination of html.unescape and unicodedata.normalize:

>>> from html import unescape
>>> s = unescape('An enthusiastic member of the&nbsp;community')  # &nbsp; becomes U+00A0
>>> s
'An enthusiastic member of the\xa0community'
>>> import unicodedata
>>> s = unicodedata.normalize('NFKC', s)
>>> s
'An enthusiastic member of the community'
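One caveat worth checking before relying on normalize: the NFKC and NFKD forms rewrite every compatibility character, not only the non-breaking space. The sketch below uses a made-up sample string to compare normalization with a targeted replace.

```python
import unicodedata

# Made-up sample: "x²", NBSP, "=", NBSP, "4™"
s = "x\u00b2\u00a0=\u00a04\u2122"

# NFKC replaces the NBSP, but also folds the superscript two and the trademark sign.
normalized = unicodedata.normalize("NFKC", s)

# A targeted replace touches only U+00A0 and leaves the other characters alone.
targeted = s.replace("\xa0", " ")

print(repr(normalized))
print(repr(targeted))
```

If only the spaces should change, the targeted replace is the safer choice; normalization is convenient when folding such characters is acceptable.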
