python - How do I unescape HTML entities in a string in Python 3.1?

Question

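I've looked all around and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.x. (I only have access to a Win7 box.)

I need to be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to curl from the command prompt (that's how I'm getting the source code for pages). Unfortunately, curl doesn't decode HTML entities as far as I know; I couldn't find a command to decode them in the documentation.

Yes, I've tried to get Beautiful Soup to work, many times, without success in 3.x. If you could provide explicit instructions on how to get it working in Python 3 in an MS Windows environment, I'd be very grateful.

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".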

score 216 · Accepted Answer

You can use the function html.unescape.

For Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

For Python 3.3 or older (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

For Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
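
If the same script has to run on several of these versions, the three variants above can be folded into one import. This is only a sketch; the fallback order is my own choice, not something from the answer:

```python
# Pick whichever unescape implementation this interpreter provides.
try:
    from html import unescape              # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 - 3.3
    except ImportError:
        from HTMLParser import HTMLParser   # Python 2
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))
print(unescape('&quot;&#39;&#x41;'))  # named, decimal and hex references
```

After this, the rest of the code just calls unescape() and does not need to know which branch was taken.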

score 15 · Accepted Answer

You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
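
Note that by default this function only handles &amp;amp;, &amp;lt; and &amp;gt;. Other entities can be supplied through the optional entities dictionary. A small sketch, where the extra_entities mapping is just an illustrative choice:

```python
from xml.sax.saxutils import unescape

# Hypothetical extra mapping: entity text -> replacement character.
extra_entities = {"&quot;": '"', "&apos;": "'"}

text = "&quot;Suzy &amp; John&quot;"

default_result = unescape(text)                   # quotes are left escaped
extended_result = unescape(text, extra_entities)  # quotes are decoded too

print(default_result)
print(extended_result)
```

This still will not decode arbitrary named or numeric entities, so it fits best when the input is known to contain only a handful of them.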

score 8 · Accepted Answer

Apparently I don't have enough reputation to do anything but post this. unutbu's answer doesn't unescape quotes. The only thing I found that did was this function:

import re
from htmlentitydefs import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):        
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)
            if cp:
                return unichr(cp)
            else:
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]

I got it from this page.
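
The function above is Python 2 code (htmlentitydefs and unichr do not exist in 3.x). A rough Python 3 port might look like the following. The name decode_html_entities and the handling of hex references such as &amp;#x41; are my additions, not part of the original function:

```python
import re
from html.entities import name2codepoint

_ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")

def decode_html_entities(text):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Numeric reference: decimal (&#65;) or hex (&#x41;).
            try:
                if ent[:1] in ("x", "X"):
                    return chr(int(ent[1:], 16))
                return chr(int(ent))
            except ValueError:
                return match.group()  # malformed: leave untouched
        cp = name2codepoint.get(ent)
        # Unknown names are left as they were.
        return chr(cp) if cp else match.group()
    return _ENTITY_RE.sub(substitute_entity, text)

print(decode_html_entities("Suzy &amp; John &quot;hi&quot; &#65;&#x42; &bogus;"))
```

Entities it does not recognise are passed through unchanged rather than raising.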

score 3 · Accepted Answer


Python 3.x also has html.entities.
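
For example, the lookup tables in that module can be used like this (a small sketch; the variable names are only for illustration):

```python
from html.entities import name2codepoint, codepoint2name

# Entity name -> character
ampersand = chr(name2codepoint["amp"])

# Character -> entity name
quote_name = codepoint2name[ord('"')]

print(ampersand)
print(quote_name)
```

These are only tables, so the actual search and replace over the string still has to be written separately.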

answered 2010-03-02T03:01:41.623

score 2 · Accepted Answer

In my case I have an HTML string escaped with the AS3 escape function. After an hour of googling I didn't find anything useful, so I wrote this recursive function to serve my needs. Here it is:

def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    else:
        #if it is escaped unicode character do different decoding
        if string[index+1:index+2] == 'u':
            replace_with = ("\\"+string[index+1:index+6]).decode('unicode_escape')
            string = string.replace(string[index:index+6],replace_with)
        else:
            replace_with = string[index+1:index+3].decode('hex')
            string = string.replace(string[index:index+3],replace_with)
        return unescape(string)

Edit-1: Added the ability to handle Unicode characters.
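
The function above relies on str.decode, which only exists in Python 2. For Python 3, a non-recursive sketch of the same idea could use a single regular expression. The name unescape_as3 is hypothetical, and I am assuming that %XX stands for a single code point and %uXXXX for a 16-bit code unit:

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
_ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def unescape_as3(text):
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(replace, text)

print(unescape_as3("Suzy%20%26%20John"))
print(unescape_as3("caf%u00e9"))
```

A lone "%" that is not followed by a valid escape is left in place, which also avoids the unbounded recursion the original would hit on such input.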

score 1 · Accepted Answer

I'm not sure whether this is a built-in library, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}) Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
