python - How do I unescape apostrophes and such in Python?

Question

I have a string with symbols like this:

&#39;

That's an apostrophe, apparently.

I tried saxutils.unescape() without any luck, and I also tried urllib.unquote().

How can I decode this? Thanks!

score 2 · Accepted Answer

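Check out this question. What you're looking for is "HTML entity decoding". Typically there is a function named something like "htmldecode" that does what you want. Django and Cheetah both provide such functions, as does BeautifulSoup.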
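
As a minimal sketch of what "HTML entity decoding" looks like in practice, and assuming Python 3.4 or later (newer than the Python 2 code elsewhere in this thread), the standard library's html.unescape handles decimal, hex, and named references:

```python
import html

# html.unescape decodes decimal (&#39;), hex (&#x27;) and named (&lt;) references.
apostrophe = html.unescape("l&#39;eau")
hex_and_accent = html.unescape("L&#x27;aldil&#xE0; (1981)")
named = html.unescape("foo &lt; bar")

print(apostrophe)      # l'eau
print(hex_and_accent)  # L'aldilà (1981)
print(named)           # foo < bar
```

For context, xml.sax.saxutils.unescape only handles &amp;amp;, &amp;lt; and &amp;gt; unless you pass it an extra entities dict, and urllib's unquote decodes percent-encoding such as %27, so neither one touches a numeric reference like &amp;#39;.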

The other answer will work just fine if you don't want to use a library and all of the entities are numeric.

score 2 · Accepted Answer

Try this: (found it here)

from htmlentitydefs import name2codepoint as n2cp
import re

def decode_htmlentities(string):
    """
    Decode HTML entities–hex, decimal, or named–in a string
    @see http://snippets.dzone.com/posts/show/4569

    >>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
    >>> print decode_htmlentities(u).encode('UTF-8')
    E tu vivrai nel terrore - L'aldilà (1981)
    >>> print decode_htmlentities("l&#39;eau")
    l'eau
    >>> print decode_htmlentities("foo &lt; bar")                
    foo < bar
    """
    def substitute_entity(match):
        ent = match.group(3)
        if match.group(1) == "#":
            # decoding by number
            if match.group(2) == '':
                # number is in decimal
                return unichr(int(ent))
            elif match.group(2) == 'x':
                # number is in hex
                return unichr(int('0x'+ent, 16))
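        else:
            # they were using a name; the regex may have split a leading
            # "x" into group 2 (e.g. "&xi;"), so rejoin it before the lookup
            cp = n2cp.get(match.group(2) + ent)
            if cp: return unichr(cp)
            else: return match.group()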

    entity_re = re.compile(r'&(#?)(x?)(\w+);')
    return entity_re.subn(substitute_entity, string)[0]
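
The function above is written for Python 2 (htmlentitydefs, unichr, and the print statement). Below is a rough sketch of the same idea for Python 3, not a drop-in replacement. The name decode_htmlentities_py3 and the stricter regular expression are assumptions made for illustration; only html.entities.name2codepoint comes from the standard library.

```python
import re
from html.entities import name2codepoint

# One alternative per reference type: hex (&#x27;), decimal (&#39;), named (&lt;).
_ENTITY_RE = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|(\w+));")


def decode_htmlentities_py3(text):
    """Decode hex, decimal, and named HTML entities in a str."""

    def substitute_entity(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits:
                return chr(int(hex_digits, 16))
            if dec_digits:
                return chr(int(dec_digits))
        except (ValueError, OverflowError):
            # Number is outside the Unicode range: leave the text untouched.
            return match.group(0)
        code_point = name2codepoint.get(name)
        # Unknown names are returned unchanged rather than dropped.
        return chr(code_point) if code_point is not None else match.group(0)

    return _ENTITY_RE.sub(substitute_entity, text)


print(decode_htmlentities_py3("l&#39;eau"))
print(decode_htmlentities_py3("foo &lt; bar"))
```

Matching the three reference types with separate alternatives means a named entity that starts with "x", such as &amp;xi;, is looked up by its full name.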

score 1 · Accepted Answer

The most robust solution seems to be this function by Python luminary Fredrik Lundh. It is not the shortest solution, but it handles named entities as well as hex and decimal codes.
