python - How do I perform HTML decoding/encoding using Python/Django?

Question

I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

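I want this to register as HTML so that the browser renders it as an image instead of displaying it as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup. It "scans" a web page, gets certain content from it, and then returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?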

Related

Convert XML/HTML Entities into Unicode String in Python

score 136 · Accepted Answer

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;')
                     .replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

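To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems: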

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.3 (HTMLParser.unescape is deprecated since 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
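
As a quick check against the kind of string in the question, here is a minimal sketch assuming Python 3.4 or later. The names `encoded` and `decoded` are illustrative, and the string is a shortened form of the one in the question.

```python
from html import unescape

# Shortened form of the HTML-encoded string from the question (illustrative).
encoded = '&lt;img title=&quot;su1&quot; alt=&quot;&quot; width=&quot;300&quot; /&gt;'

# unescape turns &lt;, &gt; and &quot; back into <, > and ".
decoded = unescape(encoded)
print(decoded)  # <img title="su1" alt="" width="300" />
```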

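As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template: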

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

score 134 · Accepted Answer

With the standard library:

HTML Escape

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML Unescape

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))
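
To see the two functions working as a pair, here is a minimal round-trip sketch assuming Python 3.4 or later. The sample string `raw` is made up for illustration.

```python
from html import escape, unescape

# Illustrative input containing every character that html.escape touches.
raw = '<a href="x">Tom & Jerry\'s</a>'

# quote=True (the default) also escapes double and single quotes.
encoded = escape(raw, quote=True)
print(encoded)

# unescape reverses every entity that escape produced.
round_tripped = unescape(encoded)
print(round_tripped == raw)  # True
```

Note that html.escape writes the single quote as `&#x27;`, whereas Django's escape shown earlier uses `&#39;`; html.unescape decodes both forms.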

score 80 · Accepted Answer

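For HTML encoding, there's cgi.escape from the standard library (a Python 2 era function; it was removed in Python 3.8 in favor of html.escape):

>>> help(cgi.escape)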
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.
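
The regex-based unescape above is Python 2 code (htmlentitydefs, unichr). A rough Python 3 port might look like the sketch below; the names `_ENTITY_RE` and `unescape_named` are hypothetical, and it only handles named entities listed in html.entities.name2codepoint, not numeric references such as `&#39;`.

```python
import re
from html.entities import name2codepoint

# Match only the named entities that name2codepoint knows about.
_ENTITY_RE = re.compile('&(%s);' % '|'.join(name2codepoint))

def unescape_named(s):
    """Replace known named entities such as &eacute; with their characters."""
    return _ENTITY_RE.sub(lambda m: chr(name2codepoint[m.group(1)]), s)

print(unescape_named('Sacr&eacute; bleu &amp; co'))  # Sacré bleu & co
```

On Python 3.4 and later, html.unescape covers the same ground and also handles numeric references, so a hand-rolled version like this is mainly useful for understanding the technique.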

score 20 · Accepted Answer

Use Daniel's solution if the set of encoded characters is relatively restricted. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation:

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

score 8 · Accepted Answer

See the bottom of this page on the Python wiki; there are at least two options for "unescaping" HTML.

score 5 · Accepted Answer

I found a fine function at http://snippets.dzone.com/posts/show/4569:

def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

score 4 · Accepted Answer

If anyone is looking for a simple way to do this via the Django templates, you can always use filters like this:

<html>
{{ node.description|safe }}
</html>

I had some data coming in from a vendor, and everything I posted had HTML tags actually written on the rendered page, as if you were looking at the source.

score 2 · Accepted Answer

I found this in the Cheetah source code (here):

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

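I'm not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you, I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

I noticed your title asks for encoding too, so here is Cheetah's encode function.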
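
On the ordering question, here is a small sketch of why `&amp;` is best decoded last. It reuses the same four pairs as htmlCodes; the names `PAIRS` and `decode` are illustrative and not part of Cheetah.

```python
# Same (character, entity) pairs as htmlCodes, in encoding order.
PAIRS = [('&', '&amp;'), ('<', '&lt;'), ('>', '&gt;'), ('"', '&quot;')]

def decode(s, pairs):
    """Apply the entity -> character replacements in the given order."""
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s

# Encoding the literal text "&lt;" yields "&amp;lt;".
encoded = '&amp;lt;'

same_order = decode(encoded, PAIRS)            # '&amp;' decoded first
reversed_order = decode(encoded, PAIRS[::-1])  # '&amp;' decoded last

print(same_order)      # <     (decoded twice, which is wrong)
print(reversed_order)  # &lt;  (the original text)
```

Decoding `&amp;` first produces a fresh `&lt;` that the later replacement then decodes a second time, so the reversed order is what makes decoding the true inverse of encoding.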

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

score 1 · Accepted Answer

You can also use django.utils.html.escape:

from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])

score 0 · Accepted Answer

Below is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and it assumes that all entities decode to one code point, which is wrong for entities like &NotEqualTilde;:

http://www.w3.org/TR/html5/named-character-references.html

NotEqualTilde;     U+02242 U+00338    ≂̸

With those caveats though, here's the code.

import re
import htmlentitydefs  # Python 2 module; html.entities in Python 3

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
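
For comparison, on Python 3.4 and later the standard library's html.unescape uses the HTML5 entity table, so it handles the multi-code-point case mentioned above. A minimal sketch; the variable names are illustrative.

```python
from html import unescape

# &NotEqualTilde; decodes to two code points: U+2242 followed by U+0338.
decoded = unescape('&NotEqualTilde;')
print([hex(ord(ch)) for ch in decoded])  # ['0x2242', '0x338']

# Decimal and hexadecimal numeric references are handled as well.
numeric = unescape('&#233; &#xe9;')
print(numeric)  # é é
```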

score 0 · Accepted Answer

This is the easiest solution to this problem:

{% autoescape off %}
   {{ body }}
{% endautoescape %}

From this page.
