1 Answer
Given a Unicode string containing a single character:
symbol = u'★'
It can be converted to the HTML syntax like so:
html = '&#{};'.format(ord(symbol))
To convert back, extract the number by stripping off the &# and ; then convert it to an integer and pass it to chr (Python 3) or unichr (Python 2).
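The round trip described above can be sketched as follows. This is a minimal illustration for Python 3; the function name decode_decimal_reference is made up for this example and only handles the decimal form.

```python
def decode_decimal_reference(ref):
    """Turn a decimal reference such as '&#9733;' back into its character."""
    if not (ref.startswith('&#') and ref.endswith(';')):
        raise ValueError('not a numeric character reference: {!r}'.format(ref))
    number = int(ref[2:-1])  # strip the leading '&#' and the trailing ';'
    return chr(number)       # on Python 2, use unichr(number) instead


symbol = '★'
html = '&#{};'.format(ord(symbol))
print(html)                                      # the encoded form
print(decode_decimal_reference(html) == symbol)  # round trip check
```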
If your input does not come from the conversion above, you may also need to handle hexadecimal references, which look like &#xZZZ; where ZZZ is a run of hexadecimal digits. To detect these, check whether the part after &# starts with an x, and if so parse the remainder with radix 16.
Furthermore, you may need to deal with named entities. See the last two paragraphs for that.
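Handling both the decimal and the hexadecimal form might look like the sketch below. The function name decode_numeric_reference is hypothetical, and accepting an uppercase X as well as a lowercase x is an addition here, since HTML allows both.

```python
def decode_numeric_reference(ref):
    """Decode '&#9733;' (decimal) or '&#x2605;' (hexadecimal) to a character."""
    if not (ref.startswith('&#') and ref.endswith(';')):
        raise ValueError('not a numeric character reference: {!r}'.format(ref))
    body = ref[2:-1]  # whatever sits between '&#' and ';'
    if body[:1] in ('x', 'X'):
        # Hexadecimal form: drop the x and parse with radix 16.
        return chr(int(body[1:], 16))
    # Decimal form.
    return chr(int(body))


print(decode_numeric_reference('&#9733;') == decode_numeric_reference('&#x2605;'))
```

Malformed input such as &#; or &#xZZ; raises ValueError from int, which is usually what you want for a strict decoder.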
If you want Python to deal with the encoding of a whole string, you can use this:
text = u"I like symb★ls!"
html = text.encode('ascii', errors='xmlcharrefreplace').decode('ascii')
Unfortunately, there is no equivalent codec option for decoding, and this also does not escape potentially hazardous HTML characters such as < (which may or may not be what you want). If you need to decode, perhaps use a proper HTML parser, which will also be able to deal with named entities like &clubs; (♣).
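If pulling in a full parser is more than you need, Python 3's standard html module may be enough. The sketch below assumes Python 3.4 or later (where html.unescape is available). It escapes the hazardous characters first, then applies the xmlcharrefreplace step, and finally decodes everything again. The variable names are illustrative only.

```python
import html

text = 'I like symb★ls & <tags>!'

# Escape &, <, > (and quotes) first, then replace non-ASCII characters
# with numeric character references.
escaped = html.escape(text)
encoded = escaped.encode('ascii', errors='xmlcharrefreplace').decode('ascii')
print(encoded)

# html.unescape handles decimal, hexadecimal, and named references.
decoded = html.unescape(encoded)
print(decoded == text)
print(html.unescape('&clubs;'))
```

The order matters: escaping after the xmlcharrefreplace step would turn the & of each numeric reference into &amp; and break the references.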
If you want to deal with named entities and do not want to use a real HTML parser, there is a machine-readable list of entities that can be read with Python's json module.
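A lookup built on such a list could be sketched like this. The shape of the data is an assumption: each key is the full entity, including & and ;, and maps to an object with a "characters" field. The two-entry sample string stands in for the real file, and decode_named_entity is a made-up helper name.

```python
import json

# Stand-in for the real entity list; the structure shown here is assumed.
SAMPLE_ENTITIES_JSON = '''
{
  "&clubs;": {"codepoints": [9827], "characters": "♣"},
  "&amp;": {"codepoints": [38], "characters": "&"}
}
'''

entities = json.loads(SAMPLE_ENTITIES_JSON)


def decode_named_entity(ref, table=entities):
    """Return the character(s) for a named entity, or ref unchanged if unknown."""
    entry = table.get(ref)
    if entry is None:
        return ref
    return entry['characters']


print(decode_named_entity('&clubs;'))
print(decode_named_entity('&bogus;'))  # unknown names pass through untouched
```

With the real file you would replace the sample string with the file's contents, for example via json.load on an open file object.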