python - Processing a page with UTF-8

Question

I'm playing around with urllib2 and a UTF-8 page.

http://www.columbia.edu/~fdc/utf8/

Fetching only the first 700 bytes (the top segment):

>>> import urllib2
>>> from urllib2 import HTTPError, URLError
>>> import BaseHTTPServer
>>> opener = urllib2.OpenerDirector()
>>> opener.add_handler(urllib2.HTTPHandler())
>>> opener.add_handler(urllib2.HTTPDefaultErrorHandler())
>>> response = opener.open('http://www.columbia.edu/~fdc/utf8/')
>>> content = response.read(700)

From here, I assume the string in the content variable is UTF-8 encoded and should display reasonably well.

But:

>>> content
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n<BASE href="http://kermit.columbia.edu">\n<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n<title>UTF-8 Sampler</title>\n</head>\n<body bgcolor="#ffffff" text="#000000">\n<h1><tt>UTF-8 SAMPLER</tt></h1>\n\n<big><big>&nbsp;&nbsp;\xc2\xa5&nbsp;\xc2\xb7&nbsp;\xc2\xa3&nbsp;\xc2\xb7&nbsp;\xe2\x82\xac&nbsp;\xc2\xb7&nbsp;$&nbsp;\xc2\xb7&nbsp;\xc2\xa2&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa1&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa2&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa3&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa4&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa5&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa6&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa7&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa8&nbsp;\xc2\xb7&nbsp;\xe2\x82\xa9&nbsp;\xc2\xb7&nbsp;\xe2\x82\xaa&nbsp;\xc2\xb7&nbsp;\xe2\x82\xab&nbsp;\xc2\xb7&nbsp;\xe2\x82\xad&nbsp;\xc2\xb7&nbsp;\xe2\x82\xae&nbsp;\xc2\xb7&nbsp;\xe2\x82\xaf&nbsp;\xc2\xb7&nbsp;&#8377</big></big>\n\n\n\n<p>\n<blockquote>\nFrank da Cruz<br>\n<a hre'

It looks like the HTML has been escaped, so:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape(content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

I don't get it. I tried calling .encode('utf-8') without unescaping, and got a similar error.

What is the best way to display UTF-8 content from a website?

score 3 · Accepted Answer

You need to decode the page from UTF-8 to Unicode; it contains UTF-8 sequences (next to the non-breaking-space HTML entities):

>>> print h.unescape(content.decode('utf8'))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · &#8377</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre

You had encoding and decoding confused; the content is already UTF-8 encoded, so it has to be decoded, not encoded.
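A minimal sketch of the same decode-then-unescape order in Python 3, which is an assumption here since the question uses Python 2. In Python 3 the standard-library function html.unescape takes the place of HTMLParser.unescape. The bytes literal is a hand-copied fragment of the output shown in the question, not a live fetch:

```python
import html

# Hypothetical sample: a few bytes copied from the page output in the
# question, standing in for response.read(700).
content = b'&nbsp;&nbsp;\xc2\xa5&nbsp;\xc2\xb7&nbsp;\xe2\x82\xac'

# Step 1: decode the UTF-8 bytes into text.
text = content.decode('utf-8')

# Step 2: resolve HTML entities such as &nbsp; in the decoded text.
unescaped = html.unescape(text)

print(unescaped)
```

Doing the steps in the other order is what fails in Python 2: unescaping the undecoded byte string forces an implicit ASCII decode, which is where the UnicodeDecodeError in the question comes from.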

Note the &#8377: that is an error in the page itself, where the ; was omitted. An HTML5 parser or a browser would probably recover from this. Once the ; is added, the reference decodes:

>>> print h.unescape('&#8377;')
₹

You would have to repair those entities with a regular expression first:

>>> import re
>>> brokenrefs = re.compile(r'(&#x?[a-f0-9]+)\b', re.I)
>>> print h.unescape(brokenrefs.sub(r'\1;', content.decode('utf8')))
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre
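One caveat with a pattern that ends in \b is that it also matches references that already end in ;, which would then get a second ;. A hedged sketch of a stricter variant for Python 3 (an assumption; the thread uses Python 2) follows. The names BROKEN_REF and fix_refs and the sample string are hypothetical and not from the original answer:

```python
import html
import re

# Match a numeric character reference (hex or decimal) only when it is
# NOT already followed by ';' or by another digit of the same kind.
BROKEN_REF = re.compile(
    r'&#(?:[xX][0-9a-fA-F]+(?![0-9a-fA-F;])|[0-9]+(?![0-9;]))'
)


def fix_refs(text):
    """Append the missing ';' to unterminated numeric references."""
    return BROKEN_REF.sub(lambda m: m.group(0) + ';', text)


# Hypothetical sample: one broken decimal reference, one correct one,
# and one broken hexadecimal reference.
sample = '&#8377</big> and &#8377; and &#x20b9<br>'
fixed = fix_refs(sample)

print(fixed)
print(html.unescape(fixed))
```

In Python 3 the repair step may not be needed at all: html.unescape is documented as following the HTML5 rules for invalid character references, and it accepts a numeric reference with no trailing ;.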

score 0 · Answer

You have misread your output. Nothing in content is HTML-encoded; simply entering content at the REPL shows you the repr()-ed version of the text.

Doing print content will give you what you expect:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<BASE href="http://kermit.columbia.edu">
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 Sampler</title>
</head>
<body bgcolor="#ffffff" text="#000000">
<h1><tt>UTF-8 SAMPLER</tt></h1>

<big><big>&nbsp;&nbsp;¥&nbsp;·&nbsp;£&nbsp;·&nbsp;€&nbsp;·&nbsp;$&nbsp;·&nbsp;¢&nbsp;·&nbsp;₡&nbsp;·&nbsp;₢&nbsp;·&nbsp;₣&nbsp;·&nbsp;₤&nbsp;·&nbsp;₥&nbsp;·&nbsp;₦&nbsp;·&nbsp;₧&nbsp;·&nbsp;₨&nbsp;·&nbsp;₩&nbsp;·&nbsp;₪&nbsp;·&nbsp;₫&nbsp;·&nbsp;₭&nbsp;·&nbsp;₮&nbsp;·&nbsp;₯&nbsp;·&nbsp;&#8377</big></big>



<p>
<blockquote>
Frank da Cruz<br>
<a hre
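The difference between the echoed repr() and the printed text can be sketched as follows. This assumes Python 3, not the Python 2 used in the thread: there the repr of a byte string carries a b prefix, and print on undecoded bytes still shows the escaped form, so the sketch decodes explicitly. The sample bytes are a hand-copied fragment of the page:

```python
# Hypothetical sample: the first few bytes of the currency line.
content = b'\xc2\xa5&nbsp;\xc2\xb7&nbsp;\xc2\xa3'

# What the interactive prompt echoes: repr(), with every non-ASCII
# byte written out as a \x escape.
echoed = repr(content)

# What the data actually contains once decoded as UTF-8. The &nbsp;
# entities are untouched, because nothing here unescapes HTML.
decoded = content.decode('utf-8')

print(echoed)
print(decoded)
```

The \x escapes exist only in the displayed representation; the underlying bytes are the same in both cases.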
