python - Python: Encoding error - web page content

Question

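I am trying to fetch the content of a web page and parse it, then save it to a MySQL database.

This already works for web pages that are encoded as UTF-8.

However, when I try a web page with 8859-9 encoding, I get an error.

My code to get the page content: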

import urllib2
import chardet


def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would overwrite the first.
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    response = opener.open(url).read()
    # print chardet.detect(response).get('encoding')
    opener.close()
    return response



url = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
contentofpage = getcontent(url)
print contentofpage
print chardet.detect(contentofpage)
print contentofpage.encode("utf-8")

Output of the page content: ...E�itimTeknolojileriGenelM�d�rl��...

{'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}


Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

In fact, the page is in Turkish and its encoding is 8859-9.

When I try the default encoding, I see �� in place of some characters. How can I get or convert the page content to UTF-8 or Turkish (iso-8859-9)?

Also, when I use unicode(contentofpage), I get:

Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)

Any help?

score 4 · Accepted Answer

The content is already an encoded byte string, so you want to decode it rather than encode it.

print contentofpage.decode("iso-8859-9")

This produces output such as:

Eğitim Teknolojileri Genel Müdürlüğü
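
For anyone on Python 3, the same idea can be sketched as below. This is only an illustration: the sample bytes in RAW are an assumption rather than the real page content, and the helper names (to_text, to_utf8, ascii_decode_fails) are made up for this example. It shows why the original call failed (in Python 2, calling encode on a byte string first triggers an implicit ASCII decode), how to decode with the correct source encoding, and how to re-encode as UTF-8 for storage.

```python
# Python 3 sketch. RAW is an assumed sample, not the real page content.
# The bytes 0xf0 and 0xfc are "ğ" and "ü" in ISO-8859-9.
RAW = b"E\xf0itim Teknolojileri Genel M\xfcd\xfcrl\xfc\xf0\xfc"


def to_text(raw, source_encoding="iso-8859-9"):
    """Decode raw page bytes into a text (Unicode) string."""
    return raw.decode(source_encoding)


def to_utf8(raw, source_encoding="iso-8859-9"):
    """Re-encode raw page bytes as UTF-8, e.g. before storing them."""
    return to_text(raw, source_encoding).encode("utf-8")


def ascii_decode_fails(raw):
    """Mimic the implicit ASCII decode that Python 2 runs before str.encode()."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(to_text(RAW))             # readable Turkish text
    print(to_utf8(RAW))             # the same text as UTF-8 bytes
    print(ascii_decode_fails(RAW))  # non-ASCII bytes break the ASCII codec
```

The UTF-8 bytes returned by to_utf8 are what you would typically hand to the database layer, provided the connection and table are configured for UTF-8.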
