python - URL open encoding

Question

I have the following code using urllib and BeautifulSoup:

getSite = urllib.urlopen(pageName) # open current site   
getSitesoup = BeautifulSoup(getSite.read()) # reading the site content 
print getSitesoup.originalEncoding
for value in getSitesoup.find_all('link'): # extract all <link> tags 
    defLinks.append(value.get('href'))

The result:

/usr/lib/python2.6/site-packages/bs4/dammit.py:231: UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
  "Some characters could not be decoded, and were "

And when I try to read the site, I get the following:

�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��@�����]���(P��^��q�$�S5���tT*�Z

score 2 · Accepted Answer

The page is in UTF-8, but the server is sending it to you in a compressed format:

>>> print getSite.headers['content-encoding']
gzip

You'll need to decompress the data before running it through Beautiful Soup. I got an error when using zlib.decompress() on the data, but writing the data to a file and reading it back with gzip.open() worked fine. I'm not sure why.
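A likely explanation for the zlib.decompress() error: with its default settings it expects a zlib header, while a gzip response starts with a gzip header. The sketch below shows two ways to decompress such a body in memory, without a temporary file. The function names are made up for illustration, and the code assumes the whole body has already been read into a byte string.

```python
import gzip
import io
import zlib


def gunzip_bytes(data):
    """Decompress a gzip-encoded byte string entirely in memory."""
    # GzipFile accepts any file-like object, so no temporary file is needed.
    return gzip.GzipFile(fileobj=io.BytesIO(data)).read()


def gunzip_bytes_zlib(data):
    """Same result using zlib directly."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    # instead of the zlib header it looks for by default.
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)
```

Either function could be applied to the result of getSite.read() before handing the bytes to BeautifulSoup.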

score 2 · Accepted Answer

BeautifulSoup works with Unicode internally. By default, it will try to decode non-Unicode responses as UTF-8.

It looks like the site you are trying to load uses a different encoding. For example, it could be UTF-16 instead:

>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��@�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('utf-16-le')
뿯㞽뿯施뿯붿뿯붿⨰䤢럟뿯䞽뿯䢽뿯붿뿯붿붿뿯붿뿯붿뿯㦽붿뿯붿뿯붿뿯㮽뿯붿붿썙䊞붿뿯붿뿯붿뿯붿뿯붿铣㾶뿯㒽붿뿯붿붿뿯붿뿯붿坞뿯붿뿯붿뿯悽붿敋뿯붿붿뿯⪽붿✮兏붿뿯붿붿뿯䂽뿯붿뿯붿뿯嶽뿯붿뿯⢽붿뿯庽뿯붿붿붿㕓뿯붿뿯璽⩔뿯媽

It could also be mac_cyrillic:

>>> print u"""�7�e����0*"I߷�G�H����F������9-������;��E�YÞBs���������㔶?�4i���)�����^W�����`w�Ke��%��*9�.'OQB���V��@�����]���(P��^��q�$�S5���tT*�Z""".encode('utf-8').decode('mac_cyrillic')
пњљ7пњљeпњљпњљпњљпњљ0*"IяЈпњљGпњљHпњљпњљпњљпњљFпњљпњљпњљпњљпњљпњљ9-пњљпњљпњљпњљпњљпњљ;пњљпњљEпњљY√ЮBsпњљпњљпњљпњљпњљпњљпњљпњљпњљгФґ?пњљ4iпњљпњљпњљ)пњљпњљпњљпњљпњљ^Wпњљпњљпњљпњљпњљ`wпњљKeпњљпњљ%пњљпњљ*9пњљ.'OQBпњљпњљпњљVпњљпњљ@пњљпњљпњљпњљпњљ]пњљпњљпњљ(Pпњљпњљ^пњљпњљqпњљ$пњљS5пњљпњљпњљtT*пњљZ

But you give far too little information about what kind of site you are trying to load, and I cannot read the output of either encoding. :-)

You'll need to decode the data read from getSite before passing it to BeautifulSoup:

getSite = urllib.urlopen(pageName).read().decode('utf-16')

Generally, a website reports the encoding it used in the response headers, in the form of a Content-Type header (probably text/html; charset=utf-16, or similar).
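As a rough sketch of reading that charset, the standard library's email.message.Message can parse a Content-Type value. The helper name charset_from_content_type is hypothetical, and returning None when no charset is declared is an assumption made here, not something the header format prescribes.

```python
from email.message import Message


def charset_from_content_type(value, default=None):
    """Return the charset parameter of a Content-Type value, lower-cased."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns `default` when no charset is declared.
    return msg.get_content_charset(default)


# Example with the header value mentioned above.
declared = charset_from_content_type("text/html; charset=utf-16")
```

With a response object, the value to pass in would be something like getSite.headers['content-type'].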

score 1 · Accepted Answer

I ran into the same problem, and as Leonard mentioned, it was due to the compressed format.

This link solved it for me: it says to add ('Accept-Encoding', 'gzip,deflate') to the request headers. For example:

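import urllib2

# fetch() is an illustrative wrapper name; the snippet is a function body.
def fetch(url, referer, uagent):
    opener = urllib2.build_opener()
    # Tell the server that compressed responses are acceptable.
    opener.addheaders = [('Referer', referer),
                         ('User-Agent', uagent),
                         ('Accept-Encoding', 'gzip,deflate')]
    usock = opener.open(url)
    url = usock.geturl()  # final URL after any redirects
    data = decode(usock)  # decompress the body if needed
    usock.close()
    return data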

The function decode() is defined as follows:

import gzip
import zlib
import StringIO

def decode(page):
    # Always return the body as a string, decompressed when necessary.
    encoding = page.info().get("Content-Encoding")
    content = page.read()
    if encoding in ('gzip', 'x-gzip', 'deflate'):
        if encoding == 'deflate':
            data = StringIO.StringIO(zlib.decompress(content))
        else:
            data = gzip.GzipFile('', 'rb', 9, StringIO.StringIO(content))
        content = data.read()

    return content
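One caveat with the 'deflate' branch: some servers send a raw deflate stream rather than the zlib-wrapped one, in which case a plain zlib.decompress() call fails. Below is a minimal sketch of a fallback, written against a byte string and a header value so it does not depend on urllib2. The name decompress_body is hypothetical.

```python
import gzip
import io
import zlib


def decompress_body(content, content_encoding):
    """Return content decompressed according to a Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.GzipFile(fileobj=io.BytesIO(content)).read()
    if encoding == "deflate":
        try:
            # zlib-wrapped deflate stream (has a zlib header)
            return zlib.decompress(content)
        except zlib.error:
            # raw deflate stream (no header), sent by some servers
            return zlib.decompress(content, -zlib.MAX_WBITS)
    # No recognized compression: hand the bytes back unchanged.
    return content
```

It could be called as decompress_body(usock.read(), usock.info().get("Content-Encoding")).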
