python - A good way to get the charset/encoding of an HTTP response in Python

Question

I'm looking for a simple way to get the charset/encoding of an HTTP response using Python's urllib2 or any other Python library.

>>> url = 'http://some.url.value'
>>> request = urllib2.Request(url)
>>> conn = urllib2.urlopen(request)
>>> response_encoding = ?

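I know it is sometimes present in the 'Content-Type' header, but that header carries other information as well, and the charset is embedded in a string I would have to parse. For example, the Content-Type header returned by Google is: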

>>> conn.headers.getheader('content-type')
'text/html; charset=utf-8'

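I could work with that, but I'm not sure how consistent the format is. I'm fairly sure the charset can be missing entirely, so I'd have to handle that edge case. Some kind of string split to pull 'utf-8' out of it seems like the wrong way to go about this.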

>>> content_type_header = conn.headers.getheader('content-type')
>>> if '=' in content_type_header:
...     charset = content_type_header.split('=')[1]

That's the kind of code that feels like it's doing too much work, and I'm not sure it works in every case. Does anyone have a better way to do this?

score 7 · Accepted Answer

If you are familiar with the Flask/Werkzeug web development stack, you'll be happy to know that the Werkzeug library has an answer for exactly this kind of HTTP header parsing, all just the way you wanted.

 >>> from werkzeug.http import parse_options_header
 >>> import requests
 >>> url = 'http://some.url.value'
 >>> resp = requests.get(url)
 >>> if resp.status_code == requests.codes.ok:
 ...     content_type_header = resp.headers.get('content-type')
 ...     print(content_type_header)
 ...
 text/html; charset=utf-8
 >>> parse_options_header(content_type_header) 
 ('text/html', {'charset': 'utf-8'})

So you can do:

 >>> parse_options_header(content_type_header)[1].get('charset')
 'utf-8'

Note that if the charset is not supplied, this produces instead:

 >>> parse_options_header('text/html')
 ('text/html', {})

It even works if you supply nothing but an empty string or an empty dict:

 >>> parse_options_header({})
 ('', {})
 >>> parse_options_header('')
 ('', {})

So it seems to be exactly what you were looking for! If you look at the source code, you'll see it was written with this purpose in mind: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/http.py#L320-329

def parse_options_header(value):
    """Parse a ``Content-Type`` like header into a tuple with the content
    type and the options:
    >>> parse_options_header('text/html; charset=utf8')
    ('text/html', {'charset': 'utf8'})
    This should not be used to parse ``Cache-Control`` like headers that use
    a slightly different format.  For these headers use the
    :func:`parse_dict_header` function.
    ...

Hope this helps someone someday! :)

score 5 · Accepted Answer

The requests library makes this easy:

>>> import requests
>>> r = requests.get('http://some.url.value')
>>> r.encoding
'utf-8' # e.g.

score 3 · Accepted Answer

The charset can be specified in several ways, but it is often given in the headers.

>>> from urllib.request import urlopen  # Python 3
>>> urlopen('http://www.python.org/').info().get_content_charset()
'utf-8'
>>> urlopen('http://www.google.com/').info().get_content_charset()
'iso-8859-1'
>>> urlopen('http://www.python.com/').info().get_content_charset()
>>>

The last one didn't specify a charset anywhere, so get_content_charset() returned None.
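
If you already have the raw Content-Type value as a string (for example from another HTTP client), the same standard-library parser can be applied to it without opening a connection. This is only a minimal sketch for Python 3: the helper name `charset_from_content_type` and its `default` parameter are my own, not part of any library.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, or `default`.

    Hypothetical helper: it wraps the header in an email.message.Message
    so the standard library does the parameter parsing.
    """
    if header_value is None:
        # No Content-Type header at all.
        return default
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() unquotes and lowercases the value, and
    # returns `failobj` when no charset parameter is present.
    return msg.get_content_charset(failobj=default)


print(charset_from_content_type('text/html; charset=utf-8'))
print(charset_from_content_type('text/html', default='utf-8'))
```

Passing a `default` lets the caller decide what to assume when the server says nothing about the encoding, which is the edge case the question was worried about.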

score 0 · Accepted Answer

This is what works perfectly for me. I'm using Python 2.7 and 3.4.

print (text.encode('cp850','replace'))
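
For context, the 'replace' error handler swaps any character the target codec cannot represent for '?', so printing never raises UnicodeEncodeError. It does not tell you which charset the response used. A small illustration for Python 3, where the sample string is my own and not from the answer:

```python
# Sample text (an assumption for illustration): 'é' exists in cp850,
# the euro sign does not.
text = 'caf\u00e9 \u20ac5'

# Characters missing from cp850 are replaced with '?' instead of raising.
encoded = text.encode('cp850', 'replace')
print(encoded)

# Decoding again shows which character was lost in the round trip.
round_trip = encoded.decode('cp850')
print(round_trip)
```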
