python - Encoding in Python with lxml - complex solution

Question

I need to download and parse a web page with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

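import urllib2
from lxml import etree

webfile = urllib2.urlopen(url)
# etree.parse() expects a file-like object or filename, not the page contents as a string
root = etree.parse(webfile, parser=etree.HTMLParser(recover=True))

# xpath() returns a list of matches, so take the first <body> element
body = root.xpath('/html/body')[0]
txt = my_process_text(etree.tostring(body, encoding='utf-8'))

output = etree.Element("out")
output.text = txt

outputfile.write(etree.tostring(output, encoding='utf-8'))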

So the webfile can be in any encoding (lxml should handle this), and the output file has to be in utf-8. I'm not sure where to use encoding/decoding. Is this schema OK? (I can't find a good tutorial on lxml and encoding, but I can find many problems with this...) I need a robust solution.

Edit:

So, to send utf-8 to lxml, I use:

        converted = UnicodeDammit(webfile, isHTML=True)
        if not converted.unicode:
            print "ERR. UnicodeDammit failed to detect encoding, tried [%s]" % \
                ', '.join(converted.triedEncodings)
            continue
        webfile = converted.unicode.encode('utf-8')

score 19 · Accepted Answer

lxml can be a little wonky about input encodings. It is best to send UTF-8 in and get UTF-8 out.

You might want to use the chardet module or UnicodeDammit to decode the actual data.

You'd want to do something vaguely like:

import urllib2

import chardet
from lxml import html

content = urllib2.urlopen(url).read()
# chardet may report None when it cannot guess; assume UTF-8 in that case
encoding = chardet.detect(content)['encoding'] or 'utf-8'
if encoding.lower() != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree.
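The re-encoding step in this answer can be isolated into a small helper. The following is only a minimal sketch: the name `to_utf8` and the fallback to UTF-8 when no encoding (or an unknown one) is supplied are assumptions for illustration, not part of the answer. It uses only the standard library, so it runs unchanged on Python 3, where `urllib2` no longer exists.

```python
def to_utf8(content, source_encoding):
    """Re-encode raw bytes as UTF-8 (hypothetical helper).

    Bytes that cannot be decoded are replaced with U+FFFD rather than
    raising, mirroring the 'replace' error handler used in the answer.
    """
    if not source_encoding:
        # Assumption: treat "no detected encoding" as UTF-8.
        source_encoding = "utf-8"
    try:
        text = content.decode(source_encoding, "replace")
    except LookupError:
        # The detector named a codec Python does not know; fall back.
        text = content.decode("utf-8", "replace")
    return text.encode("utf-8")
```

The `encoding` argument would typically come from whatever detector is in use, for example the value returned by `chardet.detect(content)['encoding']`.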

score 2 · Accepted Answer

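lxml's encoding detection is weak.

However, note that the most common problem with web pages is a missing (or incorrect) encoding declaration. It is therefore often sufficient to use only the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

I recommend detecting the encoding with UnicodeDammit and parsing with lxml. You can also use the Content-Type HTTP header (you need to extract charset=ENCODING_NAME) to detect the encoding more precisely.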
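Extracting the charset from a Content-Type value can be done with the standard library's email header parsing. This is a rough sketch rather than the answer's own code: the helper name `charset_from_content_type` is hypothetical, and returning an empty string when no charset is present is an assumption chosen to match the `http_charset == ""` check used in this answer.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lower-cased charset from a Content-Type value, or "".

    Hypothetical helper: header_value is the raw header string, for
    example "text/html; charset=UTF-8".
    """
    if not header_value:
        return ""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset() or ""
```

With `urllib2`, the header value itself would usually be read from the response object before calling `read()`, and the result passed on as `http_charset`.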

The example below uses BeautifulSoup4 (you should also install chardet for better autodetection, because UnicodeDammit uses chardet internally):

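import lxml.html
from bs4 import UnicodeDammit

# content holds the raw response bytes; http_charset is the charset taken
# from the HTTP Content-Type header, or "" if the header had none
if http_charset == "":
    ud = UnicodeDammit(content, is_html=True)
else:
    ud = UnicodeDammit(content, override_encodings=[http_charset], is_html=True)
root = lxml.html.fromstring(ud.unicode_markup)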

Or, to make the previous answer more complete, you can modify it as follows:

if ud.original_encoding != 'utf-8':
    content = content.decode(ud.original_encoding, 'replace').encode('utf-8')

Why is this better than simply using chardet?

1. It does not ignore the Content-Type HTTP header:

   Content-Type: text/html; charset=utf-8

2. It does not ignore the http-equiv meta tag. Example:

   ... http-equiv="Content-Type" content="text/html; charset=UTF-8" ...

3. On top of this, you get the power of chardet, cjkcodecs, iconvcodec and many more codecs.
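The order of precedence in the list above (HTTP header first, then the declaration inside the page, then a fallback) can be illustrated with the standard library alone. This is a simplified sketch, not how UnicodeDammit is implemented: `pick_encoding`, `is_known_codec` and `META_CHARSET_RE` are hypothetical names, the regular expression only covers common `<meta>` forms, and the final step returns a fixed default where a real solution would run statistical detection such as chardet.

```python
import codecs
import re

# Matches <meta charset="..."> as well as the http-equiv form
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def is_known_codec(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


def pick_encoding(content, http_charset=None, default="utf-8"):
    """Choose an encoding name for raw page bytes (hypothetical helper)."""
    # 1. Trust the HTTP Content-Type charset if it names a real codec.
    if http_charset and is_known_codec(http_charset):
        return http_charset
    # 2. Otherwise look for a declaration near the top of the document.
    match = META_CHARSET_RE.search(content[:4096])
    if match:
        name = match.group(1).decode("ascii")
        if is_known_codec(name):
            return name
    # 3. Assumption: fall back to a fixed default instead of detection.
    return default
```

The returned name could then be used to decode `content` before handing the text to lxml.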
