python - Crawling a page with Python and lxml - (, UnicodeEncodeError('ascii',

Question

I am using Python 2.7 and lxml to fetch a page, and I keep getting the following error.

(<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u'Approximate Dimensions: 4\xbd" x 4" x 7" (assembled)', 25, 26, 'ordinal not in range(128)'), <traceback object at 0x7f9198ac48c0>)

I have tried the following:

doc = lxml.html.document_fromstring(html)
for el in doc.iter('h2'):
    el.text_content().decode('utf-8','ignore')
    # OR
    el.text_content().encode('ascii', 'ignore')

How can I resolve these errors? I need to be able to 1) save the text to a text file and then 2) upload that text file to MySQL.

Thanks

score 2 · Accepted Answer

Try:

el.text_content().encode('utf-8')

This assumes the value is Unicode and that you want to save it (as text) in utf-8.
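A minimal sketch of the "save to a text file" step, assuming the headings are already Unicode strings. The helper name `save_lines` and the output file name are hypothetical; `io.open` with an explicit `encoding` is used so the same code runs on Python 2.7 and Python 3 and no implicit ASCII encoding takes place.

```python
import io
import os
import tempfile


def save_lines(lines, path):
    """Write Unicode strings to `path` as UTF-8, one per line."""
    # io.open encodes on write, so the caller never has to call .encode().
    with io.open(path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(line + u'\n')


# The string from the traceback in the question; \xbd is the 1/2 sign.
text = u'Approximate Dimensions: 4\xbd" x 4" x 7" (assembled)'

# Hypothetical output location, used only for this illustration.
out_path = os.path.join(tempfile.mkdtemp(), 'headings.txt')
save_lines([text], out_path)
```

In the question's loop this would probably look like `save_lines([el.text_content() for el in doc.iter('h2')], out_path)`. The resulting file is UTF-8, so the MySQL import would need to be told to read it as UTF-8 as well.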

score 0 · Accepted Answer

The encoding that a page declares in its headers may differ from the one it actually uses. If the page's actual encoding is not utf-8, doing the right thing is a bit tricky.

First, you need to check the text returned by el.text_content():

x = el.text_content()
print x

If escaped sequences such as \x09 still show up, the text has not been decoded yet.
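The check described above can be sketched as a small helper. The name `inspect_text` is hypothetical. It reports whether a value is raw bytes or already-decoded text and returns an escaped form in which tabs and non-ASCII characters are visible. It relies only on the standard `unicode_escape` codec and should run on Python 2.7 and Python 3.

```python
def inspect_text(value):
    """Return (kind, escaped) so hidden or non-ASCII characters become visible."""
    if isinstance(value, bytes):
        # Python 2 `str` / Python 3 `bytes`: not decoded yet.
        return 'bytes', repr(value)
    # Text (`unicode` on Python 2, `str` on Python 3): already decoded.
    return 'text', value.encode('unicode_escape').decode('ascii')


kind, escaped = inspect_text(u'4\xbd"\t')
```

Calling `inspect_text(el.text_content())` on a few headings shows quickly whether the characters are the expected ones or mojibake.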

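If x is unicode (its repr starts with "u") but the characters are wrong, you need to convert the unicode to str and then decode it with the proper encoding (cp1252 or something similar).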

# Map each code point back to a single byte (Python 2); this only works while every character is below U+0100.
chars = ''.join([chr(ord(c)) for c in el.text_content()])
# Decode the byte string with the real encoding; try different encodings (cp1252 here) until none raises an error.
result = chars.decode('cp1252')
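The same repair can be sketched in a form that runs on both Python 2.7 and Python 3. Note that this swaps in a different call from the snippet above: `encode('latin-1')` does the one-code-point-to-one-byte step that `chr(ord(c))` does on Python 2. The function name `redecode` and the candidate list are assumptions for illustration, not a definitive recipe.

```python
def redecode(text, candidates=('utf-8', 'cp1252')):
    """Try to repair text whose bytes were decoded with the wrong encoding."""
    try:
        # latin-1 maps code points 0-255 to the identical byte values.
        raw = text.encode('latin-1')
    except UnicodeEncodeError:
        # Characters above U+00FF: not byte-per-character mojibake, leave as-is.
        return text
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate fit; return the input unchanged.
    return text


# UTF-8 bytes for 4 1/2" that were wrongly decoded one byte per character.
fixed = redecode(u'4\xc2\xbd"')
```

The order of `candidates` matters: utf-8 is tried first because it rejects most byte sequences that are not really utf-8, whereas cp1252 accepts almost anything.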
