python - Converting non-standard characters in Python

Question

I have a list, scraped from a web page, that contains non-standard characters.

Example of the list:

[<td class="td-number-nowidth"> 10Â 115 </td>, <td class="td-number-nowidth"> 4Â 635 (46%) </td>, <td class="td-number-nowidth"> 5Â 276 (52%) </td>, ...]

The A with a hat should be a comma. Can anyone suggest how to convert or replace these so that I can get the value 10115, as for the first value in the list?

Source code:

from urllib import urlopen
from bs4 import BeautifulSoup
import re, string
content = urlopen('http://www.worldoftanks.com/community/accounts/1000395103-FrankenTank').read()
soup = BeautifulSoup(content)

BattleStats = soup.find_all('td', 'td-number-nowidth')
print BattleStats

Thanks, Frank

score 3 · Accepted Answer

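Does the website state its encoding in the response headers? (The charset is normally declared in the Content-Type header; Content-Encoding describes compression such as gzip.) You need to get that encoding and decode the strings in the list with the .decode method. It would look like encoded_string.decode("encoding"), where encoding can be any codec name, utf-8 being one of them.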
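
As a rough illustration of that idea, here is a minimal sketch written for Python 3 (an assumption; the question uses Python 2). The header value and the body bytes are hypothetical stand-ins for a real HTTP response, and charset_from_content_type is a helper name made up for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Hypothetical header and body bytes, standing in for a real response.
content_type = "text/html; charset=utf-8"
raw = b" 10\xc2\xa0115 "  # "10", NO-BREAK SPACE (UTF-8 encoded), "115"

encoding = charset_from_content_type(content_type)
text = raw.decode(encoding)
print(repr(text))
```

Decoded with the right codec, the two bytes \xc2\xa0 become the single character U+00A0 instead of showing up as "Â" followed by a space.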

score 2 · Accepted Answer

You can use the .decode method with the errors='ignore' parameter.

>>> s = '[ 10Â 115 , 4Â 635 (46%) , 5Â 276 (52%) , ...]'
>>> s.decode('ascii', errors='ignore')
u'[ 10 115 , 4 635 (46%) , 5 276 (52%) , ...]'

Here is help(''.decode):

decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
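
To compare the error handlers named in that help text, here is a small sketch for Python 3 (an assumption; the session above is Python 2). The sample bytes are made up for illustration and assume the page encodes the no-break space as UTF-8.

```python
# UTF-8 bytes for "10", NO-BREAK SPACE, "115" (illustrative sample data).
raw = b"10\xc2\xa0115"

# 'ignore' silently drops every byte the codec cannot decode.
ignored = raw.decode("ascii", errors="ignore")

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", errors="replace")

# 'strict' (the default) raises instead of guessing.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored), repr(replaced), strict_failed)
```

Note that 'ignore' discards data without any warning, so it is best suited to cases where the dropped bytes are known to be separators, as they are here.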

score 0 · Accepted Answer

BeautifulSoup handles character encoding automatically. The problem is with printing to a console that apparently does not support some Unicode characters. In this case it is 'NO-BREAK SPACE' (U+00A0):

>>> L = soup.find_all('td', 'td-number-nowidth')
>>> L[0]
<td class="td-number-nowidth"> 10 123 </td>
>>> L[0].get_text()
u' 10\xa0123 '

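Notice that the text is Unicode. Check whether print u'<\u00a0>' works in your case.

You can control the output encoding by setting the PYTHONIOENCODING environment variable before running your script. So, without changing the script, you could specify utf-8 when redirecting the output to a file, and use the ascii:backslashreplace value for debugging runs in the console. Example in bash: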

$ python -c 'print u"<\u00a0>"' # use default encoding
< >
$ PYTHONIOENCODING=ascii:backslashreplace python -c 'print u"<\u00a0>"'
<\xa0>
$ PYTHONIOENCODING=utf-8 python -c 'print u"<\u00a0>"' > output.txt
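
The effect of those two settings can also be reproduced inside a script by encoding explicitly. This is a minimal sketch for Python 3 (an assumption; the shell examples above use Python 2), and the variable names are illustrative only.

```python
text = "<\u00a0>"  # a NO-BREAK SPACE between angle brackets

# Same handler as PYTHONIOENCODING=ascii:backslashreplace:
# characters outside ASCII are written as escape sequences.
escaped = text.encode("ascii", errors="backslashreplace")

# Same codec as PYTHONIOENCODING=utf-8:
# U+00A0 becomes the two bytes 0xC2 0xA0.
utf8 = text.encode("utf-8")

print(escaped)
print(utf8)
```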

To print the corresponding numbers, you can split on the no-break space and process the items later:

>>> [td.get_text().split(u'\u00a0')
...  for td in soup.find_all('td', 'td-number-nowidth')]
[[u' 10', u'115 '], [u' 4', u'635 (46%) '], [u' 5', u'276 (52%) ']]

Or you can replace it with a comma:

>>> [td.get_text().replace(u'\u00a0', ', ').encode('ascii').strip()
...  for td in soup.find_all('td', 'td-number-nowidth')]
['10, 115', '4, 635 (46%)', '5, 276 (52%)']
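
If the goal is the integer 10115 that the question asks for, the no-break space can be removed before parsing. The following is a minimal sketch for Python 3 (an assumption); parse_stat is a hypothetical helper, and the input strings mimic what get_text() returned above.

```python
import re


def parse_stat(text):
    """Parse a cell such as u' 4\\xa0635 (46%) ' into (value, percent).

    `percent` is None when the cell has no parenthesised percentage.
    """
    # Drop the NO-BREAK SPACE used as a thousands separator.
    cleaned = text.replace("\u00a0", "").strip()
    match = re.match(r"(\d+)(?:\s*\((\d+)%\))?$", cleaned)
    if match is None:
        raise ValueError("unexpected cell format: %r" % text)
    value = int(match.group(1))
    percent = int(match.group(2)) if match.group(2) else None
    return value, percent


print(parse_stat(" 10\u00a0115 "))
print(parse_stat(" 4\u00a0635 (46%) "))
```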

score 0 · Accepted Answer

Have you tried anything?

This might work.

a = ['10Â 115', '4Â 635 (46%)', '5Â 276 (52%)']
for b in a:
    # str.replace returns a new string, so print the result
    print b.replace("\xc3\x82 ", '')

Output:

10115
4635 (46%)
5276 (52%)

Depending on how constant it is (that is, if it is always only the a with the dot), there might be a better way: replace everything from the \ up to the space with an empty string.
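
One way to make that replacement less dependent on the exact bytes is a regular expression that removes whatever sits between two digits. This is only a sketch for Python 3 (an assumption), not the answerer's code; strip_digit_separators is a hypothetical name, and the pattern assumes separators never contain digits or an opening parenthesis.

```python
import re

# Any run of non-digit, non-"(" characters that has a digit on both sides.
SEPARATOR = re.compile(r"(?<=\d)[^\d(]+(?=\d)")


def strip_digit_separators(value):
    """Remove separator debris between digit groups, e.g. '10Â 115' -> '10115'."""
    return SEPARATOR.sub("", value)


samples = ["10Â 115", "4Â 635 (46%)", "5Â 276 (52%)"]
cleaned = [strip_digit_separators(s) for s in samples]
print(cleaned)
```

The space before "(46%)" is left alone because the character that follows it is not a digit.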
