python - Python3: Handling UTF8-incompatible characters in CSV output

Question

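I'm using Python 3.2 and have SQL output that I'm writing to CSV files with a 'Name' identifier and 'Specifics'. For some data from China, people's names (and therefore Chinese characters) are being inserted. I've done my best to read through the Unicode/decoding documentation, but I'm at a loss as to how to fix or remove these characters wholesale, inline, within Python.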

I'm running through the file like this:

import csv, os, os.path
rfile = open('nonbillabletest2.csv','r',newline='')
dataread= csv.reader(rfile)
trash=next(rfile) #ignores the header line in csv:

#Process the target CSV by creating an output with a unique filename per CompanyName
for line in dataread:
    [CompanyName,Specifics] = line
    #Check that a target csv does not exist
    if os.path.exists('test leads '+CompanyName+'.csv') < 1:
        wfile= open('test leads '+CompanyName+'.csv','a')
        datawrite= csv.writer(wfile, lineterminator='\n')
        datawrite.writerow(['CompanyName','Specifics']) #write new header row in each file created
        datawrite.writerow([CompanyName,Specifics])
wfile.close()    
rfile.close()

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Matt\Dropbox\nonbillable\nonbillabletest.py", line 26, in <module>
    for line in dataread:
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1886: character maps to <undefined>

Examining the file contents shows that there are clearly some non-UTF8 characters:

print(repr(open('nonbillabletest2.csv', 'rb').read()))

b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x86\xac\xff; \r\neGENTIC,
\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\neGENTIC,\x86\xac\xff; \r\n'

Adding 'encoding=utf8' does not solve the problem. I was able to remove individual characters with ...replace('\x86\xac\xff', '')), but I would have to do this for every character I encounter, which is not efficient.

If there's a SQL solution, that would be fine too. Please help!

Update: I removed the characters using string.printable, as suggested. I hit one more error because the 'contents' section always ended with one final empty line. Adding an if len=0 check took care of that, though.
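A minimal sketch of the trailing-line issue described in the update, assuming the filtered text ends with '\r\n' as the sample data does. The sample string and the name rows are illustrative only, not taken from the original script. Splitting on '\r\n' leaves an empty string at the end, which csv.reader returns as an empty row, so unpacking it into two names fails unless empty rows are skipped.

```python
import csv

# Illustrative filtered contents (hypothetical sample, ends with '\r\n')
contents = 'CompanyName,Specifics\r\neGENTIC,; \r\n'

lines = contents.split('\r\n')  # the last element is an empty string

# csv.reader yields [] for the empty string, so filter empty rows out
rows = [row for row in csv.reader(lines) if row]

for row in rows:
    print(row)
```

The `if row` filter plays the same role as the length check mentioned in the update.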

Thanks for the quick help!

score 1 · Accepted Answer

So nonbillabletest2.csv is not encoded in UTF-8.
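One way to confirm this is to try decoding the raw bytes as UTF-8 and see whether it fails. The following is only a sketch: the helper name is_valid_utf8 is hypothetical, and the sample bytes are a shortened excerpt of the dump shown in the question.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# Shortened excerpt of the bytes printed in the question
sample = b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\n'

print(is_valid_utf8(sample))
print(is_valid_utf8(b'CompanyName,Specifics\r\n'))
```

For a real file, the same check could be applied to open('nonbillabletest2.csv', 'rb').read().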

You can either:

Fix it upstream. Make sure the data is properly encoded as UTF-8, as you expect. This may be the "SQL solution" you are referring to.

Remove all the non-ASCII characters beforehand (for purists this corrupts your data, but from what you said, that seems to be acceptable to you):

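import csv, os, string

# Read the raw bytes so that no decoding error can occur
rfile = open('nonbillabletest2.csv', 'rb')
rbytes = rfile.read()
rfile.close()

# Keep only printable ASCII characters (string.printable already includes whitespace)
contents = ''
for b in rbytes:
  if chr(b) in string.printable:
    contents += chr(b)

# The file ends with '\r\n', so the split leaves a trailing empty string,
# which csv.reader returns as an empty row
dataread = csv.reader(contents.split('\r\n'))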
....
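As an alternative sketch (not the approach used above), the same kind of stripping can be done by decoding as ASCII with errors='ignore', which silently drops every byte that is not ASCII. The function name read_ascii_rows is hypothetical and the sample bytes are a shortened excerpt of the dump in the question. Unlike the string.printable filter, this keeps ASCII control characters.

```python
import csv
import io


def read_ascii_rows(raw: bytes):
    """Decode bytes as ASCII, dropping non-ASCII bytes, and parse as CSV."""
    text = raw.decode('ascii', errors='ignore')
    # newline='' leaves line endings untouched, as the csv module expects
    reader = csv.reader(io.StringIO(text, newline=''))
    # Skip any empty rows so that two-column unpacking does not fail
    return [row for row in reader if row]


# Shortened excerpt of the bytes printed in the question
raw = b'CompanyName,Specifics\r\neGENTIC,\x86\xac\xff; \r\neGENTIC,\x91\x9d?; \r\n'

for row in read_ascii_rows(raw):
    print(row)
```

When reading directly from disk, open('nonbillabletest2.csv', 'r', encoding='ascii', errors='ignore', newline='') should give csv.reader the same filtered text without the manual byte loop.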
