python - Read and write CSV files including Unicode with Python 2.7

Question

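I am new to Python, and I have a question about how to use Python to read and write CSV files. My file contains German, French, and other languages. With my code, the file is read correctly in Python, but when I write it to a new CSV file, the Unicode text turns into strange characters.

The data looks like this:
[image]

And my code is: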

import csv

f=open('xxx.csv','rb')
reader=csv.reader(f)

wt=open('lll.csv','wb')
writer=csv.writer(wt,quoting=csv.QUOTE_ALL)

wt.close()
f.close()

The result looks like this:
[image]

How can I solve this problem?

score 58 · Accepted Answer

Make sure you encode and decode as appropriate.

This example round-trips some sample text through a UTF-8 encoded csv file and back out to demonstrate:

# -*- coding: utf-8 -*-
import csv

tests = {'German': [u'Straße', u'auslösen', u'zerstören'],
         'French': [u'français', u'américaine', u'épais'],
         'Chinese': [u'中國的', u'英語', u'美國人']}

# Python 2's csv module works on byte strings, so open the file in
# binary mode and encode every cell to UTF-8 before writing it.
with open('/tmp/utf.csv', 'wb') as fout:
    writer = csv.writer(fout)
    writer.writerow(tests.keys())
    for row in zip(*tests.values()):
        writer.writerow([s.encode('utf-8') for s in row])

# Read the bytes back and decode each cell to unicode for display.
with open('/tmp/utf.csv', 'rb') as fin:
    reader = csv.reader(fin)
    for row in reader:
        fmt = u'{:<15}' * len(row)
        print fmt.format(*[s.decode('utf-8') for s in row])

Prints:

German         Chinese        French         
Straße         中國的            français       
auslösen       英語             américaine     
zerstören      美國人            épais
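
To see what "encode and decode as appropriate" means at the byte level, here is a minimal sketch. It assumes the strange characters come from UTF-8 bytes being decoded with a different codec (cp1252 is picked here only as an example), and the variable names are illustrative. It uses \u escapes so it should run unchanged on Python 2.7 and Python 3.

```python
# Sketch only: `text`, `raw`, `round_tripped` and `garbled` are illustrative names.
text = u'Stra\u00dfe'            # u'Straße' as a unicode string

# unicode -> bytes: this is what actually ends up in the file.
raw = text.encode('utf-8')       # 7 bytes, because the sharp s needs two

# Decoding with the same codec restores the original text.
round_tripped = raw.decode('utf-8')

# Decoding with a different codec (assumed: cp1252) silently
# produces different characters instead of raising an error.
garbled = raw.decode('cp1252')
```

The point is that the bytes in the file can be correct while whatever reads them picks the wrong codec.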

score 30 · Accepted Answer

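At the end of the csv module documentation there is an example that demonstrates how to deal with Unicode. The code below is copied directly from that example. Note that the strings read or written will be Unicode strings. Don't pass a byte string to UnicodeWriter.writerows, for instance.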

import csv,codecs,cStringIO

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

with open('xxx.csv','rb') as fin, open('lll.csv','wb') as fout:
    reader = UnicodeReader(fin)
    writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for line in reader:
        writer.writerow(line)

Input (UTF-8 encoded):

American,美国人
French,法国人
German,德国人

Output:

"American","美国人"
"French","法国人"
"German","德国人"

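The classes above default to encoding="utf-8-sig" and use an incremental encoder in UnicodeWriter. The following sketch illustrates what that combination does to the bytes. It is not part of the documentation example, and the variable names are made up for illustration.

```python
import codecs

text = u'American,\u7f8e\u56fd\u4eba\r\n'

# 'utf-8-sig' is plain UTF-8 preceded by a byte order mark (BOM).
one_shot = text.encode('utf-8-sig')

# An incremental encoder remembers that the BOM was already emitted,
# so only the first chunk written to the stream carries it.
encoder = codecs.getincrementalencoder('utf-8-sig')()
first = encoder.encode(text)
second = encoder.encode(text)

# Reading with 'utf-8-sig' strips the leading BOM again.
decoded = (first + second).decode('utf-8-sig')
```

Calling text.encode('utf-8-sig') once per row would instead put a BOM in front of every row, which is presumably why the writer keeps a single encoder object.
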
score 2 · Accepted Answer

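I couldn't comment on Mark's answer above, but I made just one modification, which fixed the error that occurred when the data in a cell was not unicode (for example float or int data). I replaced the corresponding line in the UnicodeWriter class with "self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])", so that it became: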

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") if type(s)==types.UnicodeType else s for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

You will also need to "import types".
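
The per-cell check can also be looked at in isolation. This is a rough sketch, not the code from the answer: encode_cell and text_type are hypothetical names, and isinstance is used in place of the types.UnicodeType comparison so that the snippet also runs on Python 3.

```python
# Hypothetical helper; the names are not from the csv module.
text_type = type(u'')   # `unicode` on Python 2, `str` on Python 3

def encode_cell(value):
    """Return UTF-8 bytes for text cells and pass other values through."""
    if isinstance(value, text_type):
        return value.encode('utf-8')
    return value

row = [u'caf\u00e9', 3.5, 42]
encoded_row = [encode_cell(cell) for cell in row]
```

Numbers are left alone here because csv.writer converts non-string cells to strings itself when it writes the row.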

score 2 · Accepted Answer

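I had the very same issue. The answer is that you are already doing it right. It is a problem with MS Excel. Try opening the file in another editor and you will notice that your encoding was already successful. To make MS Excel happy, move from UTF-8 to UTF-16. This should work: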

import codecs
import csv
import StringIO

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel_tab, encoding="utf-16", **kwds):
        # Redirect output to a queue
        self.queue = StringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f

        # Force BOM: write it once, at the very start of the stream
        if encoding == "utf-16":
            f.write(codecs.BOM_UTF16)

        self.encoding = encoding

    def writerow(self, row):
        # Modified from original: now using unicode(s) to deal with e.g. ints
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = data.encode(self.encoding)

        # strip the BOM that encode("utf-16") adds to every chunk
        if self.encoding == "utf-16":
            data = data[2:]

        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
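
The "force BOM" and "strip BOM" steps may look odd, so here is a small sketch of the byte handling they rely on. The variable names are illustrative and the sample line is made up. It only exercises the codecs and does not involve Excel.

```python
import codecs

line = u'German\t\u5fb7\u56fd\u4eba\r\n'

# encode('utf-16') prepends a 2-byte BOM in native byte order every time.
encoded = line.encode('utf-16')
has_bom = encoded.startswith(codecs.BOM_UTF16)

# Dropping those 2 bytes leaves only the UTF-16 code units for the text.
body = encoded[2:]

# One BOM at the start followed by BOM-less chunks is a valid UTF-16 stream.
stream = codecs.BOM_UTF16 + body + body
decoded = stream.decode('utf-16')
```

Without the stripping step, every row would start with its own BOM in the middle of the file.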
