python - utf-16-le BOM csv file

Question

I am downloading several CSV files from the Play Store (stats etc.) and want to process them with Python.

cromestant@jumphost-vpc:~/stat_dev/bime$ file -bi stats/installs/*
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le
text/plain; charset=utf-16le

As you can see, they are all utf-16le.

I have some Python 2.7 code that works on some of the files and fails on others:

import codecs
.
.
fp = codecs.open(dir_n + '/' + file_n, 'r', "utf-16")
for line in fp:
    # write to mysql db

This works until it hits the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 10: ordinal not in range(128)

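What is the proper way to do this? I have seen "re-encoding" approaches that use the csv module and so on, but the csv module does not handle encodings by itself, so that seems like overkill when all I want is to dump the data into a database.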
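
For reference, the traceback above is an encode error, not a decode error: reading the UTF-16 file succeeds, and the failure presumably happens later, when a unicode line containing u'\xf3' is implicitly converted to a byte string with the default ascii codec. A minimal sketch of that in isolation, written in Python 3 syntax, with a made-up cell value and a hypothetical helper name:

```python
# Sketch only: "text" is a made-up cell value, try_encode is a hypothetical helper.
text = "Instalaci\xf3n"  # contains U+00F3, the character named in the traceback


def try_encode(s, encoding):
    """Return the encoded bytes, or the exception if the codec cannot encode s."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc


ascii_result = try_encode(text, "ascii")  # same failure as in the traceback
utf8_result = try_encode(text, "utf-8")   # an explicit UTF-8 encode succeeds

print(type(ascii_result).__name__)
print(utf8_result)
```

If that is what is happening, encoding explicitly (or keeping the data as unicode all the way to the database driver) should avoid the implicit ascii step.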

score 4 · Accepted Answer

Have you tried codecs.EncodedFile?

import codecs
import csv

with open('x.csv', 'rb') as f:
    # Transcode on the fly: the file holds UTF-16-LE, csv.reader sees UTF-8 bytes
    g = codecs.EncodedFile(f, 'utf8', 'utf-16le', 'ignore')
    c = csv.reader(g)
    for row in c:
        print row
        # and if you want to use unicode instead of str:
        row = [unicode(cell, 'utf8') for cell in row]

score 3 · Accepted Answer

What is the proper way to do this?

The proper way is to use Python3, whose Unicode support is much more sensible.
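
A minimal sketch of what that could look like in Python3, assuming the files start with a BOM as the title suggests. The function name read_utf16_csv and the sample data are made up for illustration; the sample file is only there to make the snippet self-contained:

```python
import csv
import os
import tempfile


def read_utf16_csv(path):
    """Return all rows of a UTF-16 CSV file as lists of str."""
    # The "utf-16" codec uses the BOM to pick the byte order and strips it;
    # newline="" is what the csv docs recommend for files passed to csv.reader.
    with open(path, "r", encoding="utf-16", newline="") as fp:
        return list(csv.reader(fp))


# Build a small BOM-prefixed UTF-16-LE sample file (made-up data).
sample = "Date,Package,Installs\r\n2013-01-01,Instalaci\xf3n,10\r\n"
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as out:
    out.write(b"\xff\xfe" + sample.encode("utf-16-le"))

rows = read_utf16_csv(sample_path)
os.remove(sample_path)

for row in rows:
    print(row)  # each cell is already a str; write to the database here instead
```

Here the text layer does the decoding, so the csv module never sees raw bytes and no re-encoding step is needed.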

As a workaround, if you are allergic to Python3 for some reason, the best compromise is to wrap csv.reader(), like this:

import codecs
import csv

def to_utf8(fp):
    for line in fp:
        yield line.encode("utf-8")

def from_utf8(fp):
    for line in fp:
        yield [column.decode('utf-8') for column in line]

with codecs.open('utf16le.csv','r', 'utf-16le') as fp:
    reader = from_utf8(csv.reader(to_utf8(fp)))
    for line in reader:
        #"line" is a list of unicode strings
        #write to mysql db
        print line
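
One caveat with the 'utf-16le' codec, given that the title mentions a BOM: as far as I know, the endian-specific codecs do not strip a leading BOM, so the first column of the first row may come back with u'\ufeff' in front of it, whereas plain 'utf-16' consumes the BOM. A small sketch of the difference, using a made-up header line:

```python
import codecs

# Made-up header line, prefixed with the little-endian BOM (bytes FF FE).
raw = codecs.BOM_UTF16_LE + u"Date,Installs\r\n".encode("utf-16-le")

with_le = raw.decode("utf-16-le")  # byte order fixed up front, BOM kept as U+FEFF
with_auto = raw.decode("utf-16")   # BOM read to detect the byte order, then dropped

print(repr(with_le[0]))
print(repr(with_auto[0]))
```

If the files do carry a BOM, opening them with 'utf-16' or stripping u'\ufeff' from the first cell should take care of it.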
