python - Python encoding UTF-8

Question

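I am running some scripts in Python. I build a string that I save to a file. This string gets a lot of data from the directory tree and the file names. According to convmv, my whole tree is in UTF-8.

I want to keep everything in UTF-8 because I will save it to MySQL afterwards. For now, in MySQL, which is in UTF-8, I have had problems with some characters (such as é or è - I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did it like this.

My script begins with this: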

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 def createIndex():
     import codecs
     toUtf8=codecs.getencoder('UTF8')
     #lot of operations & building indexSTR the string who matter
     findex=open('config/index/music_vibration_'+date+'.index','a')
     findex.write(codecs.BOM_UTF8)
     findex.write(toUtf8(indexSTR)) #this bugs!

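And when I run it, here is the answer: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I see that in my file the accents are written correctly. After creating this file, I read it and write it into MySQL. But, for a reason I don't understand, I have encoding problems. My MySQL database is in utf8, or seems to be: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: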

#!/usr/bin/python
# -*- coding: utf-8 -*-

def saveIndex(index,date):
    import MySQLdb as mdb
    import codecs

    sql = mdb.connect('localhost','admin','*******','music_vibration')
    sql.charset="utf8"
    findex=open('config/index/'+index,'r')
    lines=findex.readlines()
    for line in lines:
        if line.find('#artiste') != -1:
            artiste=line.split('[:::]')
            artiste=artiste[1].replace('\n','')

            c=sql.cursor()
            c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
            nbr=c.fetchone()
            if nbr[0]==0:
                c=sql.cursor()
                iArt+=1
                c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8'))

And artists that are displayed correctly in the file are written badly into the database. What is the problem?

score 62 · Accepted Answer

You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:

>>> data = u'\u00c3'            # Unicode data
>>> data = data.encode('utf8')  # encoded to UTF-8
>>> data
'\xc3\x83'
>>> data.encode('utf8')         # Try to *re*-encode it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode already-encoded data.
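To see why the error names the 'ascii' codec, it can help to spell out what Python 2 does behind the scenes when .encode() is called on a byte string. The following is a sketch written for Python 3, where the hidden step has to be made explicit. The helper name py2_style_encode is hypothetical and only mimics the behaviour described above.

```python
def py2_style_encode(byte_string, encoding='utf8'):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Step 1: the byte string is first turned into unicode using the
    # default codec, which is ASCII. This is the step that fails.
    text = byte_string.decode('ascii')
    # Step 2: only then is the unicode value encoded to the target encoding.
    return text.encode(encoding)


already_encoded = u'\u00c3'.encode('utf8')  # the two bytes C3 83

try:
    py2_style_encode(already_encoded)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

Pure ASCII bytes pass through step 1 untouched, which is why this kind of bug tends to stay hidden until the first accented character shows up.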

If you instead build up unicode values, then you would indeed have to encode those to be writable to a file. You'd want to use codecs.open() instead, which returns a file object that will encode unicode values to UTF-8 for you.
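A minimal sketch of that approach, assuming the index string has been built as a unicode value. The sample content and the temporary path are made up for illustration, and the code runs on Python 2 and 3 alike.

```python
import codecs
import os
import tempfile

# Hypothetical unicode value containing an accented character.
index_str = u'#artiste[:::]Am\u00e9lie\n'

# Hypothetical temporary path, used so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'example.index')

# codecs.open() returns a file object that encodes unicode to UTF-8 on write,
# so no explicit .encode() call and no BOM are needed.
with codecs.open(path, 'a', 'utf8') as findex:
    findex.write(index_str)

# Reading the raw bytes back shows the accent stored as the two bytes C3 A9.
with open(path, 'rb') as raw:
    stored_bytes = raw.read()

print(repr(stored_bytes))
```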

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
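For reference, here is a small sketch of what the BOM actually is and how it leaks into the text when the file is read back as plain UTF-8. The sample text is made up, and the 'utf-8-sig' codec is mentioned only as one standard way to strip a BOM that is already present.

```python
import codecs

text = u'\u00e9t\u00e9'
with_bom = codecs.BOM_UTF8 + text.encode('utf8')

# The BOM is three extra bytes at the very start of the data.
bom_bytes = codecs.BOM_UTF8

# A plain UTF-8 decode keeps the BOM as a U+FEFF character in the text...
decoded_plain = with_bom.decode('utf8')

# ...whereas the 'utf-8-sig' codec strips it when present.
decoded_sig = with_bom.decode('utf-8-sig')

print(repr(bom_bytes), repr(decoded_plain), repr(decoded_sig))
```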

For your MySQL insert problem, you need to do two things:

Add charset='utf8' to your MySQLdb.connect() call.

Use unicode objects, not str objects, when querying or inserting, but use SQL parameters so the MySQL connector can do the right thing for you:

artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))

It may actually work better if you used codecs.open() to decode the contents automatically instead:

import codecs
import MySQLdb as mdb

sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
artists_inserted = 0

# `index` is the index file name, as passed to saveIndex() in the question
with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
    for line in findex:
        if u'#artiste' not in line:
            continue

        artiste=line.split(u'[:::]')[1].strip()

        # the lookup and insert run once per matching line, so they belong inside the loop
        cursor = sql.cursor()
        cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
        if not cursor.fetchone()[0]:
            cursor = sql.cursor()
            cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
            artists_inserted += 1
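The parameterized pattern above can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module and an in-memory database purely so that it is self-contained; that is an assumption of the example, not what the question uses. Note that sqlite3 uses ? as its placeholder where MySQLdb uses %s. The sample lines and the insert_artists name are hypothetical.

```python
import sqlite3


def insert_artists(conn, lines):
    """Insert each '#artiste' entry once; return the number of rows added."""
    inserted = 0
    for line in lines:
        if u'#artiste' not in line:
            continue
        artiste = line.split(u'[:::]')[1].strip()
        cursor = conn.cursor()
        # The value is passed separately from the SQL text, so the driver
        # takes care of quoting and encoding.
        cursor.execute('SELECT COUNT(id) FROM artistes WHERE nom=?', (artiste,))
        if not cursor.fetchone()[0]:
            cursor.execute(
                'INSERT INTO artistes(nom, status, path) VALUES(?, 99, ?)',
                (artiste, artiste + u'/'))
            inserted += 1
    conn.commit()
    return inserted


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE artistes('
             'id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)')

# Hypothetical index lines, already decoded to unicode.
sample_lines = [
    u'#artiste[:::]Am\u00e9lie\n',
    u'#album[:::]Something\n',
    u'#artiste[:::]Am\u00e9lie\n',       # duplicate, skipped by the COUNT check
    u'#artiste[:::]No\u00ebl "Live"\n',  # embedded quotes are safe as a parameter
]
artists_inserted = insert_artists(conn, sample_lines)
rows = conn.execute('SELECT nom, path FROM artistes ORDER BY id').fetchall()
print(artists_inserted, rows)
```

The third sample name contains double quotes, which would have broken the string-concatenated query from the question; passed as a parameter it is stored unchanged.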

You may want to brush up on Unicode, UTF-8 and encodings. I can recommend the following articles:

The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

score 3 · Answer

Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information: What is the simple way to convert some string (utf-8 or otherwise) to a simple ASCII string in Python?
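As one illustration of where .encode() can surprise, encoding text to ASCII fails on any accented character unless an error handler is given. This sketch uses only the standard errors argument of encode() and unicodedata.normalize(); whether these match the approaches discussed in the linked thread is an assumption, and the sample name is made up.

```python
import unicodedata

name = u'Am\u00e9lie'

# With the default 'strict' handler, a non-ASCII character raises an error.
try:
    name.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'replace' substitutes a question mark, 'ignore' silently drops the character.
replaced = name.encode('ascii', 'replace')
ignored = name.encode('ascii', 'ignore')

# NFKD splits e-acute into 'e' plus a combining accent; 'ignore' then drops
# only the accent, leaving the base letter.
stripped = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore')

print(strict_failed, replaced, ignored, stripped)
```

All three non-strict variants lose information, so they only make sense when the destination really cannot hold anything beyond ASCII.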
