python - Disadvantages of using encode('utf-8') when reading strings from Excel in Python

Question

I am reading a large amount of data from an Excel spreadsheet (and then reformatting and rewriting it) using the following general structure:

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")

Here x and y are arbitrary columns, with x being less arbitrary: it contains utf-8 characters.

So far I have only been using .encode('utf-8') on cells where I know an error will occur otherwise, or where I foresee an error without it.

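My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells, even where it is unnecessary? Efficiency is not an issue. The main concern is that the script keeps working even if a utf-8 character turns up somewhere it should not be. If no errors would occur when I lump ".encode('utf-8')" onto every cell read, I will probably end up doing that.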

score 4 · Accepted Answer

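The XLRD documentation clearly states: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, they contain Unicode code points. You therefore need to keep the contents of these cells as Unicode inside Python and not convert them to ASCII (which is what the str() function does). Use the code below: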

from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
# Start at row 1, as z = i + 1 did, but stop at the last row:
# indexing row nrows would raise IndexError.
for z in range(1, sheettwo.nrows):
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
out.close()
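
For readers on Python 3, here is a minimal sketch of the same idea: keep everything as text and let the file object do the encoding. It also shows one concrete downside of a blanket .encode('utf-8'): xlrd returns numeric cells as floats, and floats have no .encode method. The to_text helper and the sample row are hypothetical stand-ins, not part of xlrd or of the code above.

```python
import io
import os
import tempfile


def to_text(value):
    """Return a cell value as text without encoding it (hypothetical helper)."""
    # Text passes through unchanged; floats and other types go through str().
    return value if isinstance(value, str) else str(value)


# Hypothetical row standing in for values read from a sheet.
row = [u"caf\u00e9", 42.0, u"\u0400"]

# A blanket .encode('utf-8') fails as soon as it meets a numeric cell.
try:
    [v.encode("utf-8") for v in row]
    blanket_encode_ok = True
except AttributeError:
    blanket_encode_ok = False

# Encode once, at the file boundary, instead of per cell.
path = os.path.join(tempfile.mkdtemp(), "output.file")
with io.open(path, "w", encoding="utf-8") as out:
    out.write(u" | ".join(to_text(v) for v in row) + u"\n")

# Read the file back to confirm the text survived the round trip.
with io.open(path, "r", encoding="utf-8") as check:
    written = check.read()
```

Encoding at the file boundary means no individual cell needs special treatment, which is the situation the question was trying to reach.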

score 0 · Answer

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.

(1) Avoiding the SO horizontal scrollbar makes it more likely that people will read your code. Try wrapping your lines, for example:

toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    unicode(sheettwo.cell(z,y).value),
    u" more formatting! ",
    unicode(sheettwo.cell(z,x).value),
    u" and done\n"
    ])
out.write(toprint.encode('UTF-8'))

(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:

>>> unicode(123456.78901234567)
u'123456.789012'

If that is a problem, you might like to try something like this:

>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
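
To make the precision point concrete, here is a small sketch. It assumes that the 12-digit output shown above corresponds to the '%.12g' format, which as far as I know is what Python 2 uses for str() and unicode() on floats. The variable names are illustrative only.

```python
x = 123456.78901234567

# '%.12g' keeps 12 significant digits, matching the truncated output above.
short = "%.12g" % x

# repr() keeps enough digits for the float to round-trip exactly.
full = repr(x)

# The short form no longer converts back to the original value.
lost_precision = float(short) != x
round_trips = float(full) == x
```

If the output file is ever read back into numbers, the repr() form is the safer choice for float cells.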

(3) xlrd builds Cell objects on the fly when they are requested, so going through cell() just to read the value is the slower route:

sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
