python - Processing French text in Python

Question

I am trying to read French text and run a word-frequency analysis on it. I want to keep the characters with umlauts and other diacritics, so I did this as a test:

>>> import codecs
>>> f = codecs.open('file','r','utf-8')
>>> for line in f:
...     print line
...

Faites savoir à votre famille que vous êtes en sécurité.

So far, so good. However, I have a list of French files that I iterate over in the following way:

import codecs,sys,os

path = sys.argv[1]
for f in os.listdir(path):
    french = codecs.open(os.path.join(path,f),'r','utf-8')
    for line in french:
        print line

Here I get the following error:

rdholaki74: python TestingCodecs.py ../frenchResources | more
Traceback (most recent call last):
  File "TestingCodecs.py", line 7, in <module>
    print line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 14: ordinal not in range(128)

Why does the same file throw an error when it is passed in as an argument, but not when it is specified explicitly in the code?

Thanks.

score 2 · Accepted Answer

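Because you have misidentified the cause. The error has nothing to do with how the file name reaches the script; it comes from piping the output (`| more`). When stdout is a pipe, Python cannot detect which encoding to use and falls back to ASCII, which is why `print` fails on u'\xe0'. If stdout is not a TTY, you need to encode the text as UTF-8 yourself before printing it.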
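A minimal sketch of that manual encoding step, not the only way to do it. The helper names `encode_for_stream` and `write_line` are made up for illustration, and the UTF-8 fallback is an assumption about what the consumer of the pipe expects:

```python
import sys


def encode_for_stream(text, stream, fallback="utf-8"):
    """Encode unicode text to bytes for the given output stream."""
    # Python 2 leaves stream.encoding as None when stdout is a pipe or a
    # file, so use the (assumed) fallback encoding in that case.
    encoding = getattr(stream, "encoding", None) or fallback
    return text.encode(encoding)


def write_line(text, stream=None):
    """Write one line of unicode text as explicitly encoded bytes."""
    stream = sys.stdout if stream is None else stream
    data = encode_for_stream(text + "\n", stream)
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2 file objects accept byte strings directly.
    getattr(stream, "buffer", stream).write(data)
```

In the loop from the question, `print line` would become something like `write_line(line.rstrip(u"\n"))`, so the output is encoded explicitly whether or not stdout is a terminal.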

score 2 · Accepted Answer

It is a print error due to redirection. You could use:

PYTHONIOENCODING=utf-8 python ... | ...

Specify another encoding if your terminal doesn't use UTF-8.
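To check what Python actually picked for stdout, with or without `PYTHONIOENCODING` set, a small diagnostic along these lines may help. The function name `describe_stdout` is hypothetical; it only reads the standard `isatty()` method and `encoding` attribute of the stream:

```python
import sys


def describe_stdout(stream=None):
    """Report whether the stream is a terminal and which encoding it uses."""
    stream = sys.stdout if stream is None else stream
    return {
        "isatty": stream.isatty(),
        # None here means Python could not determine an encoding.
        "encoding": getattr(stream, "encoding", None),
    }
```

Writing the result to `sys.stderr` keeps it visible while stdout is piped to `more`, which makes it easy to compare a plain run, a piped run, and a piped run with the environment variable set.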
