python - Reading and writing files with umlauts in Python (HTML to txt)

Question

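I know this has been asked several times, but I think I am doing everything right and it still doesn't work, so I'm posting before I go clinically insane. This is the code (it is supposed to convert an HTML file to a txt file, leaving out certain lines):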

fid = codecs.open(htmlFile, "r", encoding = "utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()

stripped = strip_tags(unicode(htmlText))   ### strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []

for line in lines: # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)

result=  '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'

fid = codecs.open(outfile, "w", encoding = 'utf-8')
fid.write(result)
fid.close()

Thanks!

score 0 · Accepted Answer

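I'm not sure, but

'\n'.join(out)

uses a non-Unicode string (a plain old byte string), so the join may fall back to a codec other than UTF-8 (in Python 2 the implicit conversion uses ASCII). Try:

u'\n'.join(out)

and make sure you are using Unicode objects everywhere.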
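As a minimal sketch of that advice: the function name `filter_lines` and the sample text are made up for illustration, while the filter rules are copied from the question. Keeping both the input and the separator as Unicode leaves no byte string for Python 2 to convert implicitly. The `u''` prefix is also accepted on Python 3.3+, where it changes nothing.

```python
def filter_lines(text):
    """Drop short lines and lines containing *, (, @ or :, then rejoin.

    `text` is expected to be a Unicode string
    (unicode on Python 2, str on Python 3).
    """
    kept = []
    for line in text.split(u'\n'):
        if len(line) < 6:
            continue
        if any(ch in line for ch in u'*(@:'):
            continue
        kept.append(line)
    # A Unicode separator keeps the joined result Unicode.
    return u'\n'.join(kept)


# Hypothetical sample input with umlauts, written with escapes so the
# snippet does not depend on the source file encoding.
sample = (u'Sch\xf6ne Gr\xfc\xdfe aus K\xf6ln\n'
          u'short\n'
          u'name: value\n'
          u'Noch eine \xdcbung')
result = filter_lines(sample)
```

Here `result` keeps the first and last lines only, and it is still a Unicode object, so it can be handed straight to a file opened with `codecs.open(..., encoding='utf-8')`.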

score 0 · Accepted Answer

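This is a complete guess, since you haven't said what the actual problem is.

What does the strip_tags() function return? Is it returning a Unicode object or a byte string? If it is the latter, you may run into a decoding problem when you try to write it to the file. For example, if strip_tags() returns a UTF-8 encoded byte string: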

>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.

>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s)    # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

If this is what you are seeing, you need to make sure that you pass Unicode to fid.write(result), which probably means making sure that strip_tags() returns Unicode.
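One way to guarantee that only Unicode reaches `fid.write()` is to normalize the value just before writing. The sketch below is an assumption about how that might look, not part of the original code: `to_text` is a hypothetical helper, the temporary file name is made up, and any byte string is assumed to be UTF-8 encoded.

```python
import codecs
import os
import tempfile


def to_text(value, encoding='utf-8'):
    """Return `value` as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):  # `bytes` is the same type as `str` on Python 2
        return value.decode(encoding)
    return value


# Simulate a strip_tags() that returned UTF-8 bytes instead of Unicode.
raw = u'This is \xe4 test'.encode('utf-8')

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with codecs.open(path, 'w', encoding='utf-8') as fid:
    fid.write(to_text(raw))  # always Unicode by the time it is written

# Read the file back to confirm the umlaut survived the round trip.
with codecs.open(path, 'r', encoding='utf-8') as fid:
    round_trip = fid.read()
```

Decoding at the boundary like this hides the symptom; making strip_tags() return Unicode in the first place is probably the cleaner fix.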

Also, a few other things I noticed in passing:

codecs.open() raises an IOError exception if the file cannot be opened. It does not return None, so the if not fid: test is useless. You should use try/except, ideally together with with.

try:
    with codecs.open(htmlFile, "r", encoding = "utf-8") as fid:
        htmlText = fid.read()
except IOError as e:
    # handle error
    print e

Also, data read from a file opened with codecs.open() is automatically decoded to Unicode, so calling unicode(htmlText) achieves nothing.
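Both remarks can be checked with a small sketch. The helper name `read_html` and the file names are hypothetical, and returning None on failure is just one possible way to handle the error. `codecs.open` is kept to match the question; on Python 3 the built-in `open()` accepts `encoding` directly.

```python
import codecs
import os
import tempfile


def read_html(path):
    """Return the decoded contents of `path`, or None if it cannot be opened."""
    try:
        with codecs.open(path, 'r', encoding='utf-8') as fid:
            return fid.read()  # already Unicode; no unicode() call needed
    except IOError as err:  # on Python 3, IOError is an alias of OSError
        print(err)
        return None


# Create a small UTF-8 file to read back.
tmp_dir = tempfile.mkdtemp()
html_file = os.path.join(tmp_dir, 'page.html')
with codecs.open(html_file, 'w', encoding='utf-8') as fid:
    fid.write(u'<p>\xe4nother line</p>')

text = read_html(html_file)
# Opening a missing file raises inside read_html and is reported there.
missing = read_html(os.path.join(tmp_dir, 'does-not-exist.html'))
```

The value in `text` is a Unicode object as soon as it is read, and the missing file is reported through the exception handler rather than through a falsy file object.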
