python - Decoding a UTF-8 URL in Python

Question

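I have a string like "pe%20to%C5%A3i%20mai". When I apply urllib.parse.unquote to it, I get "pe to\u0163i mai". When I try to write this to a file, I get those exact symbols instead of the expected glyph.

How can I convert the string to utf-8 so that the file contains the proper glyph instead?

Edit: I'm using Python 3.2.

Edit2: So I figured out that urllib.parse.unquote was working correctly. My actual problem is that I'm serializing to YAML, and yaml.dump seems to be what introduces the escapes. Why?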

score 4 · Accepted Answer

Update: if the output file is a yaml document, you can ignore the \u0163 in it. Unicode escapes are valid in yaml documents.

#!/usr/bin/env python3
import json

# json produces a subset of yaml
print(json.dumps('pe toţi mai')) # -> "pe to\u0163i mai"
print(json.dumps('pe toţi mai', ensure_ascii=False)) # -> "pe toţi mai"

Note: there is no \u in the last case. Both lines represent the same Python string.

yaml.dump() has a similar option: allow_unicode. Set it to True to avoid Unicode escapes.
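A minimal sketch of the allow_unicode option, assuming PyYAML is installed and importable as yaml. The dictionary and its "text" key are made up for illustration; they are not from the question.

```python
import yaml  # PyYAML, a third-party package (assumed to be installed)

data = {"text": "pe toţi mai"}  # "text" is a hypothetical key for illustration

escaped = yaml.dump(data)                       # non-ASCII is escaped by default
readable = yaml.dump(data, allow_unicode=True)  # non-ASCII is written as-is

print(escaped)
print(readable)

# Both documents load back to the same Python object.
assert yaml.safe_load(escaped) == yaml.safe_load(readable) == data
```

Either output is valid yaml, so the choice only matters if people need to read the file.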

The URL is correct. You don't need to do anything with it:

#!/usr/bin/env python3
from urllib.parse import unquote

url = "pe%20to%C5%A3i%20mai"
text = unquote(url)

with open('some_file', 'w', encoding='utf-8') as file:
    def p(line):
        print(line, file=file) # write line to file

    p(text)                # -> pe toţi mai
    p(repr(text))          # -> 'pe toţi mai'
    p(ascii(text))         # -> 'pe to\u0163i mai'

    p("pe to\u0163i mai")  # -> pe toţi mai
    p(r"pe to\u0163i mai") # -> pe to\u0163i mai
    #NOTE: r'' prefix
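One way to confirm that the file really holds UTF-8 rather than escapes is to read it back in binary mode. This is only a sketch: the temporary directory and the file name are chosen here for illustration.

```python
import os
import tempfile
from urllib.parse import unquote

text = unquote("pe%20to%C5%A3i%20mai")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "some_file")  # illustrative file name

    # Write the decoded text as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes back without decoding them.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'pe to\xc5\xa3i mai'
```

The two bytes C5 A3 are the UTF-8 encoding of ţ (U+0163), the same pair that appears percent-encoded in the URL.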

The \u0163 sequence might be introduced by a character encoding error handler:

with open('some_other_file', 'wb') as file: # write bytes
    file.write(text.encode('ascii', 'backslashreplace')) # -> pe to\u0163i mai

Or:

with open('another', 'w', encoding='ascii', errors='backslashreplace') as file:
    file.write(text) # -> pe to\u0163i mai
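For comparison, here is a short sketch of how a few of the standard error handlers treat the same string when the target encoding is ASCII. The variable names are illustrative.

```python
text = "pe toţi mai"
handlers = ("backslashreplace", "xmlcharrefreplace", "replace")

# Encode the same text to ASCII with each error handler.
results = {name: text.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print(name, encoded)
# backslashreplace b'pe to\\u0163i mai'
# xmlcharrefreplace b'pe to&#355;i mai'
# replace b'pe to?i mai'

# The default handler, "strict", raises instead of substituting.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("strict: UnicodeEncodeError")
```

Only backslashreplace produces the \u0163 form seen in the question, so seeing that sequence in a file suggests this handler (or an equivalent escaping step) was applied somewhere.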

More examples:

# introduce some more \u escapes
b = r"pe to\u0163i mai ţţţ".encode('ascii', 'backslashreplace') # bytes
print(b.decode('ascii')) # -> pe to\u0163i mai \u0163\u0163\u0163
# remove unicode escapes
print(b.decode('unicode-escape')) # -> pe toţi mai ţţţ

score 2 · Accepted Answer

Python 3

Calling urllib.parse.unquote already returns a Unicode string:

>>> import urllib.parse
>>> urllib.parse.unquote("pe%20to%C5%A3i%20mai")
'pe toţi mai'

If you don't get that result, it must be an error in your code. Please post your code.
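As a rough illustration of what unquote does with the percent-encoded bytes, the sketch below compares the raw bytes, the default UTF-8 decoding, and a deliberately wrong codec. The latin-1 case is only there to show what a mismatch looks like.

```python
from urllib.parse import unquote, unquote_to_bytes

quoted = "pe%20to%C5%A3i%20mai"

raw = unquote_to_bytes(quoted)              # percent-decoding only, no text decoding
good = unquote(quoted)                      # UTF-8 is the default encoding
bad = unquote(quoted, encoding="latin-1")   # wrong codec on purpose

print(raw)   # b'pe to\xc5\xa3i mai'
print(good)  # pe toţi mai
print(bad)   # pe toÅ£i mai
```

If the output looks like the last line, the encoding passed to unquote does not match the one used to build the URL.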

Python 2

Use decode to get a Unicode string from a byte string:

>>> import urllib2
>>> print urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')
pe toţi mai

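Note that when you write the Unicode string to a file, you have to encode it again. You can write the file as UTF-8, but you can also choose a different encoding if you prefer. Also remember to use the same encoding when you read it back from the file. The codecs module can help with specifying the encoding when reading and writing files.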

>>> import urllib2, codecs
>>> s = urllib2.unquote("pe%20to%C5%A3i%20mai").decode('utf-8')

>>> # Write the string to a file.
>>> with codecs.open('test.txt', 'w', 'utf-8') as f:
...     f.write(s)

>>> # Read the string back from the file.
>>> with codecs.open('test.txt', 'r', 'utf-8') as f:
...     s2 = f.read()

One thing that can cause confusion is that in the interactive interpreter, Unicode strings are sometimes displayed using the \uxxxx notation instead of the actual characters:

>>> s
u'pe to\u0163i mai'
>>> print s
pe toţi mai

This does not mean the string is "wrong". It is just how the interpreter works.

score 2 · Accepted Answer

Try using decode with unicode_escape.

For example:

>>> print "pe to\u0163i mai".decode('unicode_escape')
pe toţi mai
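The snippet above is Python 2 syntax. A possible Python 3 equivalent is sketched below; it assumes the escaped text is pure ASCII, because str has no decode method and the text has to pass through bytes first.

```python
# Raw string: a literal backslash followed by "u0163", not the character itself.
escaped = r"pe to\u0163i mai"

# Go through bytes, then let the unicode_escape codec interpret the escapes.
# Assumption: the input is pure ASCII. For other input the codec treats
# the bytes as Latin-1 and may mangle non-ASCII characters.
decoded = escaped.encode("ascii").decode("unicode_escape")

print(decoded)  # pe toţi mai
```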

score 1 · Accepted Answer

urllib.parse.unquote returned the correct UTF-8 string, and writing that straight to the file gave the expected result. The problem was with yaml: by default it does not encode with UTF-8.

My solution was to do:

yaml.dump(urllib.parse.unquote("pe%20to%C5%A3i%20mai"), encoding="utf-8").decode("unicode-escape")

Thanks to J.F. Sebastian and Mark Byers for asking the right questions that helped me figure out the problem.
