python - How to pass a Unicode string as an argument to urllib.urlencode()

Question

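I am using Microsoft's free translation service to translate Hindi text into English. They don't provide an API for Python, but I borrowed code from tinyurl.com/dxh6thr.

I am trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we

The 'hindi.txt' file is saved in the Unicode character set.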

>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>

The response shows that the Translator detected 'en' instead of 'hi' (for Hindi). When I check the type of the string, it shows as 'str':

>>> type(hindi_string)
<type 'str'>

For reference, here is the content of 'hindi.txt':

हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।

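I am not sure whether string.encode or string.decode applies here. If it does, what do I need to encode or decode, and from and to which encoding? What is the best way to pass a Unicode string as a urllib.urlencode argument? How can I make sure the actual Hindi characters are passed as the argument?

Thank you.

** Additional information **

I tried using codecs.open() as suggested, but I get the following error: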

>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\codecs.py", line 671, in read
    return self.reader.read(size)
  File "C:\Python27\lib\codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

Here is the output of repr(hindi_string):

>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"

score 2 · Accepted Answer

Your file is utf-16, so you need to decode the content before sending it:

hindi_string = open('hindi.txt').read().decode('utf-16')
data = { 'text' : hindi_string.encode('utf-8') }
...
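
To illustrate the decode-then-encode step without needing the file or the translator service, here is a minimal sketch. It assumes the file bytes start with the UTF-16 little-endian byte-order mark (the \xff\xfe seen in the repr output), and it uses a short three-character Hindi sample rather than the full text. The variable names raw, text, and query are only illustrative. The import fallback is there so the same snippet runs on Python 2 and Python 3.

```python
import codecs

try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Stand-in for open('hindi.txt', 'rb').read(): a UTF-16 LE byte-order mark
# followed by the UTF-16 LE bytes of a short Hindi sample.
raw = codecs.BOM_UTF16_LE + u'\u0939\u093e\u092f'.encode('utf-16-le')

# Decoding as 'utf-16' consumes the byte-order mark and yields Unicode text.
text = raw.decode('utf-16')

# urlencode percent-encodes the UTF-8 bytes of the value.
query = urlencode({'text': text.encode('utf-8')})
print(query)  # text=%E0%A4%B9%E0%A4%BE%E0%A4%AF
```

The query string now carries the UTF-8 bytes of the Hindi characters in percent-encoded form, instead of the raw UTF-16 bytes that were sent in the question.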

score 0 · Accepted Answer

Try opening the file with codecs.open and decoding it as utf-8:

import codecs

with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()
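
Since the repr output in the question begins with \xff\xfe, the same idea would presumably need encoding='utf-16' for this particular file. The sketch below is one way to check that under stated assumptions: it writes its own temporary UTF-16 file instead of reading hindi.txt, and it uses io.open in place of codecs.open (both accept an encoding argument). The names sample, path, and starts_with_bom are hypothetical.

```python
import codecs
import io
import os
import tempfile

sample = u'\u0939\u093e\u092f'  # short Hindi sample, not the full file content

# Create a temporary file saved as UTF-16 (the codec writes a byte-order mark).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(sample)

# Inspect the first two raw bytes to see whether the file starts with a UTF-16 BOM.
with io.open(path, 'rb') as f:
    raw = f.read()
starts_with_bom = raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Reading with encoding='utf-16' decodes the content and drops the BOM.
with io.open(path, encoding='utf-16') as f:
    hindi_text = f.read()

os.remove(path)
print(starts_with_bom, hindi_text == sample)
```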
