python - How to convert Unicode accented characters to pure ASCII without accents?

Question

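I'm trying to download content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t.

The problem I'm having is that the original paragraph contains all sorts of squiggly lines, reversed letters, and the like, so when I read the local file I end up with odd escape sequences such as \x85, \xa7, \x8d, and so on.

My question is: is there a way to convert all of these escape characters into their respective UTF-8 characters?

Python calling code: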

import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)

I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me, Linux people, it was a client requirement), and the wget exe is launched from a Python 2.6 script file.

score 47 · Accepted Answer

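How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert it into a standard a?

Assuming you have loaded your Unicode text into a variable called my_unicode, normalizing à into a is this simple...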

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

An explicit example...

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>

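How it works:
unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the Unicode text, which splits each accented character into its base letter followed by one or more combining marks. Then str.encode('ascii', 'ignore') converts the NFD-mapped characters to ASCII, ignoring errors, so the combining marks (and any other non-ASCII characters) are dropped and only the base letters remain.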
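To make the decomposition step visible, here is a minimal sketch written for Python 3 (the answer above targets Python 2). The helper name strip_accents is hypothetical. It drops combining marks explicitly rather than relying on encode('ascii', 'ignore'), so non-ASCII characters that carry no accent are kept instead of being silently discarded.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed character such as U+00E0 into a base
    # letter (U+0061) followed by a combining mark (U+0300).
    decomposed = unicodedata.normalize('NFD', text)
    # Keep every code point that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# Show the code points produced by decomposing a single accented letter.
print([hex(ord(ch)) for ch in unicodedata.normalize('NFD', '\u00e0')])
print(strip_accents('\u00e0\u00e0'))
```

Whether to filter combining marks or encode to ASCII depends on the goal: the filter keeps the rest of the text intact, while the encode step guarantees pure ASCII output.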

score 3 · Accepted Answer

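@Mike Pennington's solution works great, thanks to him. But when I tried that solution, I noticed that it fails for some special characters that have no NFD decomposition defined (for example, the ı character from the Turkish alphabet).

I found another solution: you can use the unidecode library for this conversion.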
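The limitation can be checked directly. A small sketch, assuming Python 3: unicodedata.decomposition returns an empty string for the dotless ı (U+0131), so NFD leaves it untouched and the ASCII encode step then drops it entirely. The function name nfd_to_ascii is only illustrative.

```python
import unicodedata

dotless_i = '\u0131'     # LATIN SMALL LETTER DOTLESS I
dotted_cap_i = '\u0130'  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# An empty string means no decomposition mapping is defined.
print(repr(unicodedata.decomposition(dotless_i)))
print(repr(unicodedata.decomposition(dotted_cap_i)))


def nfd_to_ascii(text):
    """The NFD + encode approach, wrapped as a Python 3 function."""
    normalized = unicodedata.normalize('NFD', text)
    return normalized.encode('ascii', 'ignore').decode('ascii')


# The escaped word spells the Turkish 'isik' with two dotless i letters
# and an s with cedilla; both dotless i characters are lost.
print(nfd_to_ascii('\u0131\u015f\u0131k'))
```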

>>> import unidecode
>>> example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"
>>> # decode the UTF-8 byte string into a unicode object (Python 2)
>>> utf8text = unicode(example, "utf-8")
>>> print utf8text
ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz
>>> # transliterate the unicode text to ASCII
>>> asciitext = unidecode.unidecode(utf8text)
>>> print asciitext
ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz

score 2 · Accepted Answer

I needed something like this, but to remove only accented characters and leave other special characters alone, so I wrote this small function:

# -*- coding: utf-8 -*-
import re

def remove_accents(string):
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')

    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)

    return string

I like this function because you can customize it in case you need to ignore other characters.
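As one possible variation, the same mapping can be expressed as a translation table instead of a series of regular expressions. This is a sketch assuming Python 3, where str is already Unicode; ACCENT_MAP and TABLE are illustrative names, not part of the answer above.

```python
# Each key is a group of accented letters; the value is the plain replacement.
ACCENT_MAP = {
    'àáâãäå': 'a',
    'èéêë': 'e',
    'ìíîï': 'i',
    'òóôõö': 'o',
    'ùúûü': 'u',
    'ýÿ': 'y',
}

# Build one lookup table: code point of accented letter -> plain letter.
TABLE = {ord(ch): plain for group, plain in ACCENT_MAP.items() for ch in group}


def remove_accents(text):
    """Replace only the accented letters listed in ACCENT_MAP."""
    return text.translate(TABLE)


# Characters that are not in the table, such as the dotless i, are kept.
print(remove_accents('cr\u00e8me br\u00fbl\u00e9e, \u0131'))
```

Customizing the behavior then means editing ACCENT_MAP only, and the text is scanned once rather than once per letter group.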

score 0 · Accepted Answer

The given URL returns UTF-8, as the HTTP response clearly shows:

wget -S http://dictionary.reference.com/browse/apple?s=t
--2013-01-02 08:43:40--  http://dictionary.reference.com/browse/apple?s=t
Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Server: Apache
  Cache-Control: private
  Content-Type: text/html;charset=UTF-8
  Date: Wed, 02 Jan 2013 07:43:40 GMT
  Transfer-Encoding:  chunked
  Connection: keep-alive
  Connection: Transfer-Encoding
  Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
  Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
  Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
  Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
  Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
Length: unspecified [text/html]

Examining the saved file with vim also shows that the data is correctly encoded as UTF-8... The same is true when fetching the URL with Python.
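If the bytes on disk are UTF-8, the odd \x.. sequences in the question are most likely raw bytes that were never decoded. Below is a minimal sketch of decoding on read with io.open, which exists in Python 2.6 and later as well as Python 3. The file written here is a temporary stand-in, not the asker's actual download, and read_utf8 is a hypothetical helper name.

```python
import io
import os
import tempfile


def read_utf8(path):
    """Read a file and decode its bytes as UTF-8, returning Unicode text."""
    with io.open(path, encoding='utf-8') as handle:
        return handle.read()


# Stand-in for a downloaded page: the two bytes C3 A0 encode U+00E0 in UTF-8.
fd, sample_path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'<p>\xc3\xa0 la carte</p>')

text = read_utf8(sample_path)
os.remove(sample_path)

# The decoded text holds one character for the accented letter, not two bytes.
print(len(text))
```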

score 0 · Accepted Answer

My problem was different, but this Stack Overflow page helped me solve it: unicodedata.normalize('NFKC', 'Ｖ').encode('ascii', 'ignore') outputs b'V'.
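The fullwidth Ｖ (U+FF36) has only a compatibility decomposition, which NFD does not apply, so a compatibility form such as NFKC is needed for it. A short comparison, assuming Python 3:

```python
import unicodedata

fullwidth_v = '\uff36'  # FULLWIDTH LATIN CAPITAL LETTER V

# NFD applies canonical decompositions only, so the character survives
# normalization and is then dropped by the ASCII encode step.
nfd_result = unicodedata.normalize('NFD', fullwidth_v).encode('ascii', 'ignore')

# NFKC also applies compatibility mappings, turning it into a plain 'V'.
nfkc_result = unicodedata.normalize('NFKC', fullwidth_v).encode('ascii', 'ignore')

print(nfd_result, nfkc_result)
```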
