python - Matching Unicode characters in a Python regular expression

Question

I have read other questions on Stack Overflow, but I am still no closer to a solution. Apologies if this has already been answered, but nothing suggested there worked for me.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well so far. Next, I try a path that contains Norwegian characters (or anything else outside ASCII):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters such as øæå? I would like these characters to match in both the tag group and the filename group above.

score 49 · Accepted Answer

You need to specify the re.UNICODE flag and enter the string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is for Python 2. In Python 3, leave out the u prefix, since all strings are already Unicode.
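For comparison, here is a minimal Python 3 sketch of the same match. It assumes Python 3, where `\w` in a `str` pattern matches Unicode word characters by default, so neither the `u` prefix nor `re.UNICODE` is needed. The helper name `parse_path` is hypothetical and only wraps the pattern from the question.

```python
import re

# Pattern copied from the question. In Python 3, str patterns are
# Unicode-aware by default, so no flag is passed here.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')


def parse_path(path):
    """Return the named groups as a dict, or None when the path does not match.

    parse_path is a hypothetical helper, not part of the original post.
    """
    m = PATTERN.match(path)
    # re.match returns None on failure; checking it avoids the
    # AttributeError shown in the question.
    return m.groupdict() if m else None


print(parse_path('/by_tag/påske/øyfjell.jpg'))
# {'tag': 'påske', 'filename': 'øyfjell.jpg'}
```

Returning None instead of calling groupdict() directly also avoids the traceback from the question when a path does not match.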

score 13 · Accepted Answer

You need the UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)
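Whether the input is a byte string or a text string matters as much as the flag. The following is a small sketch, assuming Python 3 and UTF-8 encoded input, showing that a bytes pattern only treats ASCII characters as word characters, while the decoded text matches. The variable names are illustrative only.

```python
import re

# Assumption: the path arrives as UTF-8 encoded bytes (e.g. from a socket).
raw = '/by_tag/påske/øyfjell.jpg'.encode('utf-8')

# A bytes pattern can only be applied to bytes, and \w is ASCII-only there.
byte_pattern = re.compile(rb'^/by_tag/(\w+)/')
# A str pattern is applied to decoded text, and \w is Unicode-aware.
text_pattern = re.compile(r'^/by_tag/(\w+)/')

byte_match = byte_pattern.match(raw)
text_match = text_pattern.match(raw.decode('utf-8'))

print(byte_match)           # None: the bytes of "å" are not ASCII word characters
print(text_match.group(1))  # påske
```

Decoding the input once at the boundary and working with text from then on keeps the pattern itself unchanged.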

score 6 · Accepted Answer

In Python 2, you need the re.UNICODE flag and the unicode string constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+

(In the last case, the comma is a Chinese full-width comma, U+FF0C, which \w does not match.)
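The same substitutions can be sketched in Python 3, where no constructor or flag is required. This assumes Python 3. The helper `mask_words` and its `ascii_only` parameter are hypothetical names added for illustration. Passing `re.ASCII` restores ASCII-only matching, which makes the difference visible.

```python
import re


def mask_words(text, ascii_only=False):
    """Replace every run of word characters with '___'.

    mask_words and ascii_only are hypothetical names for this sketch.
    """
    # re.ASCII limits \w to [a-zA-Z0-9_]; without it, \w is Unicode-aware.
    flags = re.ASCII if ascii_only else 0
    return re.sub(r"\w+", "___", text, flags=flags)


for sample in [",./hello-=+", ",./cześć-=+", ",./привет-=+", ",./你好，世界-=+"]:
    print(mask_words(sample))

# With ASCII-only matching, the non-ASCII letters are left untouched.
print(mask_words(",./cześć-=+", ascii_only=True))  # ,./___ść-=+
```

As in the Python 2 session above, the full-width comma is punctuation rather than a word character, so it splits the Chinese text into two separate matches.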
