python - Python, regular expression split and special characters

Question

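How can I correctly split a string containing a sentence with special characters, using whitespace as the separator? With the regex split method I cannot get the desired result.

Example code: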

# -*- coding: utf-8 -*-
import re


s="La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)").split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicit', '\xc3', '', '\xa0', '', ' ', '', '\xc3', '', '\xa8', '', ' ', 'tutto']
 word> La
 word>  
 word> felicit
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> 
 word> Ã
 word> 
 word> ?
 word> 
 word>  
 word> tutto

whereas I am looking for output like this:

 s> La felicità è tutto
 wordlist> ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Please note that s is a string returned by another method, so I cannot force the encoding like this:

s=u"La felicità è tutto"

I did not find a satisfactory explanation in the official Python documentation on Unicode and regular expressions.

Thanks.

Alessandro

score 16 · Accepted Answer

Your regular expression should be (\s) instead of (\W), like this:

l = re.compile("(\s)").split(s)

The code above gives exactly the output you asked for. However, the following line makes more sense:

l = re.compile("\s").split(s)

This splits on whitespace but does not return the spaces themselves as matches. You may need them, though, so I have posted both answers.
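The difference between the two patterns comes from the capturing group: re.split includes the text matched by a group in its result. A minimal sketch, written for Python 3 (an assumption, since the thread uses Python 2), where string literals are already Unicode; the variable names are only illustrative:

```python
import re

s = "La felicità è tutto"

# With a capturing group, the matched separators are kept in the result.
with_separators = re.split(r"(\s)", s)
# ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']

# Without the group, only the words are returned.
words_only = re.split(r"\s", s)
# ['La', 'felicità', 'è', 'tutto']
```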

score 4 · Accepted Answer

Try making the regular expression Unicode-aware with the re.UNICODE flag:

l=re.compile("\W", re.UNICODE).split(s)
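For what it is worth, on Python 3 (an assumption, since the question targets Python 2) str patterns are Unicode-aware by default, so re.UNICODE is redundant there and re.ASCII is what brings back the ASCII-only meaning of \W. A small sketch of the contrast, with illustrative variable names:

```python
import re

s = "La felicità è tutto"

# Python 3 str pattern: \W is Unicode-aware, so accented letters
# count as word characters and only the spaces act as separators.
unicode_aware = re.split(r"\W", s)
# ['La', 'felicità', 'è', 'tutto']

# re.ASCII limits word characters to [a-zA-Z0-9_], so the accented
# letters are treated as separators too, leaving empty strings behind.
ascii_only = re.split(r"\W", s, flags=re.ASCII)
# ['La', 'felicit', '', '', '', 'tutto']
```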

score 3 · Accepted Answer

I think using a regular expression is overkill in this case. If all you want to do is split the string on whitespace, I recommend using the string's own split method:

s = 'La felicità è tutto'
words = s.split()
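Note that str.split() with no argument splits on runs of whitespace and discards the separators, so it fits only when the spaces themselves are not needed. A short sketch (Python 3 assumed; the sample string with extra whitespace is made up for illustration):

```python
import re

s = "La  felicità\tè tutto "  # double space, tab, trailing space

# No argument: consecutive whitespace is one separator, no empty strings.
plain = s.split()
# ['La', 'felicità', 'è', 'tutto']

# re.split on a single \s yields an empty string for each extra separator.
with_regex = re.split(r"\s", s)
# ['La', '', 'felicità', 'è', 'tutto', '']
```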

score 3 · Accepted Answer

A Unicode regular expression will work, provided you give it a Unicode string to start with (which the example you posted does not). Try this:

s=u"La felicità è tutto" # "The happiness is everything" in italian
l=re.compile("(\W)",re.UNICODE).split(s)

print " s> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

Results:

 s> La felicità è tutto
 wordlist> [u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto']
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto

Your string s is created as a str, and is probably UTF-8 encoded, which is not the same thing as a unicode object.
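The same distinction can be reproduced on Python 3, where it shows up as bytes versus str. The sketch below assumes the other method returns UTF-8 encoded bytes (the question does not state the encoding); raw, broken, text and fixed are illustrative names.

```python
import re

# Stand-in for the value returned by the other method: UTF-8 encoded bytes.
raw = "La felicità è tutto".encode("utf-8")

# A bytes pattern matches byte by byte with ASCII rules, so each
# two-byte accented letter is torn apart, as in the question's output.
broken = re.split(rb"(\W)", raw)

# Decoding first lets \W work on whole characters.
text = raw.decode("utf-8")
fixed = re.split(r"(\W)", text)
# ['La', ' ', 'felicità', ' ', 'è', ' ', 'tutto']
```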

score 0 · Accepted Answer

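Well, after some further tests on Andrew Hare's answer, I saw that characters such as ()[]- and so on are no longer treated as separators. What I want is to split a sentence (keeping all the separators) into words made of alphanumeric characters, with that set extended to accented characters (that is, everything Unicode marks as alphanumeric). So kgiannakakis's solution is more correct, but it is missing the conversion of the string s to Unicode.

Take this extension of the first example: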

# -*- coding: utf-8 -*-
import re
s="(La felicità è tutto)"#no explicit unicode given string (UTF8)
l=re.compile("([\W])",re.UNICODE).split(unicode(s,'utf-8'))#split on s converted to unicode from utf8

print " string> "+s
print " wordlist> "+str(l)
for i in l:
    print " word> "+i

The output is now:

 string> (La felicità è tutto)
 wordlist> [u'', u'(', u'La', u' ', u'felicit\xe0', u' ', u'\xe8', u' ', u'tutto', u')', u'']
 word> 
 word> (
 word> La
 word>  
 word> felicità
 word>  
 word> è
 word>  
 word> tutto
 word> )
 word>

That is exactly what I'm looking for.

Cheers :)

Alessandro
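
For readers on Python 3, the approach above might be wrapped up roughly as follows. This is only a sketch: tokenize is a hypothetical helper name, UTF-8 is an assumed input encoding, and dropping the empty strings that re.split produces at the edges is an extra step that is not part of the original post.

```python
import re

# Unicode-aware for str input in Python 3; the group keeps the separators.
_SEPARATOR = re.compile(r"(\W)")


def tokenize(s, encoding="utf-8"):
    """Split s into words and separators, keeping the separators."""
    if isinstance(s, bytes):
        # Assumption: byte input is encoded as `encoding` (UTF-8 by default).
        s = s.decode(encoding)
    # Filter out the empty strings produced by adjacent or edge separators.
    return [part for part in _SEPARATOR.split(s) if part]


tokens = tokenize(b"(La felicit\xc3\xa0 \xc3\xa8 tutto)")
# ['(', 'La', ' ', 'felicità', ' ', 'è', ' ', 'tutto', ')']
```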
