python - Regex to split words in Python

Question

I have been designing a regex to split all the actual words out of a given text.

Example input:

"John's mom went there, but he wasn't there. So she said: 'Where are you'"

Expected output:

["John's", "mom", "went", "there", "but", "he", "wasn't", "there", "So", "she", "said", "Where", "are", "you"]

I came up with this regex:

"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"

After splitting with it in Python, the result contains None items and empty spaces.

How can I get rid of the None items? And why didn't the spaces match?

Edit:
Splitting on spaces gives items like: ["there."]
Splitting on non-letters gives items like: ["John","s"]
And splitting on non-letters except ' gives items like: ["'Where","you'"]

score 26 · Accepted Answer

Instead of a regex, you can use string functions.

to_be_removed = ".,:!" # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

for c in to_be_removed:
    s = s.replace(c, '')
s.split()

However, in your example you do not want to remove the apostrophe in John's, but you do want to remove it in you!!'. String operations fail at that point, so you need a finely tuned regex.
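To make that limitation concrete, here is a small sketch that runs the same replace loop on the sample sentence and inspects the last few tokens. The variable name `tokens` is only illustrative.

```python
to_be_removed = ".,:!"  # all characters to be removed
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Strip the listed punctuation, then split on whitespace.
for c in to_be_removed:
    s = s.replace(c, "")
tokens = s.split()

# The quoting apostrophes survive on the words at the edges of the quote.
print(tokens[-3:])  # ["'Where", 'are', "you'"]
```

Adding the apostrophe to `to_be_removed` would not help, because that would also turn John's into Johns.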

Edit: a simple regex can probably solve your problem:

(\w[\w']*)

It captures a run that starts with a word character (\w) and keeps capturing as long as the next character is an apostrophe or a word character.
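As a rough illustration of how this pattern behaves, the sketch below applies it with re.findall to two short strings. The names `first`, `quoted` and `mixed` are made up for the example. Note that \w also matches digits and underscores, not only letters.

```python
import re

first = re.compile(r"(\w[\w']*)")

# A closing quote directly after a word is swallowed as part of the word.
quoted = "'Where are you'"
print(first.findall(quoted))  # ['Where', 'are', "you'"]

# \w covers letters, digits and the underscore.
mixed = "snake_case 42"
print(first.findall(mixed))  # ['snake_case', '42']
```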

(\w[\w']*\w)

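This second regex is for a very specific situation.... The first regex can capture words like you'. This one avoids that and captures an apostrophe only when it is inside the word (not at the beginning or the end). But at that point another situation arises: with the second regex you cannot capture the apostrophe in Moss' mom. You have to decide whether to capture the trailing apostrophe of names that end with s and denote possession.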
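The trade-off between the two patterns can be sketched by running both on one string that contains a possessive and a quoted word. The sample string and variable names here are assumptions made for the illustration.

```python
import re

first = re.compile(r"(\w[\w']*)")      # may end on an apostrophe
second = re.compile(r"(\w[\w']*\w)")   # must end on a word character

text = "Moss' mom said 'you'"

print(first.findall(text))   # ["Moss'", 'mom', 'said', "you'"]
print(second.findall(text))  # ['Moss', 'mom', 'said', 'you']
```

Neither result is right for both words at once: the first keeps the unwanted quote on you', the second drops the possessive apostrophe from Moss'.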

Example:

import re

rgx = re.compile(r"(\w[\w']*\w)")  # raw string so \w reaches the regex engine unchanged
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you']

Update 2: I found a bug in my regex! It cannot capture a single letter followed by an apostrophe, as in A'. The fixed regex is:

(\w[\w']*\w|\w)

import re

rgx = re.compile(r"(\w[\w']*\w|\w)")  # second alternative matches one-character words
s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
rgx.findall(s)

["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', 'a']
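To isolate the bug, this short sketch compares the previous pattern with the fixed one on just the quoted single letters. The names `second`, `fixed` and `text` are illustrative.

```python
import re

second = re.compile(r"(\w[\w']*\w)")    # needs at least two word characters
fixed = re.compile(r"(\w[\w']*\w|\w)")  # falls back to a single word character

text = "'A a'"

print(second.findall(text))  # []
print(fixed.findall(text))   # ['A', 'a']
```

The order of the alternatives matters: the longer branch is tried first, so multi-character words are still captured whole.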

score 8 · Accepted Answer

Your regex has too many capturing groups. Make them non-capturing:

(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"
>>> re.split("(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)", s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', '']

This returns only one empty element.
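The following sketch shows where the None items in the question come from and one possible way to drop the remaining empty element. It relies on the documented behavior that re.split() also returns the text of every capturing group; the list-comprehension filter is an addition for illustration, not part of the answer above.

```python
import re

s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!'"

# Pattern from the question: four capturing groups.
capturing = r"(([^a-zA-Z]+')|('[^a-zA-Z]+))|([^a-zA-Z']+)"
# Same alternatives, rewritten with non-capturing groups.
non_capturing = r"(?:(?:[^a-zA-Z]+')|(?:'[^a-zA-Z]+))|(?:[^a-zA-Z']+)"

# re.split() inserts each capturing group's text into the result,
# using None for groups that did not take part in the match.
with_groups = re.split(capturing, s)
print(with_groups[:6])  # ["John's", None, None, None, ' ', 'mom']

# Without capturing groups only the pieces between separators remain;
# the truthiness filter drops the empty string left at the end.
words = [w for w in re.split(non_capturing, s) if w]
print(words)
```

This also explains the stray spaces in the question: each one is the text captured by the last group, not an unmatched separator.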

score 2 · Accepted Answer

This regex allows only one trailing apostrophe, which may be followed by one more character.

([\w][\w]*'?\w?)

Demo:

>>> import re
>>> s = "John's mom went there, but he wasn't there. So she said: 'Where are you!!' 'A a'"
>>> re.compile(r"([\w][\w]*'?\w?)").findall(s)
["John's", 'mom', 'went', 'there', 'but', 'he', "wasn't", 'there', 'So', 'she', 'said', 'Where', 'are', 'you', 'A', "a'"]
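Because at most one character may follow the apostrophe, words with a longer tail after the apostrophe get cut. The sketch below shows this on a made-up sample phrase and, for comparison, runs the pattern from the first answer on the same input.

```python
import re

one_char_tail = re.compile(r"([\w][\w]*'?\w?)")
inner_apostrophes = re.compile(r"(\w[\w']*\w|\w)")  # pattern from the first answer

text = "it's five o'clock"

print(one_char_tail.findall(text))      # ["it's", 'five', "o'c", 'lock']
print(inner_apostrophes.findall(text))  # ["it's", 'five', "o'clock"]
```

Which behavior is preferable depends on whether such words are expected in the input.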
