python - Splitting a string into words and punctuation

Question

I'm trying to split a string into words and punctuation, with the punctuation added to the list produced by the split.

For example:

>>> c = "help, me"
>>> print(c.split())
['help,', 'me']

What I really want the list to look like is:

['help', ',', 'me']

So I want the string split at whitespace, with the punctuation also split off from the words.

I tried parsing the string first and then running the split:

>>> separatedPunctuation = ""
>>> for character in c:
...     if character in ".,;!?":
...         outputCharacter = " %s" % character
...     else:
...         outputCharacter = character
...     separatedPunctuation += outputCharacter
...
>>> print(separatedPunctuation)
help , me
>>> print(separatedPunctuation.split())
['help', ',', 'me']

This produces the result I want, but it is painfully slow on large files.

Is there a more efficient way to do this?

score 100 · Accepted Answer

This is more or less the way to do it:

>>> import re
>>> re.findall(r"[\w']+|[.,!?;]", "Hello, I'm a string!")
['Hello', ',', "I'm", 'a', 'string', '!']

The trick is to think not about where to split the string, but about what to include in the tokens.

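Caveats:

The underscore (_) is considered an inner-word character. Replace \w if you don't want that.
This will not work with (single) quotes in the string.
Put any additional punctuation marks you want to use in the right half of the regular expression.
Anything not explicitly mentioned in the regex is silently dropped.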
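The caveats above are easier to see on a concrete input. This is only a sketch: the sample string and the two alternative patterns (`EXTENDED` and `NO_UNDERSCORE`) are my own illustrations, not part of the answer.

```python
import re

BASIC = r"[\w']+|[.,!?;]"               # the pattern from the answer
EXTENDED = r"[\w']+|[.,!?;:()\-]"       # assumption: extra punctuation added on the right
NO_UNDERSCORE = r"[A-Za-z0-9']+|[.,!?;]"  # assumption: ASCII-only replacement for \w

sample = "snake_case: it's a-b (ok)!"

# ':', '-', '(' and ')' are not mentioned in BASIC, so they silently vanish.
basic_tokens = re.findall(BASIC, sample)

# Listing them in the right-hand character class keeps them as tokens.
extended_tokens = re.findall(EXTENDED, sample)

# Without \w, the underscore is no longer a word character (and is dropped here).
no_underscore_tokens = re.findall(NO_UNDERSCORE, sample)

# Quote marks around a word stick to the word instead of becoming tokens.
quoted_tokens = re.findall(BASIC, "say 'hi'")

print(basic_tokens)
print(extended_tokens)
print(no_underscore_tokens)
print(quoted_tokens)
```

Note that the ASCII-only replacement also stops matching accented letters, so it is a trade-off rather than a drop-in fix.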

score 43 · Accepted Answer

Here is a Unicode-aware version:

re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

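The first alternative catches sequences of word characters (as defined by Unicode, so "résumé" won't turn into ['r', 'sum']). The second catches individual non-word characters, ignoring whitespace.

Note that, unlike the top answer, this treats the single quote as separate punctuation (e.g. "I'm" -> ['I', "'", 'm']). This appears to be standard in NLP, so I consider it a feature.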
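A small sketch of both behaviours, assuming Python 3, where str patterns are Unicode-aware by default (so the re.UNICODE flag is omitted). The sample sentence and the second pattern, which keeps apostrophes inside words, are my own additions for comparison.

```python
import re

# "résumé" written with escapes so the example does not depend on file encoding.
text = "My r\u00e9sum\u00e9, I'm sure!"

# Pattern from the answer: word runs, or any single non-word, non-space character.
unicode_tokens = re.findall(r"\w+|[^\w\s]", text)

# Hypothetical variant: allow the apostrophe inside words, as the top answer does.
apostrophe_tokens = re.findall(r"[\w']+|[^\w\s]", text)

print(unicode_tokens)     # the apostrophe is split out as its own token
print(apostrophe_tokens)  # "I'm" stays in one piece
```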

score 7 · Accepted Answer

Here's my entry.

I have my doubts about how well this will hold up in terms of efficiency, or whether it catches all cases (note that the "!!!" is grouped together; this may or may not be a good thing).

>>> import re
>>> s = "Helo, my name is Joe! and i live!!! in a button; factory:"
>>> l = [item for item in map(str.strip, re.split(r"(\W+)", s)) if len(item) > 0]
>>> l
['Helo', ',', 'my', 'name', 'is', 'Joe', '!', 'and', 'i', 'live', '!!!', 'in', 'a', 'button', ';', 'factory', ':']
>>>

One obvious optimization, if you are going to do this line by line, is to compile the regex beforehand (using re.compile).
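A minimal sketch of that optimization, assuming the input arrives as an iterable of lines. The names `SPLIT_RE`, `tokenize_line` and `sample_lines` are hypothetical. Since the re module already caches recently used patterns, the gain is likely to be modest.

```python
import re

# Compiled once, outside the loop. The capturing group makes split() keep
# the separators (runs of non-word characters) in its result.
SPLIT_RE = re.compile(r"(\W+)")

def tokenize_line(line):
    """Split one line into words and runs of punctuation, dropping whitespace."""
    pieces = (piece.strip() for piece in SPLIT_RE.split(line))
    return [piece for piece in pieces if piece]

sample_lines = ["Helo, my name is Joe!", "and i live!!! in a button; factory:"]
token_lists = [tokenize_line(line) for line in sample_lines]

for tokens in token_lists:
    print(tokens)
```

The same function can be mapped over an open file object, which yields lines one at a time.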

score 1 · Accepted Answer

Here's a minor update to your implementation. If you're trying to do anything more detailed, I suggest looking into NLTK, as le dorfier suggested.

This might only be a little faster, since ''.join() is used in place of +=.

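import string

d = "Hello, I'm a string!"

result = []
word = ''

for char in d:
    if char not in string.whitespace:
        if char not in string.ascii_letters + "'":
            # Punctuation (or any other non-letter): flush the current word,
            # then emit the character as its own token.
            if word:
                result.append(word)
            result.append(char)
            word = ''
        else:
            word = ''.join([word, char])
    else:
        # Whitespace ends the current word.
        if word:
            result.append(word)
            word = ''

# Flush a final word when the string does not end in punctuation or whitespace.
if word:
    result.append(word)

print(result)
# ['Hello', ',', "I'm", 'a', 'string', '!']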
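If the aim is to avoid repeated string concatenation, one variation is to collect the characters of the current word in a list and join them once per word. This is a sketch of that idea rather than a measured improvement; the function name `tokenize` and the `buffer` list are my own, and no timing claim is made.

```python
import string

def tokenize(text):
    """Split text into words and single punctuation characters."""
    word_chars = set(string.ascii_letters + "'")  # same word alphabet as above
    tokens = []
    buffer = []  # characters of the word currently being built

    for char in text:
        if char in word_chars:
            buffer.append(char)
            continue
        # Any other character ends the current word.
        if buffer:
            tokens.append("".join(buffer))
            buffer = []
        # Whitespace is discarded; everything else becomes its own token.
        if char not in string.whitespace:
            tokens.append(char)

    if buffer:  # flush a trailing word
        tokens.append("".join(buffer))
    return tokens

print(tokenize("Hello, I'm a string!"))
```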

score 0 · Accepted Answer

I think you can find all the help you can imagine in NLTK, especially since you are using Python. There is a comprehensive discussion of this issue in the tutorial.

score 0 · Accepted Answer

Try this:

string_big = "One of Python's coolest features is the string format operator  This operator is unique to strings"
my_list = []
x = len(string_big)
position_of_space = 0
while position_of_space < x:
    # Scan forward to the next space, or to the end of the string.
    for i in range(position_of_space, x):
        if string_big[i] == ' ':
            break
    # The slice runs through index i, so a chunk that ended at a space keeps that space.
    print(string_big[position_of_space:(i+1)])
    my_list.append(string_big[position_of_space:(i+1)])
    position_of_space = i+1

print(my_list)

score -1 · Accepted Answer

Have you tried using a regex?

http://docs.python.org/library/re.html#re-syntax

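By the way, why do you need the "," as the second element? You already know it comes after each piece of text, i.e.

[0]

","

[1]

","

So if you want to add the ",", you can just do it after each iteration when you use the array.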
