python - Efficiently split a string using multiple separators and retaining each separator?

Question

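I need to split strings of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, the separators need to remain in the output list, in between the items they separated in the string.

For instance,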

"Now is the winter of our discontent"

should output:

['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

I don't see how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?

score 21 · Accepted Answer

A different, non-regex approach from the others:

>>> import string
>>> from itertools import groupby
>>> 
>>> special = set(string.punctuation + string.whitespace)
>>> s = "One two  three    tab\ttabandspace\t end"
>>> 
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)]
>>> split_combined
['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)]
>>> split_separated
['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']

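You could use dict.fromkeys and .get instead of the lambda, I guess.

[edit]

Some explanation:

groupby accepts two arguments: an iterable and an (optional) key function. It loops through the iterable and groups the items by the value of the key function: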

>>> groupby("sentence", lambda c: c in 'nt')
<itertools.groupby object at 0x9805af4>
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')]
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]

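Here, items with contiguous equal values of the key function are grouped together. (This is actually a common source of bugs: people forget that they have to sort by the key function first if they want to group items that might not be adjacent.)

As @JonClements guessed, what I had in mind was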
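To make the sorting pitfall concrete, here is a minimal sketch. The word list and the first_letter helper are made up for illustration; only groupby and sorted come from the standard library.

```python
from itertools import groupby

words = ["apple", "bob", "avocado", "banana"]  # hypothetical sample data

def first_letter(word):
    return word[0]

# Without sorting, equal keys that are not adjacent land in separate groups.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first makes equal keys adjacent, so each key
# produces exactly one group.
sorted_groups = [
    (k, list(g))
    for k, g in groupby(sorted(words, key=first_letter), first_letter)
]

print(unsorted_groups)
# [('a', ['apple']), ('b', ['bob']), ('a', ['avocado']), ('b', ['banana'])]
print(sorted_groups)
# [('a', ['apple', 'avocado']), ('b', ['bob', 'banana'])]
```

For the splitting problem in this question the "adjacent only" behaviour is exactly what is wanted, so no sorting is needed there.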

>>> special = dict.fromkeys(string.punctuation + string.whitespace, True)
>>> s = "One two  three    tab\ttabandspace\t end"
>>> [''.join(g) for k,g in groupby(s, special.get)]
['One', ' ', 'two', '  ', 'three', '    ', 'tab', '\t', 'tabandspace', '\t ', 'end']

for the case where we were combining the separators. .get returns None if the key isn't in the dict.
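A small sketch of what groupby actually sees when dict.get is the key function. The short input "a, b" and the identity dict are my own additions for illustration, not part of the original answer; the identity variant is an assumption about how the "separated" behaviour could be written with .get.

```python
import string
from itertools import groupby

seps = string.punctuation + string.whitespace

# Every separator maps to True; anything else falls through to None.
special = dict.fromkeys(seps, True)
keys = [special.get(c) for c in "a, b"]
print(keys)  # [None, True, True, None]

# Adjacent separators share the key True, so they are merged.
combined = ["".join(g) for _, g in groupby("a, b", special.get)]
print(combined)  # ['a', ', ', 'b']

# Hypothetical variant: map each separator to itself, so that different
# adjacent separators get different keys and stay apart.
identity = {c: c for c in seps}
separated = ["".join(g) for _, g in groupby("a, b", identity.get)]
print(separated)  # ['a', ',', ' ', 'b']
```

Note that with the identity mapping, a run of the same separator (two spaces, say) still ends up in one group, just as with the lambda version above.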

score 7 · Accepted Answer

import re
import string

p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
    string.punctuation + string.whitespace)))

print(p.findall("Now is the winter of our discontent"))

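I don't like using regexps for every problem, but I don't think you have much choice here if you want it fast and short.

I'll explain the regexp, since you're not familiar with it:

[...] means any of the characters inside the square brackets
[^...] means any of the characters not inside the square brackets
+ after something means one or more of the preceding thing
x|y means match either x or y

So the regular expression matches runs of one or more characters in which either every character is punctuation or whitespace, or none is. The findall method finds all the non-overlapping matches of the pattern.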
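As a quick check of that reading of the pattern, here is a minimal sketch on a string that mixes punctuation and whitespace. The sample sentence is my own; the pattern is the one from the answer.

```python
import re
import string

seps = re.escape(string.punctuation + string.whitespace)
# First alternative: a run of non-separators. Second: a run of separators.
pattern = re.compile("[^{0}]+|[{0}]+".format(seps))

text = "Hello, world!  Bye."
tokens = pattern.findall(text)
print(tokens)
# ['Hello', ', ', 'world', '!  ', 'Bye', '.']

# Every character belongs to exactly one match, so nothing is lost.
print("".join(tokens) == text)  # True
```

Because the two alternatives together cover every possible character, joining the tokens always reproduces the input.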

score 4 · Accepted Answer

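Try this:

import re
import string
re.split('(['+re.escape(string.punctuation + string.whitespace)+']+)',"Now is the winter of our discontent")

Explanation from the Python documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.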
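To see the effect of the capturing parentheses, here is a minimal sketch comparing the two forms. The sample strings are made up for illustration.

```python
import re
import string

sep_class = "[" + re.escape(string.punctuation + string.whitespace) + "]+"
text = "Now is, the winter"

# No capturing group: the separators are discarded.
without_group = re.split(sep_class, text)
print(without_group)  # ['Now', 'is', 'the', 'winter']

# Capturing group: each separator run is kept between the pieces.
with_group = re.split("(" + sep_class + ")", text)
print(with_group)  # ['Now', ' ', 'is', ', ', 'the', ' ', 'winter']

# Caveat: a separator at the very start or end yields an empty string there.
edges = re.split("(" + sep_class + ")", "...end.")
print(edges)  # ['', '...', 'end', '.', '']
```

If those empty strings at the edges are unwanted, they can be filtered out afterwards, for example with a list comprehension that keeps only non-empty items.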

score 3 · Accepted Answer

A solution in linear (O(n)) time:

Let's say you have a string:

original = "a, b...c    d"

First, convert all separators to spaces:

import string

splitters = string.punctuation + string.whitespace
# Python 3 spelling; on Python 2 this was string.maketrans
trans = str.maketrans(splitters, ' ' * len(splitters))
s = original.translate(trans)

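Now s == 'a  b   c    d' (every separator character has become one space, so the length is unchanged). You can now use itertools.groupby to alternate between runs of spaces and runs of non-spaces:

import itertools

result = []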
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    letter_count = len(list(letters))
    result.append(original[position:position + letter_count])
    position += letter_count

Now result == ['a', ', ', 'b', '...', 'c', '    ', 'd'], which is what you need.
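Putting the steps together, here is a self-contained sketch for Python 3. The function name split_keep_separators is my own, and str.maketrans stands in for the Python 2 string.maketrans.

```python
import itertools
import string

def split_keep_separators(original):
    """Split on punctuation/whitespace runs, keeping the runs themselves."""
    splitters = string.punctuation + string.whitespace
    # Map every separator to a space; the string length does not change.
    trans = str.maketrans(splitters, " " * len(splitters))
    flattened = original.translate(trans)

    result = []
    position = 0
    # Runs alternate between "all spaces" and "no spaces"; slice the
    # original string with the same run lengths to recover the real text.
    for _, run in itertools.groupby(flattened, lambda c: c == " "):
        run_length = sum(1 for _ in run)
        result.append(original[position:position + run_length])
        position += run_length
    return result

print(split_keep_separators("a, b...c    d"))
# ['a', ', ', 'b', '...', 'c', '    ', 'd']
```

The translation table is rebuilt on every call here for simplicity; if this runs in a hot loop, building it once at module level would be the obvious tweak.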

score 1 · Accepted Answer

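Depending on the text you are dealing with, you may be able to simplify your concept of a delimiter to "anything other than letters and digits". If that works for you, you can use the following regex solution: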

re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

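This assumes that you want to split on each individual delimiter character even when they occur consecutively, so 'foo..bar' would become ['foo', '.', '.', 'bar']. If you expect ['foo', '..', 'bar'] instead, use [a-zA-Z\d]+|[^a-zA-Z\d]+ (the only difference is the + added at the very end).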
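A minimal sketch comparing the two patterns side by side. The sample strings are my own.

```python
import re

text = "foo..bar baz"

# Each delimiter character is its own token.
single = re.findall(r"[a-zA-Z\d]+|[^a-zA-Z\d]", text)
print(single)   # ['foo', '.', '.', 'bar', ' ', 'baz']

# Consecutive delimiter characters are merged into one token.
grouped = re.findall(r"[a-zA-Z\d]+|[^a-zA-Z\d]+", text)
print(grouped)  # ['foo', '..', 'bar', ' ', 'baz']

# Caveat of the simplification: anything outside a-z, A-Z and digits
# counts as a delimiter, including underscores and accented letters.
accented = re.findall(r"[a-zA-Z\d]+|[^a-zA-Z\d]", "café")
print(accented)  # ['caf', 'é']
```

Whether that caveat matters depends on the data; for plain ASCII text the simplified pattern behaves much like the string.punctuation version, except that the underscore is not treated differently from other symbols.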

score 1 · Accepted Answer

My take:

from string import whitespace, punctuation
import re

pattern = re.escape(whitespace + punctuation)
print(re.split('([' + pattern + '])', 'now is the winter of'))

score 0 · Accepted Answer

For any arbitrary collection of separators:

def separate(myStr, seps):
    answer = []
    temp = []
    for char in myStr:
        if char in seps:
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer

In [4]: print(separate("Now is the winter of our discontent", set(' ')))
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

In [5]: print(separate("Now, really - it is the winter of our discontent", set(' ,-')))
['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']
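As the second output shows, adjacent separators produce empty strings between them. If those are unwanted, one possible variant (my own sketch, with the hypothetical name separate_no_empties) only flushes the buffer when it actually holds something:

```python
def separate_no_empties(text, seps):
    """Like separate(), but never emits empty strings."""
    answer = []
    temp = []
    for char in text:
        if char in seps:
            if temp:  # flush the pending word only if there is one
                answer.append("".join(temp))
                temp = []
            answer.append(char)
        else:
            temp.append(char)
    if temp:  # trailing word, if any
        answer.append("".join(temp))
    return answer

print(separate_no_empties("Now, really - it", set(" ,-")))
# ['Now', ',', ' ', 'really', ' ', '-', ' ', 'it']
```

Each separator character is still emitted on its own; only the empty placeholders are dropped.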

Hope this helps

score 0 · Accepted Answer

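from functools import reduce  # reduce is not a builtin on Python 3
from string import punctuation, whitespace

s = "..test. and stuff"

# Pad each punctuation character with spaces so that split() isolates it.
f = lambda s, c: s + ' ' + c + ' ' if c in punctuation else s + c
l = sum([reduce(f, word).split() for word in s.split()], [])

print(l)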

score -1 · Accepted Answer

from itertools import chain, cycle  # on Python 3 the builtin zip replaces izip

s = "Now is the winter of our discontent"
words = s.split()

wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))
# result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']
