python - Tokenizing a string in Python while keeping the delimiters

Question

Is there a Python equivalent of str.split that also returns the delimiters?

I need to preserve the whitespace layout for my output after processing some of the tokens.

Example:

>>> s="\tthis is an  example"
>>> print s.split()
['this', 'is', 'an', 'example']

>>> print what_I_want(s)
['\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']

Thanks!

score 19 · Accepted Answer


How about

import re
splitter = re.compile(r'(\s+|\S+)')
splitter.findall(s)
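
To make the round trip concrete, here is a minimal sketch in Python 3 syntax. The helper names `tokenize_keep_whitespace` and `upper_words` are made up for illustration, and uppercasing stands in for whatever processing the tokens actually need. It tokenizes with the same alternation pattern, changes only the non-whitespace tokens, and joins everything back so the original layout survives.

```python
import re

# Alternating runs of whitespace and non-whitespace, as in the answer above.
TOKEN_RE = re.compile(r'\s+|\S+')

def tokenize_keep_whitespace(s):
    """Return the whitespace runs and word runs of s, in order."""
    return TOKEN_RE.findall(s)

def upper_words(s):
    """Hypothetical processing step: uppercase the words, keep the layout."""
    tokens = tokenize_keep_whitespace(s)
    return ''.join(t if t.isspace() else t.upper() for t in tokens)

s = "\tthis is an  example"
print(tokenize_keep_whitespace(s))
print(repr(upper_words(s)))
```

Because every character of the input falls into exactly one token, `''.join(tokens)` reproduces the input unchanged when no token is modified.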

answered 2009-11-30T15:08:11.833

score 6 · Accepted Answer

>>> import re
>>> re.compile(r'(\s+)').split("\tthis is an  example")
['', '\t', 'this', ' ', 'is', ' ', 'an', '  ', 'example']
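
Note the empty string at the start of that result: it appears because the input begins with a delimiter. A small sketch (Python 3 syntax; the function name `split_keep_whitespace` is invented here) that drops such empty edge strings while keeping every whitespace run:

```python
import re

def split_keep_whitespace(s):
    """Split on whitespace runs, keep them, and drop empty strings."""
    parts = re.split(r'(\s+)', s)
    # Empty strings only show up where the input starts or ends with a delimiter.
    return [p for p in parts if p]

print(split_keep_whitespace("\tthis is an  example"))
```

Filtering rather than stripping keeps the leading tab, which matters when the layout has to be reproduced.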

score 4 · Accepted Answer

The re module provides this functionality:

>>> import re
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']

(quoted from the Python documentation).

For your example (splitting on whitespace), use re.split(r'(\s+)', '\tThis is an example').

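The key is to enclose the regex you split on in capturing parentheses. That way, the delimiters are added to the resulting list.

Edit: As pointed out, a delimiter at the very start or end of the string will of course also end up in the list (with an empty string next to it). To avoid that, you can call the .strip() method on the input string first.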
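
As a quick illustration of both points (Python 3 syntax; the variable names are chosen only for this sketch), the snippet compares the same split with and without a capturing group, and shows what stripping the input first does to the question's example string. Stripping also discards the leading tab, so it only fits cases where leading and trailing whitespace does not need to be kept.

```python
import re

text = 'Words, words, words.'

# Non-capturing pattern: the delimiters are dropped from the result.
without_group = re.split(r'\W+', text)
# Capturing pattern: the delimiters are returned between the fields.
with_group = re.split(r'(\W+)', text)

s = "\tthis is an  example"
# Stripping first removes the leading tab before splitting.
stripped_first = re.split(r'(\s+)', s.strip())

print(without_group)
print(with_group)
print(stripped_first)
```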

score 3 · Accepted Answer

Have you looked at pyparsing? An example borrowed from the pyparsing wiki:

>>> from pyparsing import Word, alphas
>>> greet = Word(alphas) + "," + Word(alphas) + "!"
>>> hello1 = 'Hello, World!'
>>> hello2 = 'Greetings, Earthlings!'
>>> for hello in hello1, hello2:
...     print (u'%s \u2192 %r' % (hello, greet.parseString(hello))).encode('utf-8')
... 
Hello, World! → (['Hello', ',', 'World', '!'], {})
Greetings, Earthlings! → (['Greetings', ',', 'Earthlings', '!'], {})

score -1 · Accepted Answer

Thanks for pointing me to the re module. I'm still trying to decide between that and using my own function that returns a sequence...

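def split_keep_delimiters(s, delims="\t\n\r "):
    """Yield alternating runs of delimiter and non-delimiter characters."""
    if not s:
        return  # nothing to yield; also avoids an IndexError on s[0]
    delim_group = s[0] in delims  # is the current run made of delimiters?
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            # The run type changed: emit the finished run and start a new one.
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]  # final run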

If I have time, I'll benchmark them xD
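
One possible way to run such a benchmark is sketched below with the standard timeit module (Python 3 syntax). The sample string, the repeat count, and the name `split_with_regex` are arbitrary choices for this sketch, and the generator is repeated with an empty-input guard so the block runs on its own. No timings are claimed here; results will vary by machine and input.

```python
import re
import timeit

SPLITTER = re.compile(r'\s+|\S+')

def split_with_regex(s):
    """Regex approach: alternating whitespace and non-whitespace runs."""
    return SPLITTER.findall(s)

def split_keep_delimiters(s, delims="\t\n\r "):
    """Same idea as the generator above, with a guard for empty input."""
    if not s:
        return
    delim_group = s[0] in delims
    start = 0
    for index, char in enumerate(s):
        if delim_group != (char in delims):
            delim_group = not delim_group
            yield s[start:index]
            start = index
    yield s[start:]

sample = "\tthis is an  example " * 1000

# Both approaches should agree before their timings are compared.
assert split_with_regex(sample) == list(split_keep_delimiters(sample))

candidates = [
    ("regex", split_with_regex),
    ("generator", lambda s: list(split_keep_delimiters(s))),
]
for name, func in candidates:
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(f"{name}: {seconds:.4f} s for 20 runs")
```

The two functions only agree when the input's whitespace is limited to the characters in `delims`; `\s` also matches others, such as form feed and vertical tab.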
