python - Multi-line pattern matching in Python

Question

A periodic computer-generated message (simplified):

Hello user123,

- (604)7080900
- 152
- minutes

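Regards

Using Python, how can I extract "(604)7080900", "152", and "minutes" (i.e. any text that follows a leading "- " pattern) between the two empty lines (the empty lines being the \n\n after "Hello user123," and the \n\n before "Regards")? It would be even better if the resulting strings were stored in a list. Thanks!

Edit: the number of lines between the two blank lines is not fixed.

Second edit:

For example: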

hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world

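x1, x2 and x3 are good, since together they are surrounded by 2 empty lines; x4 is good for the same reason. x6 is not good because no blank line follows it, and x7 is not good because no blank line precedes it. x2 is good (unlike x6 and x7) because the line before it is a good line and the line after it is also good.

This condition may not have been clear when I posted the question:

a continuous run of good lines between 2 empty lines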

good line must have leading "- "
good line must follow an empty line or follow another good line
good line must be followed by an empty line or followed by another good line

Thanks

score 4 · Accepted Answer

>>> import re
>>>
>>> x="""Hello user123,
...
... - (604)7080900
... - 152
... - minutes
...
... Regards
... """
>>>
>>> re.findall(r"\n+\n-\s*(.*)\n-\s*(.*)\n-\s*(minutes)\s*\n\n+",x)
[('(604)7080900', '152', 'minutes')]
>>>
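The pattern above is written for exactly three "- " lines, the last of which must be "minutes". Since the question's edit says the number of lines is not fixed, here is a minimal sketch of one way to generalise it; it is not part of the original answer. It assumes lines end in "\n", that an empty line contains no whitespace, and that a block needs an empty line both before and after it. `BLOCK` and `find_blocks` are hypothetical names chosen for illustration.

```python
import re

# One or more consecutive "- " lines, preceded by an empty line
# (lookbehind for "\n\n") and followed by an empty line (lookahead for "\n").
# Assumption: empty lines contain no whitespace at all.
BLOCK = re.compile(r"(?<=\n\n)((?:- .*\n)+)(?=\n)")


def find_blocks(text):
    """Return one list of values per block of "- " lines."""
    blocks = []
    for block in BLOCK.findall(text):
        # Drop the "- " prefix and any trailing whitespace from each line.
        blocks.append([line[2:].rstrip() for line in block.splitlines()])
    return blocks


example = "hello\n\n- x1\n- x2\n- x3\n\n- x4\n\n- x6\nmorning\n- x7\n\nworld"
print(find_blocks(example))  # [['x1', 'x2', 'x3'], ['x4']]
```

The lookahead and lookbehind do not consume the empty lines, so one empty line can close a block and open the next one, as it does between x3 and x4.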

score 3 · Accepted Answer

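The simplest approach is to go over the lines (assuming you have a list of lines or a file, or you split the string into a list of lines) until you see a line that is just '\n', then check that each following line starts with '- ' (using the startswith string method), slice that prefix off, and store the result until you find another empty line. For example: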

# If you have a single string, split it into lines.
L = s.splitlines()
# If you (now) have a list of lines, grab an iterator so we can continue
# iteration where it left off.
it = iter(L)
# Alternatively, if you have a file, just use that directly:
# it = open(....)

# The extracted values.
data = []
# Find the first empty line:
for line in it:
    # Treat lines of just whitespace as empty lines too. If you don't want
    # that, do 'if line == ""'.
    if not line.strip():
        break
# Now starts data.
for line in it:
    if not line.rstrip():
        # End of data.
        break
    if line.startswith('- '):
        # Drop the '- ' prefix and any trailing whitespace.
        data.append(line[2:].rstrip())
    else:
        # Malformed data?
        raise ValueError("malformed line %r" % (line,))

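Edit: since you elaborated on what you want to do, here is an updated version of the loop. It no longer loops twice; instead it collects data until it encounters a 'bad' line, and either saves or discards the collected lines when it encounters a block separator. It does not restart iteration, so it needs no explicit iterator and you can pass it a list of lines (or any iterable).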

def getblocks(L):
    # The list of good blocks (as lists of lines.) You can also make this
    # a flat list if you prefer.
    data = []
    # The list of good lines encountered in the current block
    # (but the block may still become bad.)
    block = []
    # Whether the current block is bad.
    bad = 1
    for line in L:
        # Not in a 'good' block, and encountering the block separator.
        if bad and not line.rstrip():
            bad = 0
            block = []
            continue
        # In a 'good' block and encountering the block separator.
        if not bad and not line.rstrip():
            # Save 'good' data. Or, if you want a flat list of lines,
            # use 'extend' instead of 'append' (also below.)
            data.append(block)
            block = []
            continue
        if not bad and line.startswith('- '):
            # A good line in a 'good' (not 'bad' yet) block; save the line,
            # minus the '- ' prefix and trailing whitespace.
            block.append(line[2:].rstrip())
            continue
        else:
            # A 'bad' line, invalidating the current block.
            bad = 1
    # Don't forget to handle the last block, if it's good
    # (and if you want to handle the last block.)
    if not bad and block:
        data.append(block)
    return data

And here it is in action:

>>> L = """hello
...
... - x1
... - x2
... - x3
...
... - x4
...
... - x6
... morning
... - x7
...
... world""".splitlines()
>>> print(getblocks(L))
[['x1', 'x2', 'x3'], ['x4']]
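For comparison, the same rule can also be sketched with `itertools.groupby`, which splits the lines into alternating runs of empty and non-empty lines. This is an alternative technique, not the answer's own method, and `good_blocks` is a hypothetical name. It assumes that a block at the very start or very end of the input, with no empty line on one side, should be dropped, and it treats several consecutive empty lines as a single separator.

```python
from itertools import groupby


def good_blocks(lines):
    """Return runs of '- ' lines that have an empty line on both sides."""
    # Group consecutive lines by whether they are blank (whitespace only).
    runs = [(blank, list(group))
            for blank, group in groupby(lines, key=lambda ln: not ln.strip())]
    result = []
    for i, (blank, run) in enumerate(runs):
        if blank:
            continue
        # The first and last runs lack an empty line on one side, so only
        # interior runs (whose neighbours are blank runs) can qualify.
        if i == 0 or i == len(runs) - 1:
            continue
        # Every line in the run must be a '- ' line.
        if all(line.startswith('- ') for line in run):
            result.append([line[2:].rstrip() for line in run])
    return result


sample = "hello\n\n- x1\n- x2\n- x3\n\n- x4\n\n- x6\nmorning\n- x7\n\nworld"
print(good_blocks(sample.splitlines()))  # [['x1', 'x2', 'x3'], ['x4']]
```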

score 1 · Accepted Answer

s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards  

Hello user124,

- (604)8576576
- 345
- minutes
- seconds
- bla

Regards"""

Do this:

result = []
for data in s.split('Regards'): 
    result.append([v.strip() for v in data.split('-')[1:]])
del result[-1] # remove empty list at end

And you get this:

>>> result
[['(604)7080900', '152', 'minutes'],
['(604)8576576', '345', 'minutes', 'seconds', 'bla']]
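One thing to watch for: `data.split('-')` also splits on hyphens inside a value, so a number written as 604-708-0900 would come back in pieces. Below is a minimal sketch that works line by line instead. It assumes each value sits on its own line starting with "- " and that "Regards" only ever appears as the message footer; `split_messages` and its `footer` parameter are hypothetical names.

```python
def split_messages(text, footer='Regards'):
    """Return one list of '- ' values per message, split on the footer."""
    result = []
    for chunk in text.split(footer):
        # Keep only lines that start with '- ' and drop that prefix.
        items = [line[2:].strip()
                 for line in chunk.splitlines()
                 if line.startswith('- ')]
        # Chunks without any values (such as the empty tail) are skipped.
        if items:
            result.append(items)
    return result


text = """Hello user123,

- 604-708-0900
- 152
- minutes

Regards

Hello user124,

- 345

Regards"""
print(split_messages(text))  # [['604-708-0900', '152', 'minutes'], ['345']]
```

Skipping chunks that contain no values also removes the need for the `del result[-1]` step.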

score 1 · Accepted Answer

>>> s = """Hello user123,

- (604)7080900
- 152
- minutes

Regards
"""
>>> import re
>>> re.findall(r'^- (.*)', s, re.M)
['(604)7080900', '152', 'minutes']
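Note that this pattern picks up every line that starts with "- ", whether or not it is surrounded by empty lines. A quick check against the second example from the question shows the behaviour (`text` and `matches` are illustrative names):

```python
import re

text = """hello

- x1
- x2
- x3

- x4

- x6
morning
- x7

world"""

# re.M makes ^ match at the start of every line, so each "- " line matches
# on its own, regardless of what comes before or after it.
matches = re.findall(r'^- (.*)', text, re.M)
print(matches)  # ['x1', 'x2', 'x3', 'x4', 'x6', 'x7']
```

So it fits the simplified message, but x6 and x7 would need to be filtered out separately to satisfy the rules from the second edit.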
