python - Regex to match a keyword only when the same word does not occur again before another keyword

Question

I need to find where the word "test" is followed by "follow" without another "test" appearing in between.

Example:

test
word
word
word
test
test
word
word
follow
word
word
test

I only want this part:

test
word
word
word
test
**test**
**word**
**word**
**follow**
word
word
test

However, I am not familiar enough with regex to do this. Any advice would be great.

Edit: the word test occurs multiple times, but the word follow occurs only once in the string.

score 2 · Accepted Answer

You need a regex that uses a lookahead here.

test(?:\w|\s(?!test))+?follow

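(?:) is a non-capturing group. \w matches any word character [a-zA-Z0-9_]. \s matches any whitespace (including newlines). \s(?!test) matches only whitespace that is not followed by test (in regex terms this is called a negative lookahead). ()+? simply makes the match non-greedy.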
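
As a rough check, the pattern can be run against the sample from the question. This is a minimal Python 3 sketch; the names text, pat and matches are chosen here for illustration and do not come from the answer.

```python
import re

# Sample input from the question, one word per line.
text = "test\nword\nword\nword\ntest\ntest\nword\nword\nfollow\nword\nword\ntest"

# \s(?!test) refuses to consume whitespace that is directly followed by "test",
# so a match cannot run past a later "test" that comes right after whitespace.
pat = r'test(?:\w|\s(?!test))+?follow'

matches = re.findall(pat, text)
print(matches)  # ['test\nword\nword\nfollow']
```

Only the last test before follow starts a match; the two earlier ones are rejected because the engine would have to step over a newline that is followed by test.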

Test on matching input:

test
word
**test**
**word**
**follow**
word
test
**test**
**word**
**word**
**follow**
word
word
**test**
**word**
**follow**

The following regex also rules out substring matches (the test in testing, protest, and so on).

(?<!\w)(test)\s(?!\1\s)(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)
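
To see what the stricter version changes, here is a small Python 3 sketch that compares the two patterns on an input containing protest. The input string and the variable names are assumptions made for this illustration.

```python
import re

# "protest" contains "test" as a substring; the real keyword appears later.
text = "protest\nword\nfollow\ntest\nword\nfollow"

simple = r'test(?:\w|\s(?!test))+?follow'
strict = r'(?<!\w)(test)\s(?!\1\s)(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)'

# Compare where each pattern starts matching.
simple_starts = [m.start() for m in re.finditer(simple, text)]
strict_starts = [m.start() for m in re.finditer(strict, text)]

print(simple_starts)  # [3, 20]  (index 3 is the "test" inside "protest")
print(strict_starts)  # [20]     (only the standalone "test")
```

The (?<!\w) lookbehind is what rejects the match starting inside protest, since the character before that test is a word character.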

score 0 · Accepted Answer

Ravi's regex pattern gives wrong results in some cases.
Example:

import re
s = """test
word 1
word 2
word 3
test 
tutulululalalo
testimony
word A
word B
follow
word X
word Y
test
"""

pat = (r'test(?:\w|\s(?!test))+?follow')
print(re.findall(pat,s))
#
#result : ['testimony\nword A\nword B\nfollow']

The pattern should be:

pat = (r'test(?=\s)'  r'(?:\w|\s(?!test(?=\s)))+?'  'follow')
print(re.findall(pat,s))
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

Besides, I don't see the point of the OR expression. This works:

pat = (r'(test(?=\s)'  r'(?:.(?!test(?=\s)))+?'  'follow)')
print(re.findall(pat,s,re.DOTALL))
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

Finally, I prefer the following pattern, because it verifies in a single pass that there is no 'test' between the starting 'test' and the final 'follow', whereas '(?:\w|\s(?!test(?=\s)))+?' and '(?:.(?!test(?=\s)))+?' repeat their lookahead check at every character.

pat = (r'test(?=\s)'
       r'(?!.+?test(?=\s).*?follow)'
       '.+?'
       'follow')
print(re.findall(pat,s,re.DOTALL))
#
#result : ['test \ntutulululalalo\ntestimony\nword A\nword B\nfollow']

.

Edit 1

As Ravi Thapliyal pointed out, my last regex pattern

pat = ('test(?=\s)'
       '(?!.+?test(?=\s).*?follow)'
       '.+?'
       'follow')

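is not perfect either.
I tried this pattern because I never liked the pattern (?!.(?=something))+
My last regex pattern was meant to replace that unlikable pattern.
Well, it doesn't work. All my efforts to make it work failed, although I seem to remember using such a pattern long ago, with some subtle extra part that made it work.
Unfortunately, I did not succeed, and I think I will give up the idea that it might work one day.
So I have decided to drop my old idea and to accept, once and for all, that the pattern I never liked is the most obvious one, and the easiest to understand and to write.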

.

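Now there is a second shortcoming I have to confess. I saw that Ravi Thapliyal's regex pattern fails in some cases, but I did not think through all the possible failure cases myself.
It is easy to fix, though: instead of writing test(?=\s) with only a lookahead assertion, I should have written (?<=\s)test(?=\s), with both a lookbehind and a lookahead assertion.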
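
The effect of adding the lookbehind can be sketched in a few lines of Python 3. The sample string and variable names below are assumptions for illustration only.

```python
import re

# "protest" ends in "test" and is followed by a newline.
text = "protest\nword\ntest\nfollow\n"

# Lookahead only: the "test" at the end of "protest" is accepted as well.
lookahead_only = [m.start() for m in re.finditer(r'test(?=\s)', text)]

# Lookbehind and lookahead: only a "test" with whitespace on both sides matches.
both_sides = [m.start() for m in re.finditer(r'(?<=\s)test(?=\s)', text)]

print(lookahead_only)  # [3, 13]
print(both_sides)      # [13]

# Note: (?<=\s) needs a real whitespace character before the word,
# so a "test" at the very start of the string is not matched.
at_start = re.search(r'(?<=\s)test(?=\s)', "test\nfollow\n")
print(at_start)  # None
```

One consequence worth keeping in mind: with (?<=\s), a keyword on the very first line of the input is skipped unless the string is given a leading whitespace character.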

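Ravi Thapliyal chose to write (?<!\w)(test)\s(?!\1\s), but this way of writing it has some drawbacks.
In a Python string that is not a raw string, (?!\1\s) has to be written (?!\\1\s).
Because of the capturing group (test), re.finditer() has to be used instead of re.findall().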
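
A short Python 3 sketch of how re.findall() and re.finditer() differ once a pattern contains a capturing group, and of how a backreference behaves in raw and non-raw strings. The sample text and variable names are assumptions for illustration.

```python
import re

text = "test\nword\nfollow"

# Raw string: the backreference can be written as \1.
pat_rt = r'(?<!\w)(test)\s(?:\w|\s(?!\1\s))*?(?<!\w)follow(?!\w)'

# With one capturing group, findall() returns only that group's text.
groups_only = re.findall(pat_rt, text)
print(groups_only)  # ['test']

# finditer() gives match objects, so the whole match is available.
whole = [m.group() for m in re.finditer(pat_rt, text)]
print(whole)  # ['test\nword\nfollow']

# In a non-raw string, '\1' is the control character chr(1),
# so the backreference has to be written '\\1'.
print('\1' == chr(1))  # True
print('\\1' == r'\1')  # True
```

Making the group non-capturing, or dropping it and repeating the literal word, would let re.findall() return whole matches again.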

He also writes (?:\w|\s(?!\\1\s))*?. I see no point in complicating a pattern with an OR expression when a dot does the same job, as in (?!.(?=something))+

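Moreover, what I consider the biggest flaw is that Ravi's regex pattern cannot match strings containing characters other than those symbolized by \w.

For all these reasons, I propose the following corrected solution: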

import re

s1 = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word
unfollow
word B
follow
word X
test
word Y
follow
"""

s2 = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow
"""

.

# eyquem's pattern
fu = '(?<=\s)%s(?=\s)'
a = fu % 'test'
z = fu % 'follow'
pat = ('%s'
       '(?:(?!%s).)+?'
       '%s'
       % (a,a,z))

# Ravi's pattern
patRT = ('(?<!\w)(test)\s'
         '(?:\w|\s(?!\\1\s))*?(?<!\w)follow(?!\w)')


for x in (s1,s2):
    print(x)
    print(re.findall(pat,x,re.DOTALL))
    print()
    print([m.group() for m in re.finditer(patRT,x)])
    print()

Result

test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word
unfollow
word B
follow
word X
test
word Y
follow

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

.

test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow

['test\nword 3\ntutulululalalo\nprotest\ntestimony\nword ???????\nunfollow\nword B\nfollow', 'test\nword Y\nfollow']

['test\nword Y\nfollow']

.

Edit 2

To answer exactly what was asked:

s = """test
word 1
word 2
test
word 3
tutulululalalo
protest
testimony
word ???????
unfollow
word B
follow
word X
test
word Y
follow
"""
fu = r'(?<=\s)%s(?=\s)'
a,z = fu % 'test' ,  fu % 'follow'
pat = ('%s'
       '(?:(?!%s).)+?'
       '%s'
       % (a,a,z))

def ripl(m):
    # wrap every line of the matched block in ** ... **
    return re.sub('(?m)^(.*)$','**\\1**',m.group())

print(re.sub(pat,ripl,s,flags=re.DOTALL))

ripl() is the function used to perform the replacement. It receives each match as a match object and returns the transformed text, which re.sub() then uses as the replacement.
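
For reference, a compact Python 3 sketch of the same callback technique on a shorter input. The names emphasize, text and result are chosen here for illustration, and the input is a shortened stand-in for the sample above.

```python
import re

text = "test\nword\ntest\nword\nfollow\nword\n"

# A keyword counts only when it has whitespace on both sides.
word = r'(?<=\s)%s(?=\s)'
start, end = word % 'test', word % 'follow'

# From a "test" up to "follow", never stepping onto another standalone "test".
pattern = '%s(?:(?!%s).)+?%s' % (start, start, end)

def emphasize(match):
    # Called by re.sub() once per match; wraps each matched line in ** ... **.
    return re.sub(r'(?m)^(.*)$', r'**\1**', match.group())

result = re.sub(pattern, emphasize, text, flags=re.DOTALL)
print(result)
```

Passing a function instead of a replacement string lets the replacement depend on the matched text, which is what makes per-line wrapping possible here.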
