regex - Regular expression: negative lookahead between two matches

Question

I am trying to build a regular expression of the following form:

[match-word] ... [exclude-specific-word] ... [match-word]

This seems to work with a negative lookahead, but I run into a problem in a situation like this:

[match-word] ... [exclude-specific-word] ... [match-word] ... [excluded word appears again]

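I want the sentence above to match, but the negative lookahead between the first and second matched words "spills over", so the second word never matches.

Let's look at a practical example.

I want to match every sentence that contains the word "i" and the word "pie", but that does not contain the word "hate" between those two words. I have these three sentences: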

i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this

I have this regular expression:

^i(?!.*hate).*pie          - have removed the word boundaries for clarity, original is: ^i\b(?!.*\bhate\b).*\bpie\b

This matches the first sentence but not the second, because the negative lookahead scans the entire rest of the string.
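To make the failure concrete, here is a minimal sketch in Python. Python's `re` module is used purely for illustration, and the names `sentences`, `unbounded` and `results` are assumptions of this sketch, not part of the original setup.

```python
import re

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
]

# (?!.*hate) is evaluated once, right after the leading "i", and its .*
# lets it look all the way to the end of the string, past "pie".
unbounded = re.compile(r"^i(?!.*hate).*pie")

results = [bool(unbounded.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
# The second sentence is rejected only because "hate" appears after "pie".
```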

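Is there a way to bound the negative lookahead so that it is satisfied if it encounters "pie" before it encounters "hate"?

Note: in my implementation there may be other terms after this regex (it is built dynamically from a grammar search engine). For example: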

^i(?!.*hate).*pie.*donuts

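I am currently using JRegex, but I could switch to JDK Regex if necessary.

Update: I forgot to mention something in my original question:

The "negative construct" may appear later in the sentence. I would like the sentence to match, if possible, even when the "negative" construct appears further on.

To clarify, look at these sentences: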

i sure like eating pie, but i love donuts <- Want to match this
i sure like eating pie, but i hate donuts <- Want to match this
i sure hate eating pie, but i like donuts <- Don't want to match this
i sure like eating pie, but i like donuts and i hate making pie <- Do want to match this

rob's answer works perfectly for this extra constraint, so I am accepting it.

score 5 · Accepted Answer

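At every character between the start word and the stop word, you have to check that neither the negative word nor the stop word begins there. Like this (I have included a little whitespace for readability):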

^i ( (?!hate|pie) . )* pie

Here is a Python program to test it.

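import re

# Each entry pairs a sentence with whether the pattern is expected to match.
test = [ ('i sure like eating pie, but i love donuts', True),
         ('i sure like eating pie, but i hate donuts', True),
         ('i sure hate eating pie, but i like donuts', False) ]

# re.X (verbose mode) ignores the whitespace inside the pattern.
rx = re.compile(r"^i ((?!hate|pie).)* pie", re.X)

for t, v in test:
    m = rx.match(t)
    # Python 3 print(); shows "pass" when the result agrees with the expectation.
    print(t, "pass" if bool(m) == v else "fail")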
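Since the question mentions word boundaries, trailing terms, and a dynamically built pattern, the same idea can be sketched as a small builder. This is only an illustration under assumptions: the helper name `between_without` and its parameters are hypothetical, and Python's `re` stands in for whichever engine is actually used.

```python
import re

def between_without(start, stop, excluded):
    """Build 'start ... stop' with no `excluded` word in between (hypothetical helper)."""
    s, e, x = (re.escape(w) for w in (start, stop, excluded))
    # Consume one character at a time, but only while neither the excluded
    # word nor the stop word begins at the current position.
    return rf"^{s}\b(?:(?!\b(?:{x}|{e})\b).)*\b{e}\b"

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]

# A further term is appended after the bounded part, as in the question.
pattern = re.compile(between_without("i", "pie", "hate") + r".*\bdonuts\b")

results = [bool(pattern.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```

Because the repeated group can never step over the stop word, the check ends at the first "pie", so a later "hate" does not affect the outcome.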

score 3 · Accepted Answer

This regex should work for you:

^(?!i.*hate.*pie)i.*pie.*donuts
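A quick way to check this pattern against the four sentences from the question is the following Python sketch (the variable names are illustrative). Worked through by hand, the lookahead `i.*hate.*pie` also finds the late "hate ... pie" pair in the fourth sentence, so that sentence appears to be rejected even though the question's update asks for it to match.

```python
import re

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]

# The lookahead rejects any string in which "hate" is followed, anywhere
# later on, by "pie"; it is not limited to the span before the first "pie".
pattern = re.compile(r"^(?!i.*hate.*pie)i.*pie.*donuts")

results = [bool(pattern.search(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```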

Explanation

"^" +          // Assert position at the beginning of a line (at beginning of the string or after a line break character)
"(?!" +        // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
   "i" +          // Match the character “i” literally
   "." +          // Match any single character that is not a line break character
      "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "hate" +       // Match the characters “hate” literally
   "." +          // Match any single character that is not a line break character
      "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   "pie" +        // Match the characters “pie” literally
")" +
"i" +          // Match the character “i” literally
"." +          // Match any single character that is not a line break character
   "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"pie" +        // Match the characters “pie” literally
"." +          // Match any single character that is not a line break character
   "*" +          // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"donuts"       // Match the characters “donuts” literally

score 2 · Accepted Answer

To avoid matching C between ...A...B...:

Test in Python:

$ python
>>> import re
>>> re.match(r'.*A(?!.*C.*B).*B', 'C A x B C')
<_sre.SRE_Match object at 0x94ab7c8>

So I get this regex:

.*\bi\b(?!.*hate.*pie).*pie
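As a hedged check, here is the same pattern run over the four sentences from the question, using `re.match` as in the session above (the variable names are illustrative). By hand, every standalone "i" in the fourth sentence is still followed by "hate ... pie", so that sentence does not seem to match with this approach.

```python
import re

sentences = [
    "i sure like eating pie, but i love donuts",
    "i sure like eating pie, but i hate donuts",
    "i sure hate eating pie, but i like donuts",
    "i sure like eating pie, but i like donuts and i hate making pie",
]

# The leading .* lets the engine try each standalone "i" in turn; the
# lookahead then rejects any "i" that is followed by "hate" and a later "pie".
pattern = re.compile(r".*\bi\b(?!.*hate.*pie).*pie")

results = [bool(pattern.match(s)) for s in sentences]
for sentence, matched in zip(sentences, results):
    print(matched, sentence)
```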
