regex - RegEx: Text immediately after the last open parenthesis

Question

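I have a little knowledge of RegEx, but at the moment this is far beyond my abilities.

I need help finding the text/expression immediately after the last open parenthesis that does not have a matching close parenthesis.

It is for the CallTip feature of an open source software project (Object Pascal) under development.

Some examples are shown below: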

------------------------------------
Text                  I need
------------------------------------
aaa(xxx               xxx
aaa(xxx,              xxx
aaa(xxx, yyy          xxx
aaa(y=bbb(xxx)        y=bbb(xxx)
aaa(y <- bbb(xxx)     y <- bbb(xxx)
aaa(bbb(ccc(xxx       xxx
aaa(bbb(x), ccc(xxx   xxx
aaa(bbb(x), ccc(x)    bbb(x)
aaa(bbb(x), ccc(x),   bbb(x)
aaa(?, bbb(??         ??
aaa(bbb(x), ccc(x))   ''
aaa(x)                ''
aaa(bbb(              ''
------------------------------------

For all of the text above, the RegEx proposed by @Bohemian
(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(?=[ ,]|$)(?! <-)(?<! <-)
matches all cases.

For the cases below (which I found while implementing the RegEx in the software) it does not:
------------------------------------
New text              I need
------------------------------------
aaa(bbb(x, y)         bbb(x, y)
aaa(bbb(x, y, z)      bbb(x, y, z)
------------------------------------

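Is it possible to write a RegEx (PCRE) for these situations?

In a previous post (RegEx: Word immediately before the last opened parenthesis), Alan Moore (many thanks once again) helped me find the text immediately before the last open parenthesis with the RegEx below: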

\w+(?=\((?:[^()]*\([^()]*\))*[^()]*$)

However, I was not able to adjust it properly to match the text immediately after.

Could anyone help, please?

score 6 · Accepted Answer

This is similar to this problem. And since you are using PCRE, there actually is a solution, using the recursion syntax.

/
(?(DEFINE)                # define a named capture for later convenience
  (?P<parenthesized>      # define the group "parenthesized" which matches a
                          # substring which contains correctly nested
                          # parentheses (it does not have to be enclosed in
                          # parentheses though)
    [^()]*                # match arbitrarily many non-parenthesis characters
    (?:                   # start non capturing group
      [(]                 # match a literal opening (
      (?P>parenthesized)  # recursively call this "parenthesized" subpattern
                          # i.e. make sure that the contents of these literal ()
                          # are also correctly parenthesized
      [)]                 # match a literal closing )
      [^()]*              # match more non-parenthesis characters
    )*                    # repeat
  )                       # end of "parenthesized" pattern
)                         # end of DEFINE sequence

# Now the actual pattern begins

(?<=[(])                  # ensure that there is a literal ( left of the start
                          # of the match
(?P>parenthesized)?       # match correctly parenthesized substring
$                         # ensure that we've reached the end of the input
/x                        # activate free-spacing mode

The crux of this pattern is obviously the parenthesized subpattern. I should maybe elaborate a bit more on that. Its structure is this:

(normal* (?:special normal*)*)

Here normal is [^()] and special is [(](?P>parenthesized)[)]. This technique is called "unrolling-the-loop". It is used to match anything that has the structure

nnnsnnsnnnnsnnsnn

where n is matched by normal and s is matched by special.

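In this particular case things are a bit more complicated, though, because we are also using recursion. (?P>parenthesized) recursively uses the parenthesized pattern (which it is itself part of). You can view the (?P>...) syntax as a bit like a backreference, except that the engine does not try to match what the group ... matched, but instead applies its subpattern again.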
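
For readers without a recursion-capable engine at hand, the effect of the recursive subpattern can be imitated by hand. The following is only a sketch and not part of the original answer: Python's built-in re module has no recursion, so a small hand-written scanner stands in for (?P>parenthesized), and the names match_parenthesized and after_last_unmatched are hypothetical helpers made up for illustration.

```python
def match_parenthesized(text, start=0):
    """Return the index just past the longest balanced run beginning at start.

    Mirrors normal* (?: special normal* )* where normal is [^()] and
    special is "(" + a recursive call + ")".
    """
    i = start
    n = len(text)
    while True:
        # normal*: skip characters that are not parentheses
        while i < n and text[i] not in "()":
            i += 1
        # special: "(" followed by balanced contents and a closing ")"
        if i < n and text[i] == "(":
            inner_end = match_parenthesized(text, i + 1)
            if inner_end < n and text[inner_end] == ")":
                i = inner_end + 1
                continue
        # neither normal nor a complete special: the balanced run ends here
        return i


def after_last_unmatched(text):
    """Imitate (?<=[(]) (?P>parenthesized)? $ by trying start positions
    from left to right and keeping the first one that follows a "(" and
    whose balanced run reaches the end of the text."""
    for pos in range(1, len(text) + 1):
        if text[pos - 1] == "(" and match_parenthesized(text, pos) == len(text):
            return text[pos:]
    return None  # no unmatched "(" in the text
```

With these helpers, after_last_unmatched("aaa(y=bbb(xxx)") gives y=bbb(xxx), and an input with deeper nesting such as aaa(xxx(yyy()) gives xxx(yyy()), which is what recursion buys over a fixed nesting depth.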

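Also note that my pattern will not give you an empty string for correctly parenthesized input; it will fail instead. You could fix this by leaving out the lookbehind. The lookbehind is actually not necessary, because the engine will always return the leftmost match.

EDIT: Judging by two of your examples, you do not actually want everything after the last unmatched parenthesis, but only everything up to the first comma. You can take my result and split it on , or try Bohemian's answer.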

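Further reading:

PCRE subpatterns (including named groups)
PCRE recursion
"Unrolling-the-loop" was introduced by Jeffrey Friedl in his book Mastering Regular Expressions, but I think the post linked above gives a good overview.
Using (?(DEFINE)...) is actually abusing another feature called conditional patterns. The PCRE man pages explain how it works: search the page for "Defining subpatterns for use by reference only".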

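EDIT: I just noticed that you mention in your question that you are using Object Pascal. In that case you are probably not actually using PCRE, which means there is no support for recursion. If so, there can be no full regex solution to the problem. If we impose a limitation such as "there can only be one more nesting level after the last unmatched parenthesis" (as in all of your examples), then we can come up with a solution. Again, I'll use "unrolling-the-loop" to match substrings of the form xxx(xxx)xxx(xxx)xxx: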

(?<=[(])         # make sure we start after an opening (
(?=              # lookahead checks that the parenthesis is not matched
  [^()]*([(][^()]*[)][^()]*)*
                 # this matches an arbitrarily long chain of parenthesized
                 # substring, but allows only one nesting level
  $              # make sure we can reach the end of the string like this
)                # end of lookahead
[^(),]*([(][^()]*[)][^(),]*)*
                 # now actually match the desired part. this is the same
                 # as the lookahead, except we do not allow for commas
                 # outside of parentheses now, so that you only get the
                 # first comma-separated part

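If you add an input example like aaa(xxx(yyy()), where you want to match xxx(yyy()), then this approach will not match it. In fact, no regex that does not use recursion can handle arbitrary nesting levels.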
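
As a quick way to try the one-level pattern outside Object Pascal, here is a sketch using Python's built-in re module, which supports the lookbehind and lookahead used here but not recursion. The pattern is the one above with the capturing groups written as non-capturing ones; the names ONE_LEVEL and first_argument are made up for this illustration.

```python
import re

# One-level pattern from above in verbose mode. The (?: ... ) groups are
# non-capturing versions of the original groups; the matching is the same.
ONE_LEVEL = re.compile(r"""
    (?<=[(])                                   # start right after a "("
    (?= [^()]* (?: [(] [^()]* [)] [^()]* )* $ )  # that "(" is never closed
    [^(),]* (?: [(] [^()]* [)] [^(),]* )*      # first comma-separated part
""", re.VERBOSE)


def first_argument(text):
    """Return the matched text, or None when the pattern does not match."""
    m = ONE_LEVEL.search(text)
    return m.group(0) if m else None


for sample in ["aaa(bbb(x), ccc(x)", "aaa(bbb(x, y, z)", "aaa(xxx(yyy())"]:
    print(repr(sample), "->", repr(first_argument(sample)))
```

The last sample has two nesting levels after the unmatched parenthesis, so first_argument returns None for it, which illustrates the limitation described above.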

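Since your regex flavor does not support recursion, you are probably better off not using regex at all. Even if my last regex matches all of your current input examples, it is really convoluted and maybe not worth the trouble. How about this instead: walk the string character by character and maintain a stack of parenthesis positions. Then the following pseudocode gives you everything after the last unmatched (: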

while you can read another character from the string
    if that character is "(", push the current position onto the stack
    if that character is ")", pop a position from the stack (if it is not empty)
# you've reached the end of the string now
if the stack is empty, there is no match
else the top of the stack is the position of the last unmatched parenthesis;
     take a substring from there to the end of the string
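
The loop above could look roughly like this in Python. This is a sketch rather than the answerer's own code: the function name after_last_unmatched_paren is hypothetical, and ignoring a stray ")" that has no opening partner is an assumption about how such input should be treated.

```python
def after_last_unmatched_paren(text):
    """Return everything after the last unmatched "(", or None if every
    "(" in text has a matching ")"."""
    stack = []  # positions of "(" characters that are still open
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")" and stack:
            # Assumption: a ")" with nothing open is simply ignored.
            stack.pop()
    if not stack:
        return None  # nothing is left open, so there is no match
    # The top of the stack is the last unmatched "("; skip past it.
    return text[stack[-1] + 1:]
```

For example, after_last_unmatched_paren("aaa(bbb(x), ccc(xxx") yields xxx, while a fully closed call such as aaa(x) yields None.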

Then, to get everything up to the first unnested comma, you can walk that result once more:

nestingLevel = 0
while you can read another character from the string
    if that character is "," and nestingLevel == 0, stop
    if that character is "(" increment nestingLevel
    if that character is ")" decrement nestingLevel
take a substring from the beginning of the string to the position at which
  you left the loop
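
A possible Python rendering of this second loop, again only as a sketch with a hypothetical function name (up_to_first_top_level_comma). It takes the string produced by the first loop as its input.

```python
def up_to_first_top_level_comma(text):
    """Return text up to (not including) the first comma that is not
    inside parentheses; return all of text if there is no such comma."""
    nesting_level = 0
    for pos, ch in enumerate(text):
        if ch == "," and nesting_level == 0:
            return text[:pos]  # stop at the first unnested comma
        if ch == "(":
            nesting_level += 1
        elif ch == ")":
            nesting_level -= 1
    return text  # reached the end without an unnested comma
```

Applied to bbb(x, y), ccc(x) this returns bbb(x, y), because the comma between x and y sits at nesting level 1 and is skipped.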

These two short loops will be much easier for anyone else to understand in the future, and they are a lot more flexible than a regex solution (at least one without recursion).

score 1 · Accepted Answer

Use a lookahead:

(?<=\()(?=([^()]*\([^()]*\))*[^()]*$).*?(\(.*?\))?(?=[ ,]|$)(?! <-)(?<! <-)

See this running on Rubular, passing all of the test cases posted in the question.
