c# - Regex to check whether a string is within a specific pattern that may contain nested parentheses in C#

Question

I am trying to write code that checks whether a given string contains certain strings inside a specific pattern. To be precise, for example:

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};

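Now I want to extract

"homo sapiens", "human" and "woman" but NOT "man"

from the list above, following the pattern: the string is either directly preceded by ~ or lies inside parentheses that are opened with ~(. So far I have come up with: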

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string>{"homo sapiens","human","man","woman"};
var prunedList = new List<string>();
foreach(var term in checkList)
{
   var pattern = @"~(\s)*(\(\s*)?(\(?\w\s*\)?)*" + term + @"(\s*\))?";
   Match m = Regex.Match(mainString, pattern);
   if(m.Success)
   {
      prunedList.Add(term);
   }
 }

But this pattern does not work for all cases... Can anyone suggest how this could be done?

score 2 · Accepted Answer

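I have written a simple parser that works well for the example you gave.

I don't know what the expected behavior is for a string that ends with this pattern: ~(some words (that is, a valid opening with no closing parenthesis).

I'm sure this could be tidied up a bit...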

private bool Contains(string source, string given)
{
    return ExtractValidPhrases(source).Any(p => RegexMatch(p, given));
}

private bool RegexMatch(string phrase, string given)
{
    // Escape the term so that regex metacharacters in it are matched literally.
    return Regex.IsMatch(phrase, string.Format(@"\b{0}\b", Regex.Escape(given)), RegexOptions.IgnoreCase);
}

private IEnumerable<string> ExtractValidPhrases(string source)
{
    bool valid = false;
    var parentheses = new Stack<char>();
    var phrase = new StringBuilder();

    for(int i = 0; i < source.Length; i++)
    {
        if (valid) phrase.Append(source[i]);

        switch (source[i])
        {
            case '~':
                valid = true;
                break;

            case ' ':
                if (valid && parentheses.Count == 0)
                {
                    yield return phrase.ToString();
                    phrase.Clear();
                }
                if (parentheses.Count == 0) valid = false;
                break;

            case '(':
                if (valid)
                {
                    parentheses.Push('(');
                }
                break;

            case ')':
                // Guard against a stray ')' so Pop() is never called on an empty stack.
                if (valid && parentheses.Count > 0)
                {
                    parentheses.Pop();
                }
                break;
        }
    }

    //if (valid && !parentheses.Any()) yield return phrase.ToString();
    if (valid) yield return phrase.ToString();
}

Here are the tests I used:

// NUnit tests
[Test]
[TestCase("Homo Sapiens", true)]
[TestCase("human", true)]
[TestCase("woman", true)]
[TestCase("man", false)]
public void X(string given, bool shouldBeFound)
{
    const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";

    Assert.AreEqual(shouldBeFound, Contains(mainString, given));
}

[Test]
public void Y()
{
    const string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
    var checkList = new List<string> {"homo sapiens", "human", "man", "woman"};
    var expected = new List<string> { "homo sapiens", "human", "woman" };

    var filtered = checkList.Where(s => Contains(mainString, s));

    CollectionAssert.AreEquivalent(expected, filtered);
}

score 2 · Accepted Answer

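The language of balanced parentheses is not regular, so you cannot accomplish what you want using RegEx. A better approach is to use traditional string parsing with a couple of counters (one for opening parentheses and one for closing parentheses), or to use a stack, to create a model similar to a pushdown automaton.

To get a better idea of the concept, see PDA on Wikipedia: http://en.wikipedia.org/wiki/Pushdown_automaton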

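Below is an example that uses a stack to get the string inside the outermost parentheses (a rough sketch).

// originalString is the input to scan.
var stack = new Stack<int>();
char[] stringToParse = originalString.ToCharArray();
string stringBetweenParens = null;

for (int i = 0; i < stringToParse.Length; i++)
{
    // Remember where each opening parenthesis is.
    if (stringToParse[i] == '(')
        stack.Push(i);

    // On a closing parenthesis, take the text between it and its partner.
    if (stringToParse[i] == ')' && stack.Count > 0)
    {
        int open = stack.Pop();
        stringBetweenParens = originalString.Substring(open + 1, i - open - 1);
    }
}

Of course this is a contrived example, and more serious parsing would take some work, but it gives you the basic idea of how to do it. I have left out things such as how to keep the text inside nested parentheses, for example getting "inner" from the string "(outer (inner))" (as written, the variable ends up holding "outer (inner)"), and how to store the strings that are found.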
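The two points left open above (nested text and storing the results) could be handled along the following lines. This is only a minimal sketch: the names ParenScanner and OutermostGroups are made up for illustration, and the input is assumed to have balanced parentheses.

```csharp
using System.Collections.Generic;

public static class ParenScanner  // hypothetical name, not from the answer
{
    // Returns the text inside each outermost pair of parentheses, in order.
    // Assumes balanced input; a stray ')' is simply skipped.
    public static List<string> OutermostGroups(string text)
    {
        var groups = new List<string>();
        var openPositions = new Stack<int>();

        for (int i = 0; i < text.Length; i++)
        {
            if (text[i] == '(')
            {
                openPositions.Push(i);
            }
            else if (text[i] == ')' && openPositions.Count > 0)
            {
                int open = openPositions.Pop();
                // An empty stack after the pop means this pair was outermost.
                if (openPositions.Count == 0)
                    groups.Add(text.Substring(open + 1, i - open - 1));
            }
        }
        return groups;
    }
}
```

Dropping the `openPositions.Count == 0` check would collect the nested groups as well, innermost first.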

score 2 · Accepted Answer

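For academic reasons, I would like to present a regex solution as well, mostly because you are probably using the only regex engine that is capable of solving this.

After clearing up some interesting issues with the combination of .NET's unique features, here is the code that gets you the desired results: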

string mainString = @"~(Homo Sapiens means (human being)) or man or ~woman";
List<string> checkList = new List<string> { "homo sapiens", "human", "man", "woman" };

// build subpattern "(?:homo sapiens|human|man|woman)"
string searchAlternation = "(?:" + String.Join("|", checkList.ToArray()) + ")";

MatchCollection matches = Regex.Matches(
    mainString,
    @"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)"+searchAlternation,
    RegexOptions.IgnoreCase
);
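To get from the MatchCollection back to a pruned list like the one in the question, something along these lines might do. It is a sketch, not part of the original answer: the names TildeScopeFilter and Prune are invented here, and the terms are passed through Regex.Escape on the assumption that they should be matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TildeScopeFilter  // hypothetical name, not from the answer
{
    // The lookbehind from the answer, unchanged.
    private const string Lookbehind =
        @"(?<=~|(?(Depth)(?!))~[(](?>[^()]+|(?<-Depth>)?[(]|(?<Depth>[)]))*)";

    // Returns the terms of checkList that occur in a "~" scope of mainString.
    public static List<string> Prune(string mainString, IEnumerable<string> checkList)
    {
        var terms = checkList.ToList();

        // Build the subpattern "(?:term1|term2|...)" with each term escaped.
        string alternation =
            "(?:" + string.Join("|", terms.Select(t => Regex.Escape(t))) + ")";

        // Collect the matched text case-insensitively, mirroring IgnoreCase.
        var found = new HashSet<string>(
            Regex.Matches(mainString, Lookbehind + alternation, RegexOptions.IgnoreCase)
                 .Cast<Match>()
                 .Select(m => m.Value),
            StringComparer.OrdinalIgnoreCase);

        // Keep the original order and spelling of the check list.
        return terms.Where(t => found.Contains(t)).ToList();
    }
}
```

For the question's input, Prune is expected to keep "homo sapiens", "human" and "woman" and drop "man". Like the pattern in the answer, this does not use word boundaries, so a term that appears as part of a longer word inside a ~ scope would also count.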

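How does this work? Firstly, .NET supports balancing groups, which allow detection of correctly nested patterns. Every time you capture something with a named capturing group (like (?<Depth>somepattern)), it does not overwrite the last capture but is pushed onto a stack instead. You can pop one capture from that stack with (?<-Depth>). This fails if the stack is empty (just like something that does not match at the current position). And you can check whether the stack is empty or not with (?(Depth)patternIfNotEmpty|patternIfEmpty).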
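As a smaller illustration of push, pop and the conditional, here is a sketch that only checks whether parentheses are balanced. It scans left to right, so unlike the lookbehind in the answer it pushes on ( and pops on ). The names BalancingGroupDemo and IsBalanced are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class BalancingGroupDemo  // hypothetical name, not from the answer
{
    // [^()]            any character that is not a parenthesis
    // (?<Depth>[(])    an opening parenthesis pushes a capture onto Depth
    // (?<-Depth>[)])   a closing parenthesis pops one; fails if Depth is empty
    // (?(Depth)(?!))   at the end, fail if anything is left on the stack
    private static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<Depth>[(])|(?<-Depth>[)]))*(?(Depth)(?!))$");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }
}
```

With this pattern, IsBalanced("(a(b))") is true, while "(()" and ")(" are rejected: the first leaves a capture on the stack, and the second tries to pop from an empty one.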

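On top of that, .NET is one of the few regex engines that support variable-length lookbehinds. If we can use these two features together, we can look to the left of one of our desired strings and check whether there is a ~( somewhere outside the current nesting structure.

But here is the catch: lookbehinds are executed from right to left in .NET. That means we need to push closing parentheses and pop when we encounter opening parentheses, not the other way round.

So here is some explanation of that murderous regex (it is easier to understand if you read the lookbehind from bottom to top, as .NET does):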

(?<=              # lookbehind
  ~               # if there is a literal ~ to the left of our string, we're good
|                 # OR
  (?(Depth)(?!))  # if there is something left on the stack, we started outside
                  # of the parentheses that end at "~("
  ~[(]            # match a literal ~(
  (?>             # subpattern to analyze parentheses. the > makes the group
                  # atomic, i.e. suppresses backtracking. Note: we can only do
                  # this, because the three alternatives are mutually exclusive
    [^()]+        # consume any non-parens characters without caring about them
  |               # OR
    (?<-Depth>)?  # pop the top of stack IF possible. the last ? is necessary for
                  # cases like "human" where we start with a ( before there was
                  # a ) which could be popped.
    [(]           # match a literal (
  |               # OR
    (?<Depth>[)]) # match a literal ) and push it onto the stack
  )*              # repeat for as long as possible
)                 # end of lookbehind
(?:homo sapiens|human|man|woman)
                  # match one of the words in the check list

score 1 · Accepted Answer

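Parenthesis checking is a context-free language (or grammar), and checking it requires a stack. Regular expressions are suited to regular languages. They have no memory, so they cannot be used for this purpose.

To check this, you need to scan the string and count the parentheses:

initialize count to 0
scan the string
- if the current character is ( then increment count
- if the current character is ) then decrement count
- if count is negative, raise an error that the parentheses are inconsistent, e.g. )(
at the end, if count is positive, there are some unclosed parentheses
if count is zero, the test is passed

Or in C#: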

public static bool CheckParentheses(string input)
{
    int count = 0;
    foreach (var ch in input)
    {
        if (ch == '(') count++;
        if (ch == ')') count--;

        // if a parenthesis is closed without being opened return false
        if(count < 0)
            return false;
    }

    // in the end the test is passed only if count is zero
    return count == 0;
}

As you can see, regular expressions cannot count, so they cannot check such patterns.

score 1 · Accepted Answer

This cannot be done using regular expressions. You should abandon the idea of using them and use normal string operations such as IndexOf.
