java - Recursively capturing groups with a regex that uses backreferences in Java

Question

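I am trying to recursively capture multiple groups in a string, using a regular expression that also contains a backreference to one of its groups. I am using Pattern and Matcher with a "while(matcher.find())" loop, but it captures only the last instance instead of all of them. In my case the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, and <poil>. Since these are formatting tags, I need to capture: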

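Any text outside of a tag, so that it can be formatted as "normal" text. I do this by capturing the text before a tag in one group and the tag itself in another group, and as I iterate I remove everything that has been captured from the original string; if any text is left at the end, I format it as "normal" text.
The "name" of the tag, so that I know how the text inside the tag needs to be formatted
The text content of the tag, which is formatted according to the tag name and its associated rules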

Here is my sample code:

        String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>";
        String remainingText = currentText;

        //first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text
        if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*"))
        {                
            //an opening or closing tag has been found, so let us start our pattern captures
            //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag
            Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
            Matcher matcher1 = pattern1.matcher(currentText);                
            int iteration = 0;
            while(matcher1.find()){
                System.out.print("Iteration ");
                System.out.println(++iteration);
                System.out.println("group1:"+matcher1.group(1));
                System.out.println("group2:"+matcher1.group(2));
                System.out.println("group3:"+matcher1.group(3));
                System.out.println("group4:"+matcher1.group(4));

                if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false)
                {
                    m_xText.insertString(xTextRange, matcher1.group(1), false);
                    remainingText = remainingText.replaceFirst(matcher1.group(1), "");
                }
                if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false)
                {
                    switch (matcher1.group(2)) {
                        case "pof": [...]
                        case "pos": [...]
                        case "poif": [...]
                        case "po": [...]
                        case "poi": [...]
                        case "pol": [...]
                        case "poil": [...]
                        case "sm": [...]
                    }
                    remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", "");
                }
            }
        }

System.out.println prints to the console only once, with the following result:

Iteration 1:
  group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

Group 3 can be ignored; the only useful groups are 1, 2, and 4 (group 3 is part of group 2). Why does this capture only the last tag instance, "poil", and not the preceding "pof", "poi", and "po" tags?

The output I would like to see is:

Iteration 1:
  group1:the man said:
  group2:pof
  group3:po
  group4:“This one, at last, is bone of my bones

Iteration 2:
  group1:
  group2:poi
  group3:po
  group4:and flesh of my flesh;

Iteration 3:
  group1:
  group2:po
  group3:po
  group4:This one shall be called ‘woman,’

Iteration 4:
  group1:
  group2:poil
  group3:po
  group4:for out of man this one has been taken.”

score 1 · Accepted Answer

I have just found the answer to this problem. The first capture group needed a non-greedy (reluctant) quantifier, just like the fourth capture group. This works exactly as needed:

Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS);
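The reason this helps, as far as I can tell: a greedy (.*) first consumes the whole string and then backtracks only as far as the last place where the rest of the pattern can still match, so group 1 swallows every earlier tag and the single match ends at the end of the input, leaving nothing for the next find(). With (.*?) each find() stops at the first tag it reaches. Below is a minimal, self-contained sketch of that behaviour. The class name LazyTagCapture, the capture helper, the "|"-separated result format, and the shortened ASCII sample string are all illustrative assumptions and not part of the original code. The sketch also picks up trailing text with matcher.end() instead of replaceFirst, since replaceFirst would interpret the captured text as a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyTagCapture {
    // Same pattern as above: group 1 is reluctant, \2 forces the closing tag
    // to repeat the name captured by group 2.
    private static final Pattern TAGGED = Pattern.compile(
            "(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: returns one "before|tag|content" entry per tag,
    // plus a final "rest|..." entry if text remains after the last tag.
    static List<String> capture(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = TAGGED.matcher(text);
        int end = 0; // end offset of the most recent successful match
        while (m.find()) {
            parts.add(m.group(1) + "|" + m.group(2) + "|" + m.group(4));
            end = m.end();
        }
        if (end < text.length()) {
            parts.add("rest|" + text.substring(end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Shortened, ASCII-only version of the sample string (an assumption).
        String text = "the man said:<pof>bone of my bones</pof>"
                + "<poi>flesh of my flesh;</poi>"
                + "<po>called woman,</po>"
                + "<poil>taken.</poil> Amen";
        for (String part : capture(text)) {
            System.out.println(part);
        }
    }
}
```

Run against the shortened sample, the loop iterates four times (pof, poi, po, poil), and the text after the last closing tag is reported separately as the "rest" entry.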
