php - Parse balanced nested wiki templates and extract the content of a single-line parameter with regular expressions

Question

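I know that parsing nested strings or HTML is better done with a real parser, but in my case I have simple templates and wanted to extract the content of the wiki parameter 'title' from each template. It took me a while to get there, but thanks to Lars Olav Torvik's regexp tool ( http://regex.larsolavtorvik.com/ ) and this user forum I got it working. Maybe someone will find it useful. (We all want to contribute, don't we? ;-) The following code, annotated with comments, does the trick. I had to use look-around assertions so that two templates do not get mixed up when one of them has no title.

Regarding the two questions in the regexp comments (?# Question: …): I am still unsure about (?R). Does it take what it checks from the outermost defined level, i.e. from the 2nd regexp line with \{\{ to the last regexp line with \}\}? Is that correct? And what is the difference between ++ and + in front of the alternative with (?R)? Both seem to work when I test them.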

The original wiki templates on the page (simplest case):

$wikiTemplate = "
{{Templ1
| title = (1. template) title
}}

{{Templ2
| any parameter = something {{template}}
}}

{{Templ1
| title = (3. template) title
}}
";

The replacement:

$wikiTemplate = preg_replace(
  array(
  // tag all templates with START … END and add a TITLE-placeholder before
  // and take care of balanced {{ …  }} recursiveness 
    "@(?s)   (?# switch to dotall match, i.e. also linebreaks )
      \{\{ (?# find two {{ )
      (?: (?# group 1 as a non-backreferenced match  )
        (?:  (?# group 2 as a non-backreferenced match  )
          (?! (?# in group 1 anything but not {{ or }} )
            \{\{ 
            |   (?# or )
            \}\}
          )
          .
        )++  (?# Question: what is the difference between ++ and + here? )
        |    (?# or )
        (?R) (?# is it recursive of what is defined in the outermost,
              i.e. 2nd regexp line with \{\{ and last line with \}\}
              Question: is that here understood correctly? ) 
      )
      * (?# zero or more times the inner regexp definitions )
      \}\} (?# find two }} )
    @x",// x-extended → ignore white space in the pattern
  // replace TITLE by single line content of title parameter 
    "@
      (?<=TITLE) (?# TITLE must precede the following linebreak but is not
                  backreferenced within \\0, i.e. the whole returned match)
      ([\n\r]+)  (?#linebr in 1 may also described as . because of
                  s-modifier dotall)
      (?:        (?# start non-backreferenced match )
        .        (?# any character but not followed by START)
        (?!START)
      )+      (?# multiple times)
      (?:     (?# start non-backreferenced match )
        \|\s*title\s*=\s* (?#find the parameter '| title = ')
      )
      ([^\r\n]+)  (?#get title now to \\2 but exclude the line break. 
                   Note it is buggy when there is no line break )
      (?:     (?# start non-backreferenced match )
        .     (?# any character but not followed by END)
        (?!END)
      )
      +       (?# multiple times)
      .       (?# any single character, e.g. the last  because as all
               stuff before captures anything not followed by END)
      (?:END) (?#a not backreferenced END)
    @msx", // m-multiline, s-dotall match also linebreaks,
           // x-extended → ignore white space in the pattern
  ), 
  array(
    "TITLE\nSTART\\0END", // \0 is the whole returned match, i.e. the template
  # replace the TITLE to  TITLEtitle contentTITLE…
    "\\2TITLE\\0",
  ),
  $wikiTemplate
);
print_r($wikiTemplate);

This prints each template with its title, tagged by TITLE, above it, but only if the template had a title:

TITLE(1. template) titleTITLE
START{{Templ1
 | title = (1. template) title
}}END

TITLE
START{{Templ2
 | any parameter = something {{template}}
}}END

TITLE(3. template) titleTITLE
START{{Templ1
 | title = (3. template) title
}}END

Any insights into my questions about understanding the regexp, or any improvements? Thanks, Andreas.

score 0 · Accepted Answer

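++ is a possessive quantifier. Appending a + to a repetition quantifier (+, *, {...}) makes it possessive. This means that once the regex engine has left the repetition for the first time, it will not backtrack into it to reduce the number of repetitions. In effect, a possessive quantifier turns the repetition into an atomic group. Sometimes this is only an optimization, and sometimes it actually changes what matches. You can do some very good reading on this here.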
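To make the difference concrete, here is a minimal sketch on a toy input (the input string and variable names are illustrative, not taken from the question). It compares a greedy +, a possessive ++ and the equivalent atomic group:

```php
<?php
// Greedy: \d+ first takes "123", then backtracks to "12"
// so that the final \d can still match "3".
$greedy = preg_match('/^\d+\d$/', '123');      // 1

// Possessive: \d++ takes "123" and never gives a digit back,
// so the final \d has nothing left to match.
$possessive = preg_match('/^\d++\d$/', '123'); // 0

// The atomic group (?>\d+) behaves like \d++.
$atomic = preg_match('/^(?>\d+)\d$/', '123');  // 0

var_dump($greedy, $possessive, $atomic);
```

In the pattern from the question, the possessive group sits inside an outer * loop. On well-formed input, + and ++ should match the same text there, which fits the observation that both work. The possessive form mainly avoids retrying different splits of the same characters, which can presumably become expensive on unbalanced input.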

As for your second question: yes, (?R) simply tries to match the complete pattern again at that point. The PHP documentation on PCRE has a good article about this.
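The recursion can be seen in isolation with a condensed form of the first pattern from the question, written without the x-mode comments. The sample text and variable names below are assumptions made for illustration:

```php
<?php
// \{\{ ... \}\} with a body that is either a run of characters that
// start neither "{{" nor "}}", or a recursive match of the whole pattern.
$pattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';

$text = 'a {{outer | p = {{inner}} }} b {{second}} c';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
// Two matches: "{{outer | p = {{inner}} }}" and "{{second}}"
```

The nested {{inner}} is consumed by (?R) as part of the outer match, so it does not show up as a separate result.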

For your other questions, it would be better to ask on Code Review.
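One possible simplification, offered only as a sketch: instead of tagging the text with START, END and TITLE placeholders, collect the balanced templates first and then search each one for its title. The variable names are hypothetical. It assumes that a nested template does not carry a title parameter of its own, since the inner search would otherwise pick that one up:

```php
<?php
$wikiTemplate = '
{{Templ1
| title = (1. template) title
}}

{{Templ2
| any parameter = something {{template}}
}}

{{Templ1
| title = (3. template) title
}}
';

// Step 1: collect each top-level balanced {{ ... }} block.
$templatePattern = '/\{\{(?:(?:(?!\{\{|\}\}).)++|(?R))*\}\}/s';
preg_match_all($templatePattern, $wikiTemplate, $found);

// Step 2: look for a single-line "| title = ..." parameter in each block.
$titles = [];
foreach ($found[0] as $template) {
    if (preg_match('/^\s*\|\s*title\s*=\s*([^\r\n]+)/m', $template, $m)) {
        $titles[] = rtrim($m[1]);
    } else {
        $titles[] = null; // this template has no title parameter
    }
}

print_r($titles);
```

Because each template is searched on its own, a template without a title cannot borrow the title of its neighbour, which is the case the look-around assertions in the question guard against.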
