regex - Regular expression for replacing nested quotes in HTML

Question

I have HTML with multiple nested quotes, like this:

<div class="quote-container">
   <div class="quote-block">
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
   </div>
</div>

I need to find and remove the quotes. I use this expression:

<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>

This works for a single quote, but it fails for multiple nested quotes (the example above).

What I need is to match:

<div class="quote-container">.*<div class="quote-block">

followed by any string that does not contain

<div

and ending with

.*</div>.*</div>

I tried lookbehind and lookahead assertions, like this:

<div class="quote-container">.*<div class="quote-block">.*(?!<div).*</div>.*</div>

But they don't work.

Is there a way to do this? I need a Perl expression that I can use in TextPipe (I use it to parse a forum and later run text-to-speech on the result).

Thanks in advance.

score 0 · Accepted Answer

Regular expressions are not the right tool for manipulating nested structures. I would write a dedicated parser for this problem (a simple stack-based parser is enough).
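A minimal sketch of what such a stack-based parser could look like, written in Python for illustration (the question asks for Perl/TextPipe, so this shows the idea, not a drop-in solution). The function name `strip_quotes` is hypothetical, and the sketch assumes the quote wrapper is always written exactly as `<div class="quote-container">`, as in the question.

```python
import re

# Any opening <div ...> tag or closing </div> tag.
DIV_TAG = re.compile(r'<div\b[^>]*>|</div\s*>', re.IGNORECASE)
# Assumption: the quote wrapper carries exactly this class attribute.
QUOTE_OPEN = re.compile(r'<div\s+class="quote-container"\s*>', re.IGNORECASE)

def strip_quotes(html):
    """Remove every quote-container div together with everything nested in it."""
    out = []    # pieces of text that survive
    stack = []  # open <div> tags inside the quote currently being removed
    pos = 0     # start of the text not yet copied to `out`
    for m in DIV_TAG.finditer(html):
        tag = m.group(0)
        if not stack:
            if QUOTE_OPEN.fullmatch(tag):
                out.append(html[pos:m.start()])  # keep the text before the quote
                stack.append(tag)
            # any other div outside a quote is ordinary content and is kept
        elif tag.startswith('</'):
            stack.pop()
            if not stack:
                pos = m.end()  # outermost quote closed: resume copying after it
        else:
            stack.append(tag)  # a div nested inside the quote
    if not stack:
        out.append(html[pos:])  # an unclosed quote swallows the rest of the input
    return ''.join(out)

nested = ('before<div class="quote-container"><div class="quote-block">a'
          '<div class="quote-container"><div class="quote-block">b</div></div>'
          'c</div></div>after')
print(strip_quotes(nested))  # beforeafter
```

Because the stack tracks every open `div` inside a quote, the closing tag that ends the outermost quote is identified by depth and not by pattern matching, which is why nesting is not a problem here.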

score 0 · Accepted Answer

I think your problem is that you are using the greedy expression .*.

Try replacing every .* with the non-greedy .*?
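To see what the non-greedy change does and does not fix, here is a small Python comparison (Python's `re` treats `.*` and `.*?` the same way Perl does for this purpose). The sample strings are made up for illustration. The non-greedy version handles quotes that sit side by side, but on a nested quote it stops at the first pair of closing tags and leaves the tail of the outer quote behind.

```python
import re

# The pattern from the question, with re.DOTALL so "." also matches newlines.
GREEDY = re.compile(
    r'<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>',
    re.DOTALL)
# The same pattern with every .* replaced by the non-greedy .*?
LAZY = re.compile(
    r'<div class="quote-container">.*?<div class="quote-block">.*?</div>.*?</div>',
    re.DOTALL)

quote = '<div class="quote-container"><div class="quote-block">q</div></div>'
flat = 'A' + quote + 'B' + quote + 'C'   # two separate quotes, not nested

greedy_result = GREEDY.sub('', flat)     # 'AC'  : the text between the quotes is lost
lazy_result = LAZY.sub('', flat)         # 'ABC' : each quote is removed on its own

nested = ('<div class="quote-container"><div class="quote-block">outer'
          + quote + 'tail</div></div>')
lazy_nested = LAZY.sub('', nested)       # 'tail</div></div>' : outer quote is cut short

print(greedy_result, lazy_result, lazy_nested)
```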

score 0 · Accepted Answer

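Personally, I would solve this by replacing quotes repeatedly until there are none left to replace. There really is no way to handle this in a single regex substitution. You need to do something like this:

Pseudocode: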

html = "... from your post ..."
do {
    oldhtml = html          // remember the text before this pass
    html = replace(
        '/<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>/s',
        '',
        html
    )
} while (html != oldhtml)   // stop once a pass changes nothing

This handles quotes nested to any depth.
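A runnable Python sketch of the same repeat-until-nothing-changes loop. Note one deliberate change from the pseudocode: instead of the greedy pattern, this sketch uses a pattern that only matches a quote whose body contains no further `<div` (an assumption about what "innermost quote" should mean here), so each pass peels off the innermost quotes and leaves surrounding text alone. The names `INNERMOST` and `remove_quotes` are hypothetical.

```python
import re

# Matches a quote whose body contains no nested "<div": an innermost quote.
# (?:(?!<div\b).)*? consumes one character at a time, refusing to step onto "<div".
INNERMOST = re.compile(
    r'<div class="quote-container">\s*<div class="quote-block">'
    r'(?:(?!<div\b).)*?'
    r'</div>\s*</div>',
    re.DOTALL)

def remove_quotes(html):
    """Strip innermost quotes repeatedly until a pass removes nothing."""
    while True:
        html, count = INNERMOST.subn('', html)
        if count == 0:
            return html

sample = '''before
<div class="quote-container">
   <div class="quote-block">
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
   </div>
</div>
after'''

print(repr(remove_quotes(sample)))  # 'before\n\nafter'
```

The first pass removes the two inner quotes, the second pass removes the now-flat outer quote, and the third pass finds nothing and ends the loop. The same tempered pattern should carry over to a Perl-compatible regex engine, though whether TextPipe can repeat a substitution until it stops matching is not something this sketch assumes.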
