html - Regex to remove everything except the DIV contents

Question

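I'm using jEdit and have a lot of badly coded HTML files. I want to extract only the main content, not the surrounding HTML.

I need everything between <div class="main-text"> and </div>.

I need a regex way to do this. jEdit lets you search and replace with regular expressions.

I'm not fluent in regex and it would take me a long time to work this out. Can anyone help quickly?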

score 1 · Accepted Answer

Taking your question literally, you can replace:

/.*<div class="main-text">(.*?)<\/div>.*/

with \1 (or $1, depending on what your editor uses).
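As a rough illustration of that replacement outside an editor, here is a minimal Python sketch. The sample page is made up, and the re.DOTALL flag is an assumption added so that "." also matches newlines; without an equivalent setting the pattern only works when the whole page sits on one line.

```python
import re

# Hypothetical sample page; real files would be read from disk instead.
html = """<html><body>
<div id="nav">menu</div>
<div class="main-text">
Hello, world.
</div>
<div id="footer">footer</div>
</body></html>"""

# re.DOTALL lets "." match newlines, so ".*" can swallow the surrounding
# page instead of stopping at the end of a line.
pattern = re.compile(r'.*<div class="main-text">(.*?)</div>.*', re.DOTALL)

# Replace the entire document with the contents of capture group 1.
main_text = pattern.sub(r"\1", html, count=1)
print(main_text.strip())
```

In jEdit the same idea would be typed into the search and replace dialog with regular expressions enabled, rather than run as a script.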

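But what happens if your "main text" element contains another <div>? That will come back to bite you, because the match stops at the first </div> it finds. If you're sure this can't happen, you're fine. Otherwise, you're in trouble. It may be easier to replace /.*<div class="main-text">/ with an empty string, then find the end by hand and delete everything after it.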
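The nested-div problem and the fallback can both be seen in a small Python sketch. The input string is invented for the demonstration, and re.DOTALL is again an assumption so that "." spans newlines.

```python
import re

# Hypothetical page whose main block contains a nested <div>.
html = (
    '<body><div class="main-text">intro'
    '<div class="note">aside</div>'
    'conclusion</div><div id="footer">footer</div></body>'
)

# The lazy group stops at the FIRST </div>, which closes the nested
# element, so "conclusion" is silently dropped.
naive = re.sub(
    r'.*<div class="main-text">(.*?)</div>.*', r"\1", html, flags=re.DOTALL
)
print(naive)

# The fallback: strip everything up to and including the opening tag,
# then find the real end by hand.
head_removed = re.sub(
    r'.*<div class="main-text">', "", html, count=1, flags=re.DOTALL
)
print(head_removed)
```

The second result still carries the footer and closing tags, which is the part that has to be trimmed manually.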

For that matter, this task may be easiest to do by hand, since then you don't have to double-check the result after the code has run.

score 0 · Accepted Answer

This regex should solve your problem: /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi

Here is an example in Perl:

my $str = '<div class="main-text"> and the next </div>';
# Print the captured text only if the match succeeded.
if ($str =~ /<\s*div\s+class="main-text"[^>]*>(.*?)<\/div>/gi) {
    print $1;
}

The example is in Perl, but the regex can be applied in any language.

Here is an explanation of the regex:

/       -start of the regex
   <\s*    -we can have < and whitespace after it
      div     -matches "div"
         \s+     -matches one or more whitespaces after the <div
         class="main-text"    -matches class="main-text" (so <div class="main-text" to here)
         [^>]*       -matches everything except >, this is because you may have more attributes of the div
         >          -matches >, so <div class="main-text"> until now
      (.*?)        -matches everything until </div> and saves it in $1
   <\/div>        -matches </div>, so now we have <div class="main-text">( and the next )</div> until now
/gi       -end of the regex; g matches globally (every occurrence), i makes it case insensitive
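For readers who prefer the breakdown in runnable form, here is a Python sketch of the same pattern written with re.VERBOSE so each part carries its own comment. The name MAIN_TEXT and the sample input are made up, and re.DOTALL is an addition not present in the original flags; it is assumed here so the content may span several lines.

```python
import re

# The same pattern, laid out with re.VERBOSE so each piece can be commented.
MAIN_TEXT = re.compile(
    r"""
    <\s*div\s+            # "<", optional whitespace, "div", then whitespace
    class="main-text"     # the class attribute we are looking for
    [^>]*>                # any further attributes, then the closing ">"
    (.*?)                 # lazily capture the content (group 1)
    </div>                # the first closing tag that follows
    """,
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)

# Hypothetical input with two matching blocks, one of them in upper case.
html = (
    '<DIV class="main-text" id="a">first</DIV>\n'
    '<div class="main-text">second\nline</div>'
)

# findall plays the role of the g flag: it returns every capture.
captures = MAIN_TEXT.findall(html)
print(captures)
```

As with the original pattern, a nested div inside the main block would still end the capture early.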

score 0 · Accepted Answer

This regex captures the text between the HTML tags:

<(?<tag>div).*?>(?<text>.*)</\k<tag>>

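Breakdown:

<(?<tag>div).*?>: the first opening tag containing div. This group is named "tag".
(?<text>.*): captures the text between the tags
</\k<tag>>: the closing div tag, a backreference to the group named "tag"

In the end, the match produces two groups, "tag" and "text", and the content you want is in "text".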
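The (?<tag>...) and \k<tag> spelling is not accepted by every regex engine. As a hedged illustration, Python's re module writes the same thing as (?P<tag>...) and (?P=tag). The sample string below is invented, and re.DOTALL is an assumption so the text may contain newlines.

```python
import re

# Python's named-group syntax: (?P<name>...) defines, (?P=name) refers back.
pattern = re.compile(r"<(?P<tag>div).*?>(?P<text>.*)</(?P=tag)>", re.DOTALL)

# Hypothetical input: a paragraph to skip, then the div we want.
m = pattern.search('<p>skip</p><div class="main-text">wanted text</div>')
if m:
    print(m.group("tag"))
    print(m.group("text"))
```

Note that the "text" group is greedy, so on a page with several divs it runs up to the last closing </div> rather than the first.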
