c# - Using regex to get text between multiple HTML tags

Question

Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:

<div>first html tag</div>
<div>another tag</div>

would output:

first html tag
another tag

The regex pattern I am using matches only the last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test

score 17 · Accepted Answer

Replace your pattern with a non-greedy match:

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
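To see why the change helps, here is a minimal sketch that runs a greedy and a lazy pattern against the question's input. The class name `GreedyVsLazyDemo`, the helper `InnerTexts`, and the named group `inner` are illustrative assumptions and not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: collects the "inner" group of every match.
    public static List<string> InnerTexts(string input, string pattern)
    {
        var texts = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            texts.Add(m.Groups["inner"].Value);
        return texts;
    }

    public static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: ".*" runs as far as it can, so the opening-tag part
        // swallows the whole first div and only one match remains.
        List<string> greedy = InnerTexts(input, @"<div.*>(?<inner>.*)</div>");

        // Lazy: ".*?" stops at the first position where the rest can match.
        List<string> lazy = InnerTexts(input, @"<div.*?>(?<inner>.*?)</div>");

        Console.WriteLine("Greedy: {0} match(es)", greedy.Count);
        Console.WriteLine("Lazy:   {0} match(es)", lazy.Count);
        foreach (string text in lazy)
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

With the greedy pattern the single match captures only the text of the last div, which is the behavior described in the question. The lazy pattern yields one match per div.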

score 10 · Accepted Answer

Since nobody else mentioned HTML tags with attributes, here is my solution for handling them:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World
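The `<TAG(.*?)>(.*?)</TAG>` template above could be wrapped in a small helper that takes the tag name as a parameter. This is only a sketch: `TagTextDemo` and `FirstInnerText` are made-up names, and the pattern swaps `(.*?)` for `\b[^>]*` so that `h1` does not also match a longer tag name such as `h10`. It assumes attribute values contain no `>` character and does not handle nested tags of the same name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagTextDemo
{
    // Hypothetical helper: returns the inner text of the first <tag ...>...</tag>,
    // or an empty string when no such tag is found.
    public static string FirstInnerText(string html, string tag)
    {
        string name = Regex.Escape(tag);
        // \b stops "h1" from matching "h10"; [^>]* skips any attributes.
        string pattern = "<" + name + @"\b[^>]*>(.*?)</" + name + ">";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        string html = "Hello <h1 style='color: red;'>World</h1> !!";
        Console.WriteLine(FirstInnerText(html, "h1")); // World
    }
}
```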

score 1 · Accepted Answer

I think this code should work:

// Requires: using System.Collections; using System.Text.RegularExpressions;
string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips attributes in the opening tag; (.*?) lazily captures the inner text.
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
ArrayList l = new ArrayList();
foreach (Match match in matches)
{
    l.Add(match.Groups[1].Value);
}
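The two options passed to `Regex.Matches` matter once the markup is less tidy than the sample string. A rough sketch of their effect, using an assumed input with an upper-case tag and a line break inside the div (`SinglelineDemo` and `DivTexts` are illustrative names, and `List<string>` stands in for `ArrayList`):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical helper: inner text of every div, using the given options.
    public static List<string> DivTexts(string html, RegexOptions options)
    {
        var texts = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div[^>]*?>(.*?)</div>", options))
            texts.Add(m.Groups[1].Value);
        return texts;
    }

    public static void Main()
    {
        // The first div is upper-case and its content spans two lines.
        string html = "<DIV>first\nhtml tag</DIV><div>another tag</div>";

        // Without options: the upper-case div is skipped.
        Console.WriteLine(DivTexts(html, RegexOptions.None).Count);

        // IgnoreCase accepts <DIV>; Singleline lets "." match the newline.
        Console.WriteLine(DivTexts(html, RegexOptions.IgnoreCase | RegexOptions.Singleline).Count);
    }
}
```

Without the options only the second div is found. With both options the first div is matched as well, including the line break in its content.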

score 1 · Accepted Answer

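The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression fails to extract the information you want.

The reason is that HTML is described by a context-free grammar, which is a more expressive class than the regular languages that regular expressions can match.

Here's an example: what if you have multiple stacked divs?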

<div><div>stuff</div><div>stuff2</div></div>

Depending on how they are written, the regexes listed in the other answers can end up grabbing any of these:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

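That is simply what regexes do when they try to parse HTML.

You can't write a regex that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific, constrained set of HTML it may be possible, but you should keep this fact in mind.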

More: https://stackoverflow.com/a/1732454/2022565
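As a concrete illustration, the sketch below runs one lazy pattern of the kind suggested in this thread against the stacked-div example. `NestedDivDemo` and `Captures` are made-up names, and the pattern is just one representative choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Hypothetical helper: group 1 of every match of a lazy div pattern.
    public static List<string> Captures(string html)
    {
        var captures = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<div[^>]*>(.*?)</div>"))
            captures.Add(m.Groups[1].Value);
        return captures;
    }

    public static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";
        foreach (string capture in Captures(html))
            Console.WriteLine("[{0}]", capture);
        // Prints:
        // [<div>stuff]
        // [stuff2]
    }
}
```

The outer div is paired with the first closing tag it meets, so its capture contains a stray `<div>` and the outer element as a whole is never returned. The pattern has no way to count nesting depth.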

score 1 · Accepted Answer

First of all, keep in mind that HTML files contain newline characters ("\n"), which are not present in the string you are using to test your regex.

Second, use one of these regexes:

((<div.*>)(.*)(<\\/div>))+ //This Regex will look for any amount of div tags, but it must see at least one div tag.

((<div.*>)(.*)(<\\/div>))* //This regex will look for any amount of div tags, and it will not complain if there are no results at all.

Also, a good place to look for this kind of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

メイマン

score 1 · Accepted Answer

I hope the regex below will work:

<div.*?>(.*?)</div>

You will get your desired output:

This is a test
This is ANOTHER test
