c# - How to extract tag links using a regular expression (REGEX-C#)

Question

This is what I have so far:

<a href="(http://www.imdb.com/title/tt\d{7}/)".*?>.*?</a>

C#:

ArrayList imdbUrls = matchAll(@"<a href=""(http://www.imdb.com/title/tt\d{7}/)"".*?>.*?</a>", html);
private ArrayList matchAll(string regex, string html, int i = 0)
{
  ArrayList list = new ArrayList();
  foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
    list.Add(m.Groups[i].Value.Trim());
  return list;
}

I am trying to extract IMDb links from an HTML page. What is wrong with this regular expression?

The main idea is to search Google for a movie and look for a link to IMDb in the results.

score 1 · Accepted Answer

Regular expressions are not well suited to parsing HTML files. HTML is not strict, and its format is not regular.

Use HtmlAgilityPack instead. With HtmlAgilityPack you can use this code:

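// Requires: using System.Linq; using System.Text.RegularExpressions; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects all anchor tags that have an href attribute.
// Note: SelectNodes returns null when nothing matches, so check for null on real pages.
List<string> anchorImdbList = doc.DocumentNode.SelectNodes("//a[@href]")
                  .Select(p => p.Attributes["href"].Value)
                  .Where(x => Regex.IsMatch(x, @"www\.imdb\.com")) // keep only IMDb links
                  .ToList();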
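
The filtering step above accepts any URL that mentions www.imdb.com. If only title pages are wanted, as in the question, the check can be narrowed. The following is a minimal sketch that uses only the standard library and assumes the href values have already been extracted; the class name `ImdbLinkFilter`, the method name `FilterImdbTitleLinks`, and the exact pattern are illustrative assumptions, not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ImdbLinkFilter
{
    // Assumed pattern: an IMDb title URL of the form used in the question,
    // e.g. http://www.imdb.com/title/tt0000001/ (the ID here is a placeholder).
    private static readonly Regex ImdbTitle =
        new Regex(@"^https?://www\.imdb\.com/title/tt\d{7}/?", RegexOptions.IgnoreCase);

    // Keeps only the hrefs that look like IMDb title pages and drops duplicates.
    public static List<string> FilterImdbTitleLinks(IEnumerable<string> hrefs)
    {
        return hrefs
            .Where(h => !string.IsNullOrEmpty(h) && ImdbTitle.IsMatch(h))
            .Distinct()
            .ToList();
    }
}
```

It could be called as `ImdbLinkFilter.FilterImdbTitleLinks(anchorImdbList)` on the list built above.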

score 0 · Accepted Answer

Try this:

string tag = "tag of the link";
string emptystring = Regex.Replace(tag, "<.*?>", string.Empty);

Update:

string emptystring = Regex.Replace(tag, @"<[^>]*>", string.Empty);
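
One practical difference between the two patterns: by default `.` does not match a newline in .NET, so `<.*?>` skips a tag that is split across lines, while `<[^>]*>` still matches it. A small sketch to illustrate this, with the hypothetical class name `TagStripDemo` and a made-up sample string:

```csharp
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // First version from the answer: lazy dot, which stops at line breaks by default.
    public static string StripLazyDot(string input)
    {
        return Regex.Replace(input, "<.*?>", string.Empty);
    }

    // Updated version: a negated character class, which also spans line breaks.
    public static string StripNegatedClass(string input)
    {
        return Regex.Replace(input, @"<[^>]*>", string.Empty);
    }

    // Made-up sample: an anchor whose opening tag is split across two lines.
    public const string Sample =
        "<a\n href=\"http://www.imdb.com/title/tt0000001/\">Movie</a>";
}
```

For `TagStripDemo.Sample`, `StripNegatedClass` removes both tags and leaves only the link text, whereas `StripLazyDot` removes only the closing `</a>`.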

score 0 · Accepted Answer

You need to escape the slashes. Try:

<a href="(http:\/\/www.imdb.com\/title\/tt\d{7}\/)".*?>.*?<\/a>

When you need to parse HTML elements out of complex pages, regular expressions become very cumbersome. As others have suggested, try HtmlAgilityPack.
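
For comparison, here is a hedged sketch that runs both the unescaped and the escaped pattern through .NET's `Regex`. In .NET the forward slash is not a regex metacharacter, so both forms match the same text. The sketch also reads the URL from capture group 1; the question's `matchAll` call relies on the default `i = 0`, which returns the whole anchor rather than the captured URL. The class name `ImdbAnchorRegex` and the method name `ExtractUrls` are illustrative, the dots in the host name are escaped here, and `RegexOptions.Singleline` is an assumption so that `.` can cross line breaks.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImdbAnchorRegex
{
    // Pattern from the question, with the dots in the host name escaped.
    public const string Unescaped =
        @"<a href=""(http://www\.imdb\.com/title/tt\d{7}/)"".*?>.*?</a>";

    // Pattern from this answer, with every forward slash escaped.
    public const string Escaped =
        @"<a href=""(http:\/\/www\.imdb\.com\/title\/tt\d{7}\/)"".*?>.*?<\/a>";

    // Returns the URL captured by group 1 for every matching anchor.
    public static List<string> ExtractUrls(string pattern, string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
        {
            urls.Add(m.Groups[1].Value); // group 0 would be the entire <a>...</a> match
        }
        return urls;
    }
}
```

Both patterns only match anchors whose `href` is the first attribute, which is one reason an HTML parser is the more robust choice.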
