c# - Extract both text and images using HtmlAgilityPack

Question

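I am extracting content from a web page. On this page, information such as phone numbers and email IDs is stored in images. I want to extract those images along with the text in the table. The output string should present the text and images the same way they appear on the web page.

Below is the content of the web page: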

<table>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
<tr>
   <td>text</td>
   <td><img src="" /></td>
</tr>
</table>

Is it possible to extract both the text and the images like this?

text image

text image

text image

score 1 · Accepted Answer

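using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");

// First <img> inside a table cell (null if there is none)
HtmlNode imgNode = doc.DocumentNode.SelectSingleNode("//table//td/img");

// Get the images only.
// SelectNodes returns null when nothing matches, so guard before iterating.
HtmlNodeCollection imgs = doc.DocumentNode.SelectNodes("//img");
if (imgs != null)
{
  foreach (HtmlNode img in imgs)
  {
    // GetAttributeValue falls back to "" when src is missing
    string imgSrc = img.GetAttributeValue("src", "");
  }
}

// Get the text of the cells that do not directly contain an <img>
HtmlNodeCollection tds = doc.DocumentNode.SelectNodes("//td");
if (tds != null)
{
  foreach (HtmlNode td in tds)
  {
    HtmlNode img = td.ChildNodes["img"];
    if (img == null)
    {
      string tdText = td.InnerText;
    }
  }
}

// Get images that have a style attribute
HtmlNodeCollection styled = doc.DocumentNode.SelectNodes("//img[@style]");
if (styled != null)
{
  foreach (HtmlNode img in styled)
  {
    // Note: ToLower() also lowercases the URL itself, which matters on case-sensitive servers
    string style = img.Attributes["style"].Value.ToLower();
    style = style.Replace("background:url('", "");
    style = style.Replace("')", "");
    // style now holds the image URL taken from the background declaration
  }
}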
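
The loops above collect images and text separately, while the question asks for them side by side, one line per row. A minimal sketch of that pairing follows, assuming each `<tr>` holds its text and at most one `<img>` as in the sample table. `RowExtractor` and `ExtractRows` are hypothetical names chosen for illustration and are not part of HtmlAgilityPack; only `LoadHtml`, `SelectNodes`, `SelectSingleNode`, `GetAttributeValue`, and `HtmlEntity.DeEntitize` come from the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RowExtractor
{
    // Returns one (text, image src) pair per table row, in document order.
    public static List<(string Text, string ImageSrc)> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<(string Text, string ImageSrc)>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection trNodes = doc.DocumentNode.SelectNodes("//table//tr");
        if (trNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode tr in trNodes)
        {
            // Row text with markup stripped, entities decoded, outer whitespace trimmed.
            string text = HtmlEntity.DeEntitize(tr.InnerText).Trim();

            // First <img> in this row, if any; the leading "." keeps the search inside tr.
            HtmlNode img = tr.SelectSingleNode(".//img");
            string src = img?.GetAttributeValue("src", string.Empty) ?? string.Empty;

            rows.Add((text, src));
        }

        return rows;
    }
}
```

Each pair can then be written out as one line, for example with `Console.WriteLine($"{row.Text} {row.ImageSrc}")` inside a `foreach` over the result. Rows without an image come back with an empty `ImageSrc`, so the caller decides how to display them.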

score 0 · Accepted Answer

Try this:

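// root is assumed to be the document's root node (e.g. doc.DocumentNode) and
// anchorTags a List<string> collecting the image URLs; neither is defined in this snippet.
// SelectNodes returns null if the page has no <img>, so check before looping on real pages.
foreach (HtmlNode img in root.SelectNodes("//img"))
{
    // GetAttributeValue falls back to "" when src is missing
    string att = img.GetAttributeValue("src", "");
    anchorTags.Add(att);
}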
