c# - Scraping with HtmlAgilityPack

Question

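I have a huge HTML page that I want to scrape values from.

I tried using Firebug to get the XPath of the element I need, but the XPath is not static; it changes from time to time. How can I reliably get the value I want?

In the following snippet, I want to get the hourly lumber production, which is 20.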

    <div class="boxes-contents cf"><table id="production" cellpadding="1" cellspacing="1">
    <thead>
        <tr>
            <th colspan="4">
                Production per hour:            </th>
        </tr>
    </thead>
    <tbody>
                <tr>
            <td class="ico">
                <img class="r1" src="img/x.gif" alt="Lumber" title="Lumber" />
            </td>
            <td class="res">
                Lumber:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r2" src="img/x.gif" alt="Clay" title="Clay" />
            </td>
            <td class="res">
                Clay:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r3" src="img/x.gif" alt="Iron" title="Iron" />
            </td>
            <td class="res">
                Iron:
            </td>
            <td class="num">
                20          </td>
        </tr>
                <tr>
            <td class="ico">
                <img class="r4" src="img/x.gif" alt="Crop" title="Crop" />
            </td>
            <td class="res">
                Crop:
            </td>
            <td class="num">
                59          </td>
        </tr>
            </tbody>
</table>
    </div>

score 1 · Accepted Answer

With the Html Agility Pack you could do something like this:

    // Requires: using System.IO; using System.Linq;
    // webclient is assumed to be an existing System.Net.WebClient, url the page address.
    byte[] htmlBytes = webclient.DownloadData(url);
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    using (var htmlMemStream = new MemoryStream(htmlBytes))
    using (var htmlStreamReader = new StreamReader(htmlMemStream))
    {
        htmlDoc.LoadHtml(htmlStreamReader.ReadToEnd());
    }

    // First <table> in the document (null if the page has no table).
    var table = htmlDoc.DocumentNode.Descendants("table").FirstOrDefault();

    // First <td class="num"> in that table; in the snippet this is the Lumber row.
    var lumberTd = table.Descendants("td").FirstOrDefault(node => node.Attributes["class"] != null && node.Attributes["class"].Value == "num");

    string lumberValue = lumberTd.InnerText.Trim();

Warning: `FirstOrDefault()` can return null, so you should probably add a check for that before using `table` and `lumberTd`.

Hope that helps.
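
The null check mentioned above could look something like the following. This is only a sketch: `ProductionReader` and `TryGetFirstNum` are hypothetical names, and it assumes the page has already been downloaded into a string. It also picks the table by its `id="production"` attribute instead of taking the first table on the page, which is an assumption based on the snippet in the question.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class ProductionReader // hypothetical helper, not part of HtmlAgilityPack
{
    // Returns the trimmed text of the first <td class="num"> inside the
    // table with id="production", or null if either element is missing.
    public static string TryGetFirstNum(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var table = doc.DocumentNode
            .Descendants("table")
            .FirstOrDefault(t => t.GetAttributeValue("id", "") == "production");
        if (table == null)
            return null; // no production table on this page

        var numTd = table
            .Descendants("td")
            .FirstOrDefault(td => td.GetAttributeValue("class", "") == "num");

        // The null-conditional operator yields null when no matching cell exists.
        return numTd?.InnerText.Trim();
    }
}
```

Returning null leaves it to the caller to decide what a missing value means (layout changed, not logged in, and so on) instead of failing with a `NullReferenceException`.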

score 0 · Accepted Answer

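    // Requires: using System.Linq;
    // fileName is assumed to be the path to a saved copy of the page.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(fileName);

    // Find the row whose icon is titled "Lumber", then read its <td class="num"> cell.
    var result = doc.DocumentNode.SelectNodes("//div[@class='boxes-contents cf']//tbody/tr")
                    .First(tr => tr.Element("td").Element("img").Attributes["title"].Value == "Lumber")
                    .Elements("td")
                    .First(td => td.Attributes["class"].Value == "num")
                    .InnerText
                    .Trim();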
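
The same lookup can probably be written as a single XPath expression anchored on attributes that are less likely to change (the table `id` and the image `title`) rather than on the element's position, which may help with the XPath that "changes from time to time". A minimal sketch, assuming the markup matches the snippet in the question; `ProductionXPath` and `GetProduction` are hypothetical names.

```csharp
using HtmlAgilityPack;

public static class ProductionXPath // hypothetical helper, not part of HtmlAgilityPack
{
    // Returns the hourly production for the given resource title
    // (e.g. "Lumber"), or null when the row is missing or not a number.
    public static int? GetProduction(string html, string resourceTitle)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // resourceTitle is inserted as-is, so it must not contain a single quote.
        string xpath = "//table[@id='production']//tr[td/img[@title='"
                       + resourceTitle + "']]/td[@class='num']";

        // SelectSingleNode returns null when nothing matches.
        HtmlNode cell = doc.DocumentNode.SelectSingleNode(xpath);
        if (cell == null)
            return null;

        return int.TryParse(cell.InnerText.Trim(), out int value) ? value : (int?)null;
    }
}
```

Called as `ProductionXPath.GetProduction(html, "Lumber")`, this avoids the exceptions that `First(...)` and `Attributes["title"].Value` would throw if a row had no image or no matching cell.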
