c# - How can I parse this HTML to get the content I need?

Question

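I'm currently trying to parse an HTML document and retrieve all of the footnotes in it; the document contains dozens of them. I can't work out what expression to use to extract all of the content I need. The problem is that the classes (e.g. "calibre34") are randomized in every document. The only way to tell where a footnote is located is to search for "hide"; what follows it is always the footnote text, and it is closed by a </td> tag. Below is an example of one of the footnotes in the HTML document. All I want is the text. Any ideas? Thanks, everyone!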

<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>

score 4 · Accepted Answer

Use HtmlAgilityPack to load the HTML document, then extract the footnotes with the following XPath:

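//td[contains(., '[hide]')]/following-sibling::td[1]

This selects every td node that contains [hide], then steps to its next sibling, that is, the td that follows it. The test is contains(., '[hide]') rather than text()='[hide]' because in the sample the [hide] text sits inside a nested a element, not directly in the td. Once you have this collection of nodes, you can extract their inner text (in C#, HtmlAgilityPack provides support for this).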
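As a rough sketch of how that could look in C#, assuming the HtmlAgilityPack NuGet package is referenced: the class name FootnoteExtractor and the method name ExtractFootnotes are made up for illustration. The XPath uses contains(., '[hide]') so that it also matches when [hide] is nested inside a span/a within the td, as it is in the sample from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class FootnoteExtractor
{
    // Returns the text of each td that immediately follows a td containing "[hide]".
    public static List<string> ExtractFootnotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var footnotes = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(
            "//td[contains(., '[hide]')]/following-sibling::td[1]");
        if (cells == null)
        {
            return footnotes;
        }

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &amp; and trim the surrounding whitespace.
            footnotes.Add(WebUtility.HtmlDecode(cell.InnerText).Trim());
        }
        return footnotes;
    }
}
```

Calling FootnoteExtractor.ExtractFootnotes(html) on the document's source should give one string per footnote, regardless of what the randomized class names happen to be.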

score 3 · Accepted Answer

How about using MSHTML to parse the HTML source? Here is some demo code.

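using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // requires a COM reference to the Microsoft HTML Object Library

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public mshtml.IHTMLDocument2 oHtmlDoc;

    // Downloads the page at the given URL and loads it into an MSHTML document.
    public CHtmlPraseDemo(string url)
    {
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
        oHtmlDoc.close(); // finish writing so the document is fully parsed
    }

    // Returns the inner HTML of every td whose class attribute equals TdClassName.
    public List<String> GetTdNodes(string TdClassName)
    {
        List<String> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = (IHTMLElementCollection)ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    // Fetches the raw HTML source of the page.
    void GetWebContent(string strUrl)
    {
        using (WebClient wc = new WebClient())
        {
            strHtmlSource = wc.DownloadString(strUrl);
        }
    }
}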

class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");

        Console.Write(oH.oHtmlDoc.title);

        // "x" is the class name of the td elements to collect.
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n);
        }

        Console.Read();
    }
}
