c# - HtmlAgilityPack and dynamic content issue

Question

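I want to create a web __scraper__ application, and I want to build it with the WebBrowser control, HtmlAgilityPack, and XPath.

So far I have managed to create an XPath generator (I used the WebBrowser control for this), which works fine, but sometimes I cannot grab content that is generated dynamically (via JavaScript or AJAX). I also found that when the WebBrowser control (actually the IE browser) generates extra tags such as "tbody", HtmlAgilityPack does not see them when loading with `htmlWeb.Load(webBrowser.DocumentStream);`.

Another note: I found that the following code actually grabs the current source of the web page, but I could not work out how to pass it to HtmlAgilityPack: `(mshtml.IHTMLDocument3)webBrowser.Document.DomDocument;`

Can you please help me with this?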

score 30 · Accepted Answer

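I spent hours trying to get HtmlAgilityPack to pick up AJAX-generated dynamic content from a web page.

The answer is hidden in a comment under the original post, so I thought I should spell it out here.

This is the method I used at first, and it did not work: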

// Requires: using System.IO; using System.Net; using System.Text;
private void LoadTraditionalWay(String url)
{
    // Fetches only the raw HTML the server returns; no script is executed.
    WebRequest myWebRequest = WebRequest.Create(url);
    using (WebResponse myWebResponse = myWebRequest.GetResponse())
    using (Stream receiveStream = myWebResponse.GetResponseStream())
    using (TextReader reader = new StreamReader(receiveStream, Encoding.UTF8))
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load(reader);
    }
}

WebRequest only downloads the markup. It does not render the page or execute the AJAX queries that produce the missing content.

This is the solution that worked:

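// Requires a reference to Microsoft.mshtml and: using System.IO; using System.Windows.Forms;
private void LoadHtmlWithBrowser(String url)
{
    // Suppress script error dialogs so they do not interrupt navigation.
    webBrowser1.ScriptErrorsSuppressed = true;
    webBrowser1.Navigate(url);

    // Block (while pumping messages) until the page has finished loading.
    waitTillLoad(this.webBrowser1);

    // Serialize the live DOM, including script-generated content, and re-parse it.
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)webBrowser1.Document.DomDocument;
    StringReader sr = new StringReader(documentAsIHtmlDocument3.documentElement.outerHTML);
    doc.Load(sr);
}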

private void waitTillLoad(WebBrowser webBrControl)
{
    WebBrowserReadyState loadStatus;
    int waittime = 100000;
    int counter = 0;

    // Phase 1: pump messages until navigation has actually started (the control
    // leaves the Complete state) or the iteration budget runs out.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if ((counter > waittime) || (loadStatus == WebBrowserReadyState.Uninitialized) || (loadStatus == WebBrowserReadyState.Loading) || (loadStatus == WebBrowserReadyState.Interactive))
        {
            break;
        }
        counter++;
    }

    // Phase 2: pump messages until the document is complete and the control is idle.
    // Note: this loop has no timeout and spins until the page finishes loading.
    while (true)
    {
        loadStatus = webBrControl.ReadyState;
        Application.DoEvents();
        if (loadStatus == WebBrowserReadyState.Complete && !webBrControl.IsBusy)
        {
            break;
        }
    }
}

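The idea is to load the page with the WebBrowser control, which can render the AJAX content, wait until the page has fully rendered, and then use the Microsoft.mshtml library to re-parse the resulting HTML into HtmlAgilityPack.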
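As a variation on the waiting step, here is a minimal sketch that relies on the `DocumentCompleted` event instead of a `DoEvents` loop. `WebBrowserWait` and `NavigateAsync` are hypothetical names, not part of the answer above, and the sketch assumes it is called from the UI thread. It also assumes that the control reaching `Complete` is good enough; content fetched by AJAX calls that run after the load event may still be missing at that point.

```csharp
using System;
using System.Threading.Tasks;
using System.Windows.Forms;

public static class WebBrowserWait
{
    // Hypothetical helper: starts navigation and returns a task that completes
    // once the control reports that the whole document has finished loading.
    public static Task NavigateAsync(WebBrowser browser, string url)
    {
        var tcs = new TaskCompletionSource<bool>();
        WebBrowserDocumentCompletedEventHandler handler = null;
        handler = (sender, e) =>
        {
            // DocumentCompleted is also raised for frames, so keep waiting
            // until the control itself is in the Complete state.
            if (browser.ReadyState != WebBrowserReadyState.Complete)
            {
                return;
            }
            browser.DocumentCompleted -= handler;
            tcs.TrySetResult(true);
        };
        browser.DocumentCompleted += handler;
        browser.Navigate(url);
        return tcs.Task;
    }
}
```

From an `async` event handler this could be used as `await WebBrowserWait.NavigateAsync(webBrowser1, url);`, followed by the same `mshtml.IHTMLDocument3` re-parse shown in `LoadHtmlWithBrowser`.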

This was the only way I could get access to the dynamic data.

I hope it helps someone.

score 2 · Accepted Answer

Would Selenium do the trick? As far as I know, it creates an instance of a browser engine (more or less), so it should let you execute JavaScript and then read back the resulting, manipulated DOM.
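A minimal sketch of that suggestion, assuming the Selenium WebDriver .NET bindings and a Chrome driver are installed. `SeleniumLoader` and `LoadRendered` are hypothetical names, and `ChromeDriver` is only one possible choice of `IWebDriver`. Content that arrives through later AJAX calls may need an explicit wait before `PageSource` is read.

```csharp
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

public static class SeleniumLoader
{
    // Hypothetical helper: lets a real browser load the page and run its
    // scripts, then hands the resulting page source to HtmlAgilityPack.
    public static HtmlDocument LoadRendered(string url)
    {
        // Assumption: Chrome and a matching driver are available on this machine.
        using (IWebDriver driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(driver.PageSource);
            return doc;
        }
    }
}
```

The returned document can then be queried with XPath as usual, for example `doc.DocumentNode.SelectNodes("//a[@href]")`.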

score -7 · Accepted Answer

Use the following method of the HtmlAgilityPack document:

_htmlAgilityPackDocument.LoadHtml(this.browser.DocumentText);

or

if (this.browser.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.browser.Document.GetElementsByTagName("html")[0].OuterHtml);
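Once the markup is available as a string, it can be queried with XPath regardless of where it came from. The sketch below is an illustration only: `RenderedHtmlDemo` and `FirstText` are hypothetical names, and the HTML strings used with it are made-up stand-ins for what a browser control might return. It relates to the "tbody" point in the question: HtmlAgilityPack parses the markup it is given and does not add a `tbody` element on its own, so an XPath that includes `tbody` only matches if the string already contains that tag.

```csharp
using System;
using HtmlAgilityPack;

public static class RenderedHtmlDemo
{
    // Hypothetical helper: parses an HTML string and returns the trimmed inner
    // text of the first node matching the XPath, or null if nothing matches.
    public static string FirstText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }
}
```

For example, `FirstText("<table><tbody><tr><td>42</td></tr></tbody></table>", "//table/tbody/tr/td")` finds the cell, while the same XPath against `"<table><tr><td>42</td></tr></table>"` finds nothing. An XPath such as `//table//tr/td` works for both forms.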
