c# - Scraping a website with HtmlAgilityPack: the GET response is not what I expected

Question

I would like to mimic, in code, a user search on the following search engine using System.Net.HttpWebRequest.

http://www.scirus.com

An example search URL is:

http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s

I have the following code to perform the HTTP GET. Note that I am using HtmlAgilityPack.

protected override HtmlDocument MakeRequestHtml(string requestUrl)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(requestUrl);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

        // Dispose the response and its stream once the document has been parsed.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.Load(stream);
            return htmlDoc;
        }
    }
    catch (Exception e)
    {
        Console.WriteLine(e.Message);
        Console.Read();
        return null;
    }
}

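Here, "requestUrl" is the example search URL above.

The content of htmlDoc.DocumentNode.InnerHtml does not contain the search results, and it looks nothing like the results page I get when I copy and paste the example search URL into a browser.

I suspect this is because a session has to exist before the request can be made. Can anyone advise whether there is a workable way to reproduce the browser's behaviour? Or is there a better way, one I am not aware of, to achieve my goal of "scraping" the search results? Suggestions please.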
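
If a session cookie really is what the site expects, one way to carry it from one request to the next is a shared CookieContainer. The following is only a minimal sketch: it assumes (unverified) that the site issues its session cookie on an earlier page such as the home page, and the names SessionSketch, CreateRequest and ReadBody are hypothetical helpers, not part of HtmlAgilityPack or .NET.

```csharp
using System;
using System.IO;
using System.Net;

static class SessionSketch
{
    // Hypothetical helper: builds a GET request that uses a shared cookie jar.
    public static HttpWebRequest CreateRequest(string url, CookieContainer jar)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0";
        // Cookies stored in the jar by earlier responses are sent with this request,
        // and cookies set by this response are stored back into the jar.
        request.CookieContainer = jar;
        return request;
    }

    // Hypothetical helper: sends the request and returns the response body as text.
    public static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A possible use would be to create one CookieContainer, call ReadBody(CreateRequest("http://www.scirus.com", jar)) first so that any cookies are collected, then request the search URL with the same jar and pass the returned string to htmlDoc.LoadHtml. Whether this is enough for this particular site is an assumption that would need to be checked.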

Contents of robots.txt:

# / robots.txt file for http://www.scirus.com

User-agent: NetMechanic
Disallow: /srsapp/sciruslink

User-agent: *
Disallow: /srsapp/sciruslink
Disallow: /srsapp/search
Disallow: /srsapp/search_simple
Disallow: /search_simple
# for dev and accept server uncomment below line at Build time to disallow robots completely
##Disallow: /

Contents of htmlDoc.DocumentNode.InnerHtml

score 1 · Accepted Answer

You probably need to set the user agent.

request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

You should also check the site's robots.txt file to make sure you are welcome.
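
The robots.txt check can be sketched in code as well. This is a deliberately simplified illustration, not a complete robots.txt parser: it only looks at the "User-agent: *" group, treats each Disallow value as a plain path prefix, and ignores Allow rules, wildcards and consecutive User-agent lines that share one group. The names RobotsSketch and IsDisallowed are hypothetical.

```csharp
using System;

static class RobotsSketch
{
    // Returns true when a Disallow rule in the "User-agent: *" group
    // is a prefix of the given path.
    public static bool IsDisallowed(string robotsTxt, string path)
    {
        bool inStarGroup = false;
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash); // strip comments
            line = line.Trim();

            int colon = line.IndexOf(':');
            if (colon < 0) continue; // blank line or not a "field: value" line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                inStarGroup = value == "*";
            }
            else if (field == "disallow" && inStarGroup && value.Length > 0
                     && path.StartsWith(value, StringComparison.Ordinal))
            {
                return true;
            }
        }
        return false;
    }
}
```

Applied to the robots.txt quoted in the question, this sketch reports /srsapp/search as disallowed for generic crawlers, which is the path the example search URL uses.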

score 1 · Accepted Answer

OK, I actually tested this with WebClient:

    static void Main(string[] args)
    {
        using (WebClient client = new WebClient())
        {
            client.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0");
            string str = client.DownloadString("http://www.scirus.com/srsapp/search?q=core+facilities&t=all&sort=0&g=s");
            // Write the page as text; File.WriteAllText closes the file and keeps non-ASCII characters.
            File.WriteAllText("test.txt", str);
        }
    }

And here is the downloaded file: http://pastebin.com/qswtgC4n

score -1 · Accepted Answer

Make sure you are not pinging the server excessively, especially if the code that loads the document used to work. You may have run into a server rule that redirects you to robots.txt or a similar page.
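
One way to avoid hitting the server too often is to enforce a minimum pause between requests. The sketch below is an assumption about how that could be done, not something the site documents: RequestThrottle is a hypothetical helper, and a suitable interval would have to be chosen by the caller.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper: blocks so that successive calls to Wait()
// are at least minInterval apart. Not thread-safe.
sealed class RequestThrottle
{
    private readonly TimeSpan minInterval;
    private readonly Stopwatch sinceLast = new Stopwatch();

    public RequestThrottle(TimeSpan minInterval)
    {
        this.minInterval = minInterval;
    }

    public void Wait()
    {
        if (sinceLast.IsRunning)
        {
            // Sleep for whatever part of the interval has not yet elapsed.
            TimeSpan remaining = minInterval - sinceLast.Elapsed;
            if (remaining > TimeSpan.Zero) Thread.Sleep(remaining);
        }
        sinceLast.Restart();
    }
}
```

Calling Wait() immediately before each request returns at once the first time and pauses on later calls only when the previous request was too recent.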
