php - Problem scraping the "next" page

Question

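I'm trying to use Simple HTML DOM to scrape product data from a Zen-Cart store, one product section at a time. I can get the data from the first page without any trouble, but when I try to load the "next" page of products, the site returns the index.php landing page instead.

If I call the function directly with *http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36&sort=20a&page=2*, it scrapes the product information from page 2 correctly.

The same thing happens when I use cURL.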

getPrices('http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36');

function getPrices($sectionURL) {

    $opts = array('http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" .
                    "Cookie: zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n"
    ));
    $context = stream_context_create($opts);

    $html = file_get_contents($sectionURL, false, $context);
    $dom = new simple_html_dom();
    $dom->load($html);

    //Do cool stuff here with information from page.. product name, image, price and more info URL

    if ($nextPage = $dom->find('a[title= Next Page ]', 0)) {
        $nextPageURL = $nextPage->href;
        echo $nextPageURL;
        $dom->clear();
        unset($dom);
        getPrices($nextPageURL);
    } else {
        echo "\nNo more pages to scrape!!";
        $dom->clear();
        unset($dom);
    }

}

Any ideas on how to solve this?

score 0 · Accepted Answer

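It turned out that the next-page URL being passed back into the function contained &amp;amp; instead of &amp;, and file_get_contents didn't like that. The href is read straight from the HTML source, where ampersands are entity-encoded, so the server received parameters named amp;cPath and amp;page and fell back to the landing page. Decoding the URL before requesting it fixed the problem: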

$sectionURL = str_replace( "&amp;", "&", urldecode(trim($sectionURL)) );
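
A slightly more general variant of the same idea, as a minimal sketch: html_entity_decode handles &amp;amp; along with any other entities that might show up in an href. The helper name normalizeHref and the example.com URL are placeholders of mine, not part of the original code.

```php
<?php
// Hypothetical helper: turn an href copied from HTML source into a requestable URL.
function normalizeHref(string $href): string
{
    // Strip surrounding whitespace, then decode entities such as &amp; back to "&".
    return html_entity_decode(trim($href), ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

// Placeholder URL for illustration only.
echo normalizeHref(' http://example.com/index.php?main_page=index&amp;cPath=36&amp;page=2 '), PHP_EOL;
```

In the question's code this would be applied to $nextPage->href before the recursive call to getPrices.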

score 0 · Accepted Answer

There are a lot of potential culprits. Most likely simple_html_dom is letting you down because it doesn't keep track of cookies or set a referer.
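
If cookies or the referer do turn out to matter, one way to carry them between requests with plain file_get_contents is sketched below. This is only an illustration under my own assumptions: extractCookies and buildContext are hypothetical helpers, and the sketch ignores cookie attributes such as path, domain, and expiry.

```php
<?php
// Hypothetical helper: collect "name=value" pairs from Set-Cookie response headers.
function extractCookies(array $responseHeaders): string
{
    $pairs = [];
    foreach ($responseHeaders as $line) {
        if (stripos($line, 'Set-Cookie:') === 0) {
            // Keep only the part before the first ';' (drops Path, Expires, etc.).
            $value = trim(substr($line, strlen('Set-Cookie:')));
            $pairs[] = trim(explode(';', $value, 2)[0]);
        }
    }
    return implode('; ', $pairs);
}

// Hypothetical helper: build a stream context that sends the cookies and a Referer.
function buildContext(string $cookies, string $referer)
{
    $headers = "Accept-language: en\r\n";
    if ($cookies !== '') {
        $headers .= "Cookie: $cookies\r\n";
    }
    if ($referer !== '') {
        $headers .= "Referer: $referer\r\n";
    }
    return stream_context_create(['http' => ['method' => 'GET', 'header' => $headers]]);
}

// Usage sketch (performs a real request, so it is left commented out):
// $html    = file_get_contents($url, false, buildContext($cookies, $previousUrl));
// $cookies = extractCookies($http_response_header);
```

After each file_get_contents call over HTTP, PHP fills $http_response_header in the calling scope, which is where the Set-Cookie lines would come from.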

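My recommendation is to proxy your requests through Fiddler or Charles, compare them with what the browser sends, and make them look as if they came from a browser.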
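
A minimal sketch of routing file_get_contents through such a proxy, assuming it listens on 127.0.0.1:8888 (check the address and port your tool is actually configured with). The helper name buildProxyContext is hypothetical.

```php
<?php
// Hypothetical helper: send file_get_contents() traffic through a local debugging proxy.
// The default address is an assumption; adjust it to your proxy's settings.
function buildProxyContext(string $proxy = 'tcp://127.0.0.1:8888')
{
    return stream_context_create([
        'http' => [
            'proxy'           => $proxy, // address of the debugging proxy
            'request_fulluri' => true,   // send the full URL in the request line
        ],
    ]);
}

// Usage sketch (performs a real request, so it is left commented out):
// $html = file_get_contents($sectionURL, false, buildProxyContext());
```

With the proxy running, the PHP request and the browser request for the same page can be compared side by side.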
