php - What is the fastest way to scrape a large number of pages in PHP?

Question

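I have a data aggregator that relies on scraping several sites and indexing their information so that users can search it.

I need to be able to scrape a vast number of pages every day, and I have run into problems using simple curl requests: they become fairly slow when executed in rapid sequence for a long time (the scraper basically runs 24/7).

Running a multi curl request in a simple while loop is fairly slow. I sped it up by running individual curl requests in background processes, but sooner or later the slower requests pile up and crash the server.

Is there a more efficient way of scraping data? Perhaps command-line curl?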

score 2 · Accepted Answer

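With a large number of pages you will need some kind of multithreaded approach, because you will spend most of your time waiting on network I/O.

Last time I played with PHP threads they were not all that great an option, but perhaps that has changed. If you need to stick with PHP, you will be forced into a multi-process approach: split your workload into N work units, and run N instances of your script, each of which receives one work unit.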
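As a rough sketch of the "N work units, N instances" idea, the snippet below computes the slices and builds one launch command per slice. The script name `worker.php` and its offset/length arguments are hypothetical and not part of the original answer; the commands are only printed here, not executed.

```php
<?php
// Split $total work items into at most $n roughly equal work units.
// Returns a list of [offset, length] pairs, one per worker.
function split_into_units(int $total, int $n): array
{
    if ($total <= 0 || $n <= 0) {
        return [];
    }
    $size  = (int) ceil($total / $n);
    $units = [];
    for ($offset = 0; $offset < $total; $offset += $size) {
        $units[] = [$offset, min($size, $total - $offset)];
    }
    return $units;
}

// Build the shell command for one worker. "worker.php" is a hypothetical
// script assumed to read its offset and length from the command line.
function worker_command(int $offset, int $length): string
{
    return sprintf(
        'php worker.php %s %s > /dev/null 2>&1 &',
        escapeshellarg((string) $offset),
        escapeshellarg((string) $length)
    );
}

// Example: 10 URLs spread over 4 workers.
foreach (split_into_units(10, 4) as [$offset, $length]) {
    echo worker_command($offset, $length), "\n";
    // To actually launch a worker: exec(worker_command($offset, $length));
}
```

Handing each worker an offset and a length, rather than a list of URLs, keeps the command line short and lets every worker read the same shared URL source.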

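Languages that provide robust, good thread implementations are another option. I have had good experiences with threads in Ruby and C, and Java threads also seem very mature and reliable.

PHP threads may have improved since the last time I played with them (~4 years ago) and may be worth a look.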

score 0 · Accepted Answer

If you want to run single curl requests, you can start background processes under Linux from PHP like this:

proc_close(proc_open("php -q yourscript.php parameter1 parameter2 > /dev/null 2>&1 &", array(), $dummy));

You can use the parameters to tell your PHP script which URLs to use, much like LIMIT in SQL.
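On the worker side, the LIMIT-style parameters might be consumed roughly as follows. This is only a sketch: the function name `select_slice`, the argument order (offset, then count), and the placeholder URL list are assumptions, not something the answer specifies.

```php
<?php
// Hypothetical worker, invoked as: php yourscript.php <offset> <count>
// Picks this worker's slice of the URL list, like SQL's LIMIT offset, count.
function select_slice(array $urls, array $argv): array
{
    $offset = isset($argv[1]) ? max(0, (int) $argv[1]) : 0;
    $count  = isset($argv[2]) ? max(0, (int) $argv[2]) : count($urls);
    return array_slice($urls, $offset, $count);
}

// Placeholder data; a real scraper would load this from a database or file.
$all_urls = [
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/c',
    'http://example.com/d',
];

// $_SERVER['argv'] holds the command-line arguments when run from the CLI.
$args = $_SERVER['argv'] ?? [];

foreach (select_slice($all_urls, $args) as $url) {
    echo "would fetch: $url\n";
}
```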

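You can keep track of the running processes by saving their PIDs somewhere. That lets you keep a desired number of processes running at the same time, or kill processes that have not finished in time.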
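One way to sketch that bookkeeping is to keep the `proc_open` handles in an array instead of raw PIDs, which is a substitution for what the answer describes. The function name `run_pool` and its parameters are hypothetical. The sketch assumes PHP 7.4 or later, since it passes commands as arrays so that no shell sits between PHP and the child process.

```php
<?php
// Run commands with at most $max_parallel alive at once, and kill any that
// run longer than $timeout seconds. Each command is an array such as
// ['php', 'yourscript.php', '0', '100']. Returns, per input key, the exit
// code, or the string 'killed' or 'failed to start'.
function run_pool(array $commands, int $max_parallel, float $timeout): array
{
    $max_parallel = max(1, $max_parallel);
    $pending = $commands;
    $running = [];   // key => ['proc' => handle, 'started' => timestamp]
    $results = [];

    while ($pending || $running) {
        // Top up the pool while there is room.
        while ($pending && count($running) < $max_parallel) {
            $key = array_key_first($pending);
            $cmd = $pending[$key];
            unset($pending[$key]);

            $proc = proc_open($cmd, [], $pipes);
            if ($proc === false) {
                $results[$key] = 'failed to start';
                continue;
            }
            $running[$key] = ['proc' => $proc, 'started' => microtime(true)];
        }

        // Reap finished processes and kill the ones that overran.
        foreach ($running as $key => $job) {
            $status = proc_get_status($job['proc']);
            if (!$status['running']) {
                $results[$key] = $status['exitcode'];
                proc_close($job['proc']);
                unset($running[$key]);
            } elseif (microtime(true) - $job['started'] > $timeout) {
                proc_terminate($job['proc']);
                proc_close($job['proc']);
                $results[$key] = 'killed';
                unset($running[$key]);
            }
        }

        usleep(50000); // poll every 50 ms instead of spinning
    }

    return $results;
}
```

A call such as `run_pool($commands, 8, 60.0)` would then keep eight workers busy and cut off any that take longer than a minute, which is what stops slow requests from piling up.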

score 0 · Accepted Answer

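In my experience, running a curl_multi request with a fixed number of threads is the fastest way. Could you share the code you are using so we can suggest improvements? This answer has a fairly decent implementation of curl_multi with a threaded approach; here is the reproduced code: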

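// -- number of handles to run simultaneously
define('BLOCK_SIZE', 8);

// -- create all the individual cURL handles and set their options
// ($urls is assumed to be an array of URL strings defined elsewhere)
$curl_handles = array();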
foreach ($urls as $url) {
    $curl_handles[$url] = curl_init();
    curl_setopt($curl_handles[$url], CURLOPT_URL, $url);
    // set other curl options here
}

// -- start going through the cURL handles and running them
$curl_multi_handle = curl_multi_init();

$i = 0; // count where we are in the list so we can break up the runs into smaller blocks
$block = array(); // to accumulate the curl_handles for each group we'll run simultaneously

foreach ($curl_handles as $a_curl_handle) {
    $i++; // increment the position-counter

    // add the handle to the curl_multi_handle and to our tracking "block"
    curl_multi_add_handle($curl_multi_handle, $a_curl_handle);
    $block[] = $a_curl_handle;

    // -- check to see if we've got a "full block" to run or if we're at the end of our list of handles
    if (($i % BLOCK_SIZE == 0) or ($i == count($curl_handles))) {
        // -- run the block

        $running = NULL;
        do {
            // track the previous loop's number of handles still running so we can tell if it changes
            $running_before = $running;

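            // run the block or check on the running block and get the number of sites still running in $running
            curl_multi_exec($curl_multi_handle, $running);

            // wait for activity on any handle instead of spinning the CPU in a tight loop
            if ($running > 0) {
                curl_multi_select($curl_multi_handle, 1.0);
            }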

            // if the number of sites still running changed, print out a message with the number of sites that are still running.
            if ($running != $running_before) {
                echo("Waiting for $running sites to finish...\n");
            }
        } while ($running > 0);

        // -- once the number still running is 0, curl_multi_ is done, so check the results
        foreach ($block as $handle) {
            // HTTP response code
            $code = curl_getinfo($handle,  CURLINFO_HTTP_CODE);

            // cURL error number
            $curl_errno = curl_errno($handle);

            // cURL error message
            $curl_error = curl_error($handle);

            // output if there was an error
            if ($curl_error) {
                echo("    *** cURL error: ($curl_errno) $curl_error\n");
            }

            // remove the (used) handle from the curl_multi_handle
            curl_multi_remove_handle($curl_multi_handle, $handle);
        }

        // reset the block to empty, since we've run its curl_handles
        $block = array();
    }
}

// close the curl_multi_handle once we're done
curl_multi_close($curl_multi_handle);

The trick is to not load too many URLs at once; if you do, the whole process will hang until the slower requests complete. I suggest using a BLOCK_SIZE of 8 or greater if you have the bandwidth.
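A variation on the block approach, not taken from the answer above, is a rolling window: rather than waiting for a whole block to finish, a new URL is started as soon as any transfer completes, so one slow site occupies only one slot. The following is a minimal sketch under that assumption. The function name `fetch_rolling` and its parameters are hypothetical, and results are keyed by URL, so duplicate URLs would overwrite each other.

```php
<?php
// Fetch URLs with at most $window transfers in flight at any time.
// Returns url => ['code' => int, 'error' => ?string, 'body' => ?string].
function fetch_rolling(array $urls, int $window = 8, int $timeout = 30): array
{
    $window    = max(1, $window);
    $queue     = array_values($urls);
    $results   = [];
    $in_flight = 0;
    $mh        = curl_multi_init();

    while ($queue || $in_flight > 0) {
        // Top up the window with fresh handles.
        while ($queue && $in_flight < $window) {
            $url = array_shift($queue);
            $ch  = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
            curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // cap slow transfers
            curl_setopt($ch, CURLOPT_PRIVATE, $url);        // remember which URL this is
            curl_multi_add_handle($mh, $ch);
            $in_flight++;
        }

        curl_multi_exec($mh, $still_running);

        // Harvest every transfer that has finished so far.
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = [
                'code'  => curl_getinfo($ch, CURLINFO_HTTP_CODE),
                'error' => $info['result'] === CURLE_OK
                    ? null
                    : curl_strerror($info['result']),
                'body'  => curl_multi_getcontent($ch),
            ];
            curl_multi_remove_handle($mh, $ch);
            $in_flight--;
        }

        // Sleep until there is network activity rather than busy-looping.
        if ($in_flight > 0 && curl_multi_select($mh, 1.0) === -1) {
            usleep(10000);
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

The per-transfer `CURLOPT_TIMEOUT` matters as much as the window size: without it, a handful of stalled connections can still fill every slot.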
