php - How to crawl a site with server-generated content?

Question

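I'm writing a simple PHP crawler that fetches data from a website and inserts it into a database. I start from a predefined URL, go through the page's content (obtained with PHP's file_get_contents), and then call file_get_contents on the links found in that page. The URLs I extract from the links are fine: when I echo them and open them directly in a browser, the pages load correctly. However, when I fetch them with file_get_contents and echo the result, the page does not display correctly because of errors related to the site's dynamically generated server-side data. The echoed content is missing the listed data I need from the server, because the resources the site requires cannot be found.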

It seems that the relative paths in the echoed web page prevent the desired content from being generated.
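One way to check the relative-path hypothesis is to insert a `<base>` element into the fetched HTML before echoing it, so the browser resolves relative URLs against the original site instead of the host running the crawler. The sketch below is a minimal illustration; `add_base_href` is a hypothetical helper that is not part of the code in this post. It can only help with relative stylesheets, scripts, and images. Data that the site loads through script requests may still be blocked by the browser when the page is served from another origin.

```php
<?php
// Hypothetical helper (not from the original post): insert a <base> tag so that
// relative URLs in the echoed copy resolve against the original site.
function add_base_href(string $html, string $base_url): string
{
    $tag = '<base href="' . htmlspecialchars($base_url, ENT_QUOTES) . '">';

    // Append the tag right after the first opening <head ...> tag.
    // A callback is used so the URL is never interpreted as a backreference.
    $patched = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1,
        $count
    );

    // No <head> tag found (or the regex failed): prepend the tag instead.
    return ($patched !== null && $count > 0) ? $patched : $tag . $html;
}
```

Inside `crawl_courses`, this could be used as `echo add_base_href($subj_page, 'https://whatever.com/');`, assuming `https://whatever.com/` is the site root.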

Can anyone here point me in the right direction?

Any help is greatly appreciated!

Here is some of my code so far:

function crawl_all($url)
{
    $main_page = file_get_contents($url);
    if ($main_page === false) {
        return; // the request failed, nothing to crawl
    }

    // strpos() returns false when there is no match and 0 for a match at offset 0,
    // so compare with false instead of testing "> 0"
    while (($subj_start = strpos($main_page, '"fl"')) !== false)
    {
        $main_page  = substr($main_page, $subj_start);      // cut off everything before subject row
        $link_start = strpos($main_page, 'href');           // get the start of the subject link
        if ($link_start === false) {
            break;                                          // no link follows this row
        }
        $main_page   = substr($main_page, $link_start + 6); // skip past href=" (6 characters)
        $link_length = strpos($main_page, '">');            // the link ends just before ">
        if ($link_length === false) {
            break;                                          // unterminated link, stop parsing
        }
        $link = substr($main_page, 0, $link_length);        // get the subject link

        crawl_courses('https://whatever.com' . $link);
    }
}

/* Crawls all the courses for a subject. */
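function crawl_courses($url)
{
    $subj_page = file_get_contents($url);
    if ($subj_page === false) {
        return; // the request failed, nothing to crawl
    }
    echo $url;           // website looks fine when opened in a browser
    echo $subj_page;     // when echoed, the page does not contain most of the server-side generated data I need

    // compare with false: strpos() returns 0 for a match at the very start
    while (($course_start = strpos($subj_page, '<td><a href')) !== false)
    {
        $subj_page   = substr($subj_page, $course_start);   // cut off everything before course cell
        $link_start  = strpos($subj_page, 'href') + 6;      // skip past href=" (6 characters)
        $subj_page   = substr($subj_page, $link_start);     // cut off everything before course link
        $link_length = strpos($subj_page, '">');            // the link ends just before ">
        if ($link_length === false) {
            break;                                          // unterminated link, stop parsing
        }
        $link = substr($subj_page, 0, $link_length);        // get the course link

        //crawl_professors('https://whatever.com' . $link);
    }
}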

score 0 · Accepted Answer

Try the Advanced HTML DOM parser. You can find it here: http://sourceforge.net/projects/advancedhtmldom/
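To illustrate the DOM-parser approach, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes rather than the Advanced HTML DOM library linked above, whose API is not shown here. The function name `extract_links` and the XPath expression in the usage note are assumptions, chosen to mirror the `<td><a href` pattern in the question. Keep in mind that any parser only sees the HTML that file_get_contents returns, so content the site adds with JavaScript after loading will not be present.

```php
<?php
// Hypothetical helper (not from the original post): collect the href of every
// anchor element matched by an XPath query and return them as absolute URLs.
function extract_links(string $html, string $base_url, string $xpath_query): array
{
    if (trim($html) === '') {
        return []; // nothing to parse (loadHTML rejects empty input)
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid: buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpath_query);
    if ($nodes === false) {
        return []; // malformed XPath expression
    }

    $links = [];
    foreach ($nodes as $anchor) {
        if (!$anchor instanceof DOMElement) {
            continue; // the query is expected to select elements
        }
        $href = trim($anchor->getAttribute('href'));
        if ($href === '') {
            continue;
        }
        // Assumption: hrefs are either absolute or root-relative ("/path"),
        // which is what 'https://whatever.com' . $link in the question implies.
        $links[] = preg_match('#^https?://#i', $href)
            ? $href
            : rtrim($base_url, '/') . '/' . ltrim($href, '/');
    }
    return $links;
}
```

For the course pages in the question, a call might look like `extract_links($subj_page, 'https://whatever.com', '//td/a[@href]')`, which replaces the strpos/substr loop with a single query.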
