php - Screen scraping using PHP and fopen

Question

Possible duplicate:
Screen scraping in PHP using file_get_contents

Can somebody help me? I am trying to scrape hotel reviews from LateRooms.com. I already have permission to do so as an affiliate, so please don't tell me it's a bad idea.

My code:

<?php
header('content-type: text/plain');

$contents = file_get_contents('http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx');
$contents = preg_replace('/\s(1,)/', ' ', $contents);

print $contents . "\n";

$records = preg_split('/<div id="review/', $contents);

for ($ix = 1; $ix < count($records); $ix++) {

$tmp = $records[$ix];

preg_match('/id="review"/', $tmp, $match_reviews);

print_r($match_reviews);

exit();

}
?>

This works really well. The only problem is that it fetches the code for the entire page and does not match the div with id 'review'.

Thanks in advance

score 3 · Accepted Answer

// Fetch a URL with cURL and return the response body as a string.
function file_get_contents_curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);          // leave response headers out of the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow HTTP redirects

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

// Return the inner HTML of a DOM element by serializing each child node.
function DOMinnerHTML($element){
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
$url  = 'http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx';
$html = file_get_contents_curl($url);

// Parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings about malformed real-world HTML
$div_elements = $doc->getElementsByTagName('div');

$reviews = array(); // initialize so print_r() works even when nothing matches
foreach ($div_elements as $div_element) {
    // exact match on the full value of the class attribute
    if ($div_element->getAttribute('class') == 'review newReview'){
        $reviews[] = DOMinnerHTML($div_element);
    }
}

print_r($reviews);

Try this. It returns all of the reviews. You can narrow down the content according to your requirements.
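
As one possible way to narrow the selection, here is a minimal sketch that uses DOMXPath instead of comparing the whole class attribute, so a div still matches when its classes appear in a different order or alongside extra ones. The sample HTML is invented for illustration and is only an assumption about the general shape of such a page; the real LateRooms markup may differ, and in practice `$html` would come from `file_get_contents_curl($url)`.

```php
<?php
// Sketch only: this sample markup is hypothetical, not the real page.
$html = <<<HTML
<html><body>
<div class="review newReview"><p>Lovely stay.</p></div>
<div class="sidebar">Not a review</div>
<div class="newReview review featured"><p>Great breakfast.</p></div>
</body></html>
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every div whose class list contains the token "review",
// regardless of token order or additional classes.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " review ")]';

$reviews = array();
foreach ($xpath->query($query) as $div) {
    // textContent is the element's text with all markup removed.
    $reviews[] = trim($div->textContent);
}

print_r($reviews);
```

With the sample markup above, `$reviews` ends up holding the text of the two review divs and skips the sidebar. If you need the markup rather than plain text, `DOMinnerHTML($div)` from the answer can be used in place of `textContent`.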
