php - How do I implement a screen scraper in PHP?

Question

I have a user ID and password for logging in to a website through my program. After logging in, the URL changes from http://localhost/Test/loginpage.html to http://www.4wtech.com/csp/web/Employee/Login.csp.

How can I "screen scrape" the data from the second URL using PHP?

score 4 · Accepted Answer

Use cURL. cURL can log in to the page, then access the newly referenced page and download it in full.

Check the PHP manual for curl and this tutorial: How to screen-scrape with PHP and Curl.
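
As a rough sketch of that flow (not a definitive implementation), the code below posts the credentials with cURL, keeps the session cookie on the same handle, and then downloads a second page. The names buildLoginOptions and loginAndFetch are made up for illustration, and the sketch assumes an ordinary cookie-based session. The real form field names, and the URL the form posts to, have to be read from the login page's HTML.

```php
<?php
// Hypothetical helper: builds the cURL options for the login POST.
function buildLoginOptions(string $loginUrl, array $fields): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields),
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow the redirect issued after login
        CURLOPT_COOKIEFILE     => '',   // empty string turns on the in-memory cookie engine
    ];
}

// Hypothetical helper: logs in, then fetches $targetUrl with the same session.
function loginAndFetch(string $loginUrl, array $fields, string $targetUrl): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildLoginOptions($loginUrl, $fields));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Reuse the handle so the cookies received during login are sent again.
    curl_setopt_array($ch, [
        CURLOPT_URL     => $targetUrl,
        CURLOPT_HTTPGET => true, // switch the handle back from POST to GET
    ]);
    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }

    return $page; // the handle is released when $ch goes out of scope
}
```

A call such as loginAndFetch('http://www.4wtech.com/csp/web/Employee/Login.csp', ['userid' => $userId, 'password' => $password], $pageToScrape) would then return the HTML of the page to scrape. The field names userid and password are placeholders.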

score 3 · Accepted Answer

I'm not sure I understood your question. But if you really intend to do screen scraping in PHP, I recommend the simple_html_dom parser. It is a small library that lets you use CSS selectors in PHP. For me, screen scraping in PHP has never been easier. Here is an example:

// Create DOM from URL or file
$html = file_get_html('http://stackoverflow.com/');

// Find all links
foreach($html->find('a') as $element) {
       echo $element->href . '<br>';
}

score 0 · Accepted Answer

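Important!

Be aware that scraping is not always allowed. If you decide to scrape a page, make sure the owner of that page has given you permission to do so. Otherwise you could end up doing something illegal.

Assuming you are allowed to scrape the page, apply the following steps.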

HTTP request

First, make an HTTP request to get the content of the page. There are several ways to do that.

fopen

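The most basic way to send an HTTP request is to use fopen. Its main advantage is that you can set how many characters are read at a time, which is useful when reading very large files. It is not the easiest thing to get right, however, and I don't recommend it unless you are reading very large files and are worried about running into memory issues.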

$fp = fopen("http://www.4wtech.com/csp/web/Employee/Login.csp", "rb");
if (FALSE === $fp) {
    exit("Failed to open stream to URL");
}

$result = '';

while (!feof($fp)) {
    $result .= fread($fp, 8192);
}
fclose($fp);
echo $result;

file_get_contents

The easiest way is to use file_get_contents. It is more or less the same as fopen, but gives you fewer options to choose from. The main advantage here is that it needs only one line of code.

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
echo $result;

Sockets

If you need more control over the headers that are sent to the server, you can use sockets through fsockopen.

// fsockopen takes the host name only; the path goes in the request line
$fp = fsockopen("www.4wtech.com", 80, $errno, $errstr, 30);
if (!$fp) {
    $result = "$errstr ($errno)<br />\n";
} else {
    $result = '';
    $out = "GET /csp/web/Employee/Login.csp HTTP/1.1\r\n";
    $out .= "Host: www.4wtech.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        $result .= fgets($fp, 128);
    }
    fclose($fp);
}
echo $result;

Streams

Alternatively, you can use streams. Streams are similar to sockets and can be used in combination with both fopen and file_get_contents.

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

$result = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
echo $result;
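
The same mechanism can also submit a login form. Here is a minimal sketch, assuming the site accepts a plain URL-encoded POST. The helper name buildPostOptions and the field names in the usage comment are made up for illustration.

```php
<?php
// Hypothetical helper: builds stream context options for a form POST.
function buildPostOptions(array $fields, array $cookies = []): array
{
    $headers = ['Content-Type: application/x-www-form-urlencoded'];
    if ($cookies !== []) {
        $pairs = [];
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return [
        'http' => [
            'method'  => 'POST',
            'header'  => implode("\r\n", $headers) . "\r\n",
            'content' => http_build_query($fields),
            'timeout' => 30, // seconds
        ],
    ];
}

// Usage (performs a real request, so it is left commented out):
// $context = stream_context_create(buildPostOptions(['userid' => $userId, 'password' => $password]));
// $result  = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp', false, $context);
```

Unlike cURL's cookie engine, the stream wrapper does not remember cookies between calls, so a session cookie taken from the response headers has to be passed back in through the $cookies argument on the next request.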

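cURL

If your server supports cURL (it usually does), I recommend using it. A key advantage of cURL is that it relies on a popular C library that is commonly used from other programming languages as well. It also provides a convenient way to build request headers, and it parses response headers automatically, with a simple interface for handling errors.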

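// Per-request options; entries here take precedence over $defaults
$options = array();

$defaults = array(
    CURLOPT_URL => "http://www.4wtech.com/csp/web/Employee/Login.csp",
    CURLOPT_HEADER => 0,
    CURLOPT_RETURNTRANSFER => true // return the body instead of printing it
);

$ch = curl_init();
curl_setopt_array($ch, ($options + $defaults));
$result = curl_exec($ch);
if (false === $result) {
    trigger_error(curl_error($ch));
}
curl_close($ch);
echo $result;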

Libraries

Alternatively, you can use one of the many PHP libraries. I wouldn't recommend using a library, though. In most cases you are better off writing your own HTTP class that uses cURL under the hood.
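
To give an idea of how small such a class can be, here is a minimal sketch. The class name SimpleHttpClient and its methods are hypothetical, and a real class would likely also need POST support, cookies, and access to the response headers.

```php
<?php
// Hypothetical minimal HTTP class built on cURL.
final class SimpleHttpClient
{
    /** @var array cURL options applied to every request */
    private $defaults;

    public function __construct(array $defaults = [])
    {
        // Caller-supplied defaults win over the built-in ones.
        $this->defaults = $defaults + [
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT        => 30,
        ];
    }

    // Merges options; for the array union operator the leftmost value wins.
    public function optionsFor(string $url, array $overrides = []): array
    {
        return [CURLOPT_URL => $url, CURLOPT_RETURNTRANSFER => true]
            + $overrides
            + $this->defaults;
    }

    // Performs a GET request and returns the response body.
    public function get(string $url, array $overrides = []): string
    {
        $ch = curl_init();
        curl_setopt_array($ch, $this->optionsFor($url, $overrides));
        $body = curl_exec($ch);
        if ($body === false) {
            throw new RuntimeException('cURL error: ' . curl_error($ch));
        }
        return $body;
    }
}
```

With this in place, (new SimpleHttpClient())->get('http://www.4wtech.com/csp/web/Employee/Login.csp') returns the page body as a string or throws on a transport error.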

HTML parsing

PHP has a convenient way to load any HTML into a DOMDocument.

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$doc = new DOMDocument();
$doc->loadHTML($pagecontent);
echo $doc->saveHTML();
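
Once the HTML is in a DOMDocument, DOMXPath can pull out the parts you need. Here is a small sketch that collects every link on a page. The function name extractLinks is made up for illustration, and libxml warnings are silenced on the assumption that real-world pages are rarely valid markup.

```php
<?php
// Hypothetical helper: returns the href and text of every <a href="..."> element.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}
```

Passing $pagecontent from the example above to extractLinks gives an array of href/text pairs that can be filtered or followed in further requests.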

Unfortunately, PHP's support for HTML5 is limited. If you run into errors when trying to parse your page content, consider using a third-party library. For that, I can recommend Masterminds/html5-php. Parsing an HTML file with this library is very similar to parsing an HTML file with DOMDocument.

use Masterminds\HTML5;

$pagecontent = file_get_contents('http://www.4wtech.com/csp/web/Employee/Login.csp');
$html5 = new HTML5();
$dom = $html5->loadHTML($pagecontent);
echo $html5->saveHTML($dom);

Or you can use, for example, my library PHPPowertools/DOM-Query. It uses Masterminds/html5-php under the hood to parse an HTML5 string into a DomDocument, and symfony/DomCrawler to convert CSS selectors to XPath selectors. To ensure decent performance, it always uses the same DomDocument, even when passing one object to another.

namespace PowerTools;

// Get file content
$pagecontent = file_get_contents( 'http://www.4wtech.com/csp/web/Employee/Login.csp' );

// Define your DOMCrawler based on file string
$H = new DOM_Query( $pagecontent );

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query( $H->select('body') );

// Passing a string (CSS selector)
$s = $H->select( 'div.foo' );

// Passing an element object (DOM Element)
$s = $H->select( $documentBody );

// Passing a DOM Query object
$s = $H->select( $H->select('p + p') );

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

score 0 · Accepted Answer

Sorry for the plug, but I wrote JS_Extractor for screen scraping. It is really just a very simple extension of the DOM extension, with a few helper methods to make things a little easier, but it works very well.