php - Get content from a URL faster using PHP

Question

I am using PHP and want to fetch content from a URL faster.
Here is the code I use.
Code: (1)

<?php
    $content = file_get_contents('http://www.filehippo.com');
    echo $content;
?>

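There are many other ways to read a file, such as fopen() and readfile(), but I think file_get_contents() is faster than those.

When you run the code above, you will see that it outputs everything on the website, including images and ads. I want only the plain HTML text, without CSS styles, images, or ads. How can I get that?
See the following to understand what I mean.
Code: (2)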

<?php
    $content = file_get_contents('http://www.filehippo.com');
    // do something to remove css-style, images and ads.
    // return the plain html text in $mod_content.
    echo $mod_content;
?>

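Doing it this way is the wrong approach, because I have already fetched the full content into the variable $content and only then modify it.
Is there any function, method, or anything else that gets the plain HTML text directly from the URL?

The code below is written only to illustrate the idea; it is not real PHP code.
Ideal code: (3)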

<?php
    $plain_content = get_plain_html('http://www.filehippo.com');
    echo $plain_content; // no css-style, images and ads.
?>

If such a function existed, it would be much faster than the others. Is this possible?
Thanks.

score 4 · Accepted Answer

Try this.

// Hypothetical wrapper class: the original snippet used $this without showing its class.
class Mobilizer
{
    public $html;

    public function __construct($url)
    {
        // file_get_contents() returns false on failure; cast so preg_replace() always gets a string
        $this->html = (string) file_get_contents($url);
    }

    public function process()
    {
        // header
        $this->_replace('/.*<head>/ism', "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE html PUBLIC '-//WAPFORUM//DTD XHTML Mobile 1.0//EN' 'http://www.wapforum.org/DTD/xhtml-mobile10.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head>");

        // title
        $this->_replace('/<head>.*?(<title>.*<\/title>).*?<\/head>/ism', '<head>$1</head>');

        // strip out divs with little content
        $this->_stripContentlessDivs();

        // divs/p
        $this->_replace('/<div[^>]*>/ism', '');
        $this->_replace('/<\/div>/ism', '<br/><br/>');
        $this->_replace('/<p[^>]*>/ism', '');
        $this->_replace('/<\/p>/ism', '<br/>');

        // h tags
        $this->_replace('/<h[1-5][^>]*>(.*?)<\/h[1-5]>/ism', '<br/><b>$1</b><br/><br/>');

        // remove align/height/width/style/rel/id/class attributes
        $this->_replace('/\salign=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sheight=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\swidth=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sstyle=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\srel=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sid=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sclass=(\'?\"?).*?\\1/ism', '');

        // remove comments
        $this->_replace('/<\!--.*?-->/ism', '');

        // remove script/style
        $this->_replace('/<script[^>]*>.*?\/script>/ism', '');
        $this->_replace('/<style[^>]*>.*?\/style>/ism', '');

        // multiple \n
        $this->_replace('/\n{2,}/ism', '');

        // remove multiple <br/>
        $this->_replace('/(<br\s?\/?>){2}/ism', '<br/>');
        $this->_replace('/(<br\s?\/?>\s*){3,}/ism', '<br/><br/>');

        // tables
        $this->_replace('/<table[^>]*>/ism', '');
        $this->_replace('/<\/table>/ism', '<br/>');
        $this->_replace('/<(tr|td|th)[^>]*>/ism', '');
        $this->_replace('/<\/(tr|td|th)[^>]*>/ism', '<br/>');

        // wrap and close
    }

    // Placeholder: the implementation of this method is not shown in this answer
    // (see the phpmobilizer link below), so this stub does nothing.
    private function _stripContentlessDivs()
    {
    }

    // Apply one regex replacement to the stored HTML.
    private function _replace($pattern, $replacement, $limit = -1)
    {
        $this->html = preg_replace($pattern, $replacement, $this->html, $limit);
    }
}

$mobilizer = new Mobilizer('http://www.filehippo.com');
$mobilizer->process();
echo $mobilizer->html;

More details: https://code.google.com/p/phpmobilizer/
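
To see what a few of these rules do in isolation, here is a minimal sketch that applies a simplified subset of them to an inline sample instead of a live page. The function name simplify_fragment() and the sample markup are assumptions made for illustration; they are not part of phpmobilizer, and the patterns are slightly simplified versions of the ones above.

```php
<?php
// Hypothetical helper (not part of phpmobilizer): applies a small subset of
// the answer's rules to an HTML fragment, one after another.
function simplify_fragment(string $html): string
{
    $rules = [
        // drop <script> and <style> blocks together with their contents
        '/<script[^>]*>.*?<\/script>/is'        => '',
        '/<style[^>]*>.*?<\/style>/is'          => '',
        // drop HTML comments
        '/<!--.*?-->/s'                         => '',
        // drop quoted style/class/id attributes
        '/\s(?:style|class|id)=(["\']).*?\1/is' => '',
        // turn headings into bold text followed by a line break
        '/<h[1-6][^>]*>(.*?)<\/h[1-6]>/is'      => '<b>$1</b><br/>',
    ];

    // With array arguments, preg_replace() applies the patterns in order.
    $result = preg_replace(array_keys($rules), array_values($rules), $html);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$sample = '<h1 class="big" style="color:red">Title</h1><!-- ad -->'
        . '<style>h1{}</style><p id="x">Text</p>';

echo simplify_fragment($sample), "\n";
// prints: <b>Title</b><br/><p>Text</p>
```

Regular expressions like these are easy to fool with unusual markup (unquoted attributes, for example, are left untouched here), so treat the result as a rough simplification rather than a reliable HTML parser.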

score 0 · Accepted Answer

You can use regular expressions to remove the CSS/script tags and the image tags. Just replace that markup with an empty string.

preg_replace($pattern, $replacement, $string);

For details on the function, see http://php.net/manual/en/function.preg-replace.php.
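
As a rough sketch of that idea, the code below passes an array of patterns to preg_replace() and replaces every match with an empty string. The name get_plain_html() is taken from the question; strip_styles_and_images(), the pattern list, and the sample markup are assumptions made for illustration. Ad removal is not covered, since ads have no single tag to match.

```php
<?php
// Hypothetical helper: removes stylesheets, scripts, inline styles and images
// from an HTML string by replacing them with an empty string.
function strip_styles_and_images(string $html): string
{
    $patterns = [
        '/<style\b[^>]*>.*?<\/style>/is',   // <style> blocks and their contents
        '/<script\b[^>]*>.*?<\/script>/is', // <script> blocks and their contents
        '/<link\b[^>]*>/i',                 // every <link> tag (stylesheets, icons, ...)
        '/<img\b[^>]*>/i',                  // <img> tags
        '/\sstyle=(["\']).*?\1/is',         // quoted inline style attributes
    ];

    $result = preg_replace($patterns, '', $html);

    // preg_replace() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// The name comes from the question; this body is only an assumed implementation.
function get_plain_html(string $url): string
{
    $html = file_get_contents($url);
    if ($html === false) {
        return ''; // the request failed
    }
    return strip_styles_and_images($html);
}

$sample = '<html><head><link rel="stylesheet" href="a.css"><style>p{color:red}</style></head>'
        . '<body><p style="margin:0">Hello <img src="ad.png" alt="ad"></p></body></html>';

echo strip_styles_and_images($sample), "\n";
// prints: <html><head></head><body><p>Hello </p></body></html>
```

Calling get_plain_html('http://www.filehippo.com') would still download the whole HTML document before stripping it, so this probably does not make the fetch itself faster. Note that file_get_contents() only retrieves the HTML; the images and stylesheets show up because the browser requests them when it renders the echoed markup.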
