php - Detecting duplicate URLs in this web crawler

Question

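I wrote a web crawler in PHP and pointed it at eBay. It collects every link on a given web page, but it sometimes returns several URLs for the same link. This puts a strain on my database, and I don't know how to tweak the code.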

   <?php

 session_start();

 $domain = "www.ebay.com";

  if(empty($_SESSION['page']))
  {
  $original_file = file_get_contents("http://" . $domain . "/");

 $_SESSION['i'] = 0;

  $connect = mysql_connect("xxxxxx", "xxxxxxxxxx", "xxxxxxxxxxxx");

  if (!$connect)
  {
  die("MySQL could not connect!");
  }

 $DB = mysql_select_db('xxxxxxxxxxxxx');

 if(!$DB)
 {
 die("MySQL could not select Database!");
 }
 }
 if(isset($_SESSION['page']))
 {

 $connect = mysql_connect("xxxxxxxxxxxxx", "xxxxxxxxxxxxx", "xxxxxxxxxxxx");

 if (!$connect)
 { 
 die("MySQL could not connect!");
 }

 $DB = mysql_select_db('xxxxxxxx');

  if(!$DB)
  {
  die("MySQL could not select Database!");
  }
  $PAGE = $_SESSION['page'];
  $original_file = file_get_contents("$PAGE");
   }

  $stripped_file = strip_tags($original_file, "<a>");
  preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is",   $stripped_file, $matches);

  foreach($matches[1] as $key => $value)
   {

  if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
  {
  $New_URL = "http://" . $domain . $value;
  }
  else
  {
  $New_URL = $value;
   } 
  $New_URL = addslashes($New_URL);
  $Check = mysql_query("SELECT * FROM pages WHERE url='$New_URL'");
  $Num = mysql_num_rows($Check);

  if($Num == 0)
  {
  mysql_query("INSERT INTO pages (url)
  VALUES ('$New_URL')");

  $_SESSION['i']++;

  echo $_SESSION['i'] . "";
  }
  echo mysql_error();
  }

  $RandQuery = mysql_query("SELECT DISTINCT * FROM pages ORDER BY RAND() LIMIT 0,1");
  $RandReturn = mysql_num_rows($RandQuery);
  while($row1 = mysql_fetch_assoc($RandQuery))
  {
  $_SESSION['page'] = $row1['url'];
  }
  echo $RandReturn;
  echo $_SESSION['page'];
  mysql_close();

   ?>

score 0 · Accepted Answer

First, there is a small problem with your link scraper.

You are using:

            if(strpos($value,"http://") != 'FALSE' && strpos($value,"https://") != 'FALSE')
            {
                $New_URL = "http://" . $domain . $value;
            }
            else
            {
                $New_URL = $value;
            }

after stripping all the tags.

The problem arises when a link's HREF looks like this:

<a href='#' ...> or <a href='javascript:func()'> or <a href='img...'> etc...

In those cases the code builds invalid URLs that you don't need. You should use strpos() or preg_match() to exclude these specific cases (and others).
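As a rough sketch of such checks: the helper names `is_absolute_url()` and `is_pseudo_link()` below are hypothetical and not part of the original code. One thing to watch for is that strpos() returns the integer 0 for a match at the start of the string and the boolean false for no match, so the result needs a strict comparison rather than a comparison with the string 'FALSE'.

```php
<?php
// Hypothetical helper: true when $href already starts with an http(s) scheme.
function is_absolute_url($href)
{
    // A match at position 0 is the integer 0; no match is the boolean false.
    // Strict comparison keeps the two apart.
    return strpos($href, 'http://') === 0 || strpos($href, 'https://') === 0;
}

// Hypothetical helper: true for empty links, in-page anchors such as "#",
// and javascript: pseudo-links, none of which point to a crawlable page.
function is_pseudo_link($href)
{
    return $href === '' || $href[0] === '#' || stripos($href, 'javascript:') === 0;
}

// Usage: keep only links that are worth resolving.
$hrefs = array('#', 'javascript:func()', '/deals', 'http://www.ebay.com/help');
$kept = array();
foreach ($hrefs as $href) {
    if (!is_pseudo_link($href)) {
        $kept[] = $href;
    }
}
```

With the sample input above, only '/deals' and 'http://www.ebay.com/help' remain in `$kept`.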

You should also consider skipping URLs that link to files such as jpg, png, avi, wmv, zip, and so on.
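One possible way to do this, as a minimal sketch: `links_to_file()` is a hypothetical helper name, and the extension list is only an example to extend as needed. It checks the path part of the URL, so a query string after the file name does not hide the extension.

```php
<?php
// Hypothetical helper: true when the URL path ends in a file extension
// that the crawler should not follow.
function links_to_file($href)
{
    // parse_url() returns null when there is no path and false for
    // seriously malformed URLs, so make sure we got a string.
    $path = parse_url($href, PHP_URL_PATH);

    return is_string($path)
        && preg_match('#\.(jpe?g|png|gif|bmp|avi|wmv|mpeg|mp3|zip)$#i', $path) === 1;
}

var_dump(links_to_file('/images/logo.PNG'));       // bool(true)
var_dump(links_to_file('/files/archive.zip?v=2')); // bool(true)
var_dump(links_to_file('/sch/i.html?_nkw=phone')); // bool(false)
```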

Now for your question:

First, store all the URLs from the target page in an array. Then remove every duplicate value from that array. This keeps the time spent on SQL queries to a minimum...

A quick test using www.ebay.com:

Before removing duplicate URLs: 196.
After removing them: 120.
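The deduplication step could look roughly like this. The sample URLs are made up for illustration (they are not the eBay test data above), and `normalize_url()` is a hypothetical helper that brings trivially different spellings of the same URL to one form before `array_unique()` compares them.

```php
<?php
// Made-up sample of hrefs scraped from a single page.
$found = array(
    'http://www.ebay.com/deals',
    'http://www.ebay.com/deals/',
    'http://www.ebay.com/deals#top',
    'http://www.ebay.com/help',
    'http://www.ebay.com/deals',
);

// Hypothetical helper: drop the fragment and any trailing slashes.
function normalize_url($url)
{
    $url = preg_replace('/#.*$/', '', $url);
    return rtrim($url, '/');
}

// array_unique() keeps the first occurrence of each value;
// array_values() reindexes the result from 0.
$unique = array_values(array_unique(array_map('normalize_url', $found)));

print_r($unique);
```

Here five scraped links collapse to two distinct URLs, so only two database lookups are needed instead of five.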

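Now use:

SELECT EXISTS(SELECT 1 FROM table1 WHERE ...)

to check whether a URL already exists in the database. It is faster and more reliable, because the query always returns a single row holding 1 or 0 instead of every matching record.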
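A minimal sketch of that check, under some assumptions: it uses PDO with a prepared statement instead of the mysql_* functions from the question (those were removed in PHP 7), `url_exists()` is a hypothetical helper name, and `$pdo` is assumed to be an already open connection to the database that holds the `pages` table.

```php
<?php
// Hypothetical helper: returns true if $url is already stored in `pages`.
// $pdo is assumed to be an open PDO connection supplied by the caller.
function url_exists(PDO $pdo, $url)
{
    $stmt = $pdo->prepare('SELECT EXISTS(SELECT 1 FROM pages WHERE url = ?)');
    $stmt->execute(array($url));

    // EXISTS always yields exactly one row holding 1 or 0,
    // so read that value instead of counting rows.
    return (bool) $stmt->fetchColumn();
}

// Usage (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=crawler', $user, $password);
// if (!url_exists($pdo, $url)) { /* INSERT the new URL here */ }
```

Binding the URL as a parameter also removes the need for manual escaping.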

Here is your code with my changes:

    $stripped_file = strip_tags($original_file, "<a>");
    preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

    // Extensions of files to avoid; add more as needed.
    $file_arr = array('#\.jpg$#i', '#\.png$#i', '#\.mpeg$#i', '#\.bmp$#i', '#\.gif$#i', '#\.wmv$#i', '#\.mp3$#i', '#\.avi$#i');
    $New_URL = array();

    foreach ($matches[1] as $key => $value)
    {
        $value = preg_replace('/#.*/', '', $value); // removes the page position (fragment)

        // Skip empty links and javascript: pseudo-links.
        if ($value == '' || stripos($value, "javascript:") === 0)
        {
            continue;
        }

        // Check whether the URL links to a file [0-no, 1-yes]; reset the flag for every URL.
        $avoid = 0;
        foreach ($file_arr as $val_reg)
        {
            if (preg_match($val_reg, $value)) { $avoid = 1; break; }
        }
        if ($avoid == 1)
        {
            continue;
        }

        $value = preg_replace('#/$#', '', $value); // strip a trailing '/' so exactly one can be appended below

        // strpos() returns false when nothing is found, so compare with === false, not with the string 'FALSE'.
        if (strpos($value, "http://") === false && strpos($value, "https://") === false)
        {
            $New_URL[$key] = "http://" . $domain . $value . "/"; // relative link: prepend the domain
        }
        else
        {
            $New_URL[$key] = $value . "/"; // already an absolute URL
        }
    }

    $New_URL = array_unique($New_URL); // remove duplicates found on the same page

    // Check for duplicates before storing the URL.
    // Note: the mysql_* functions were removed in PHP 7; they are kept here to match the question.
    foreach ($New_URL as $check)
    {
        $check = mysql_real_escape_string($check);
        $Check_ex = "SELECT EXISTS (SELECT 1 FROM pages WHERE url='$check' LIMIT 1)"; // EXISTS returns 1 if the URL exists, 0 otherwise
        $result = mysql_query($Check_ex);
        // The EXISTS query always returns exactly one row, so read its value instead of counting rows.
        if ($result && mysql_result($result, 0) == 0)
        {
            //Insert your query here......
        }
        else
        {
            //Don't store your query......
        }
    }

It's not the cleanest code, but it should work...
