php - preg_match_all not working on Yahoo results

Question

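preg_match_all does not work on Yahoo's results for me.

I'm using cURL's curl_multi_getcontent to fetch the page, and I want to run preg_match_all over every result returned by Yahoo.

Fetching the site succeeds, but when I try to extract the result links, nothing matches. The same regular expression matches in Notepad++, but apparently not in PHP.

I'm currently using: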

preg_match_all(
    '#<span class="url" id="(.*?)">(.+?)</span>#si', $urlContents[2], $yahoo
);

If you look at the HTML of, for example, [http://se.search.yahoo.com/search?p=random&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t][1], you can see that every link starts with <span class="url" id="something random"> and ends with </span>.

Can anyone help me retrieve this information? I only need the actual link address of each result.

Whole PHP script

public function multiSearch($question)
{
    $sites['google'] = "http://www.google.com/search?q={$question}&gl=sv";
    $sites['bing'] = "http://www.bing.com/search?q={$question}";
    $sites['yahoo'] = "http://se.search.yahoo.com/search?p={$question}";

    $urlHandler = array();

    foreach($sites as $site)
    {
        $handler = curl_init();
        curl_setopt($handler, CURLOPT_URL, $site);
        curl_setopt($handler, CURLOPT_HEADER, 0);
        curl_setopt($handler, CURLOPT_RETURNTRANSFER, 1);

        array_push($urlHandler, $handler);
    }

    $multiHandler = curl_multi_init();
    foreach($urlHandler as $key => $url)
    {
        curl_multi_add_handle($multiHandler, $url);
    }

    $running = null;
    do
    {
        curl_multi_exec($multiHandler, $running);
    }
    while($running > 0);

    $urlContents = array();
    foreach($urlHandler as $key => $url)
    {
        $urlContents[$key] = curl_multi_getcontent($url);
    }

    foreach($urlHandler as $key => $url)
    {
        curl_multi_remove_handle($multiHandler, $url);
    }

    foreach($urlContents as $urlContent)
    {
        preg_match_all('/<li class="g">(.*?)<\/li>/si', $urlContent, $matches);
        //$this->view_data['results'][] = "Random";
    }
    preg_match_all('#<cite>(.+?)</cite>#si', $urlContents[1], $googleLinks);
    preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $urlContents[2], $yahoo);
    var_dump($yahoo);
    die();
    $findHtml = array('/<cite>/', '/<\/cite>/', '/<b>/', '/<\/b>/', '/ /', '/"/', '/<strong>/', '/<\/strong>/');
    $removeHtml = array('', '', '', '', '', '', '', '');
    foreach($googleLinks as $links => $val)
    {
        foreach($val as $link)
            $this->view_data['results'][] = preg_replace($findHtml, $removeHtml, $link);
        break;
    }
}

score 2 · Accepted Answer

First of all, don't use regular expressions to process HTML. PHP ships with very good DOM parsers. For example:

$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//span[@class="url"]') as $node) {
    // process each node the way you wish
    // print the id for instance
    echo $node->getAttribute('id'), PHP_EOL;
}
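
Since the question asks for the link addresses rather than the ids, a minimal sketch along the same lines could read the text content of each matching node instead. The markup in $html is made up for illustration and the real Yahoo page may be structured differently. The call to libxml_use_internal_errors(true) is an addition that keeps malformed real-world HTML from flooding the output with warnings.

```php
<?php
// Hypothetical sample markup; the structure of the real page is an assumption.
$html = '<div><span class="url" id="link-1">www.example.com/<b>random</b></span>'
      . '<span class="url" id="link-2">example.org/page</span></div>';

// Real-world HTML is rarely valid; collect parse warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = array();
foreach ($xpath->query('//span[@class="url"]') as $node) {
    // textContent drops nested tags such as <b>, leaving only the visible text.
    $links[] = trim($node->textContent);
}

print_r($links);
```

With this approach there is no need for the preg_replace cleanup of <b> and <strong> tags, because the parser already separates text from markup.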

Besides that, your expression should work, except that id="(.*)" is greedy. That can be fixed with:

#<span class="url" id="(.*?)">(.+?)</span>#si

There may also be more text between id="..." and the closing >. Allowing for that gives the expression:

#<span class="url" id="(.*?)"[^>]*>(.+?)</span>#si
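
To see the difference between the greedy and the corrected pattern, here is a small sketch run against made-up markup. The extra dir attribute and the example hostnames are assumptions chosen to show the effect, not taken from the real Yahoo page.

```php
<?php
// Hypothetical markup: two result spans on one line, each with an extra attribute.
$html = '<span class="url" id="a" dir="ltr">example.com</span>'
      . '<span class="url" id="b" dir="ltr">example.org</span>';

// Greedy id="(.*)": the first group runs on to the last '">' it can reach,
// so both spans collapse into a single match.
preg_match_all('#<span class="url" id="(.*)">(.+?)</span>#si', $html, $greedy);

// Lazy id="(.*?)" plus [^>]* to skip any further attributes before '>'.
preg_match_all('#<span class="url" id="(.*?)"[^>]*>(.+?)</span>#si', $html, $fixed);

echo count($greedy[0]), PHP_EOL; // 1
echo count($fixed[0]), PHP_EOL;  // 2
print_r($fixed[2]);              // the text of each span
```

In the corrected pattern, $fixed[1] holds the ids and $fixed[2] the link text, one entry per result.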
