php - How can I parse this HTML with a regular expression?

Question

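I'm trying to write a regular expression that extracts the href and the anchor text of each link in a list of URLs from an HTML source. The anchor text can be any value.

The HTML part looks like this: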

<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>

I tried the following regular expression, but it doesn't work: it grabs everything up to the last </a> tag.

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="see-all">(.*)<\/a>/', $source , $website_array);

What regular expression would work to extract the data I need?

score 6 · Accepted Answer

In case you need to know: the expression is greedy, so it can match from the start of the first anchor to the end of the last one. The /U modifier fixes that:

preg_match_all('/<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)<\/a>/U', $source , $website_array);

Be aware that pcre.backtrack_limit also applies in ungreedy mode.
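The difference between greedy and ungreedy matching is easier to see on a small input. The following is a minimal sketch, assuming a shortened two-anchor string modelled on the question's markup; the variable names `$greedy` and `$ungreedy` are illustrative only.

```php
<?php
// Shortened sample input modelled on the question's markup (an assumption).
$source = '<a href="http://url1.com">URL1</a><a href="http://url2.com">This is Url-2</a>';

// Greedy: (.*) runs as far as it can, so a single match spans both anchors.
preg_match_all('/<a href="(.*)">(.*)<\/a>/', $source, $greedy);

// Ungreedy: the U modifier makes every quantifier stop at the shortest match.
preg_match_all('/<a href="(.*)">(.*)<\/a>/U', $source, $ungreedy);

echo count($greedy[0]), "\n";    // number of matches in greedy mode
echo count($ungreedy[0]), "\n";  // number of matches in ungreedy mode
print_r($ungreedy[1]);           // the captured href values
```

In greedy mode the first capture group swallows everything between the first `href="` and the last `">`, which is why the whole string comes back as one match.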

Using negated character classes will probably perform better:

preg_match_all('/<a rel="nofollow" target="_blank" href="([^"]*)" class="get-all">([^<]*)<\/a>/', $source , $website_array);

This runs into trouble when there are tags inside the anchor itself.
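To illustrate that limitation, here is a small sketch with a hypothetical input in which the second anchor wraps part of its text in a `<b>` element. Because `[^<]*` cannot cross a `<`, that anchor is silently skipped.

```php
<?php
// Hypothetical input: the second anchor contains a nested <b> tag.
$source = '<a href="http://url1.com">URL1</a>'
    . '<a href="http://url2.com">This is <b>Url-2</b></a>';

// [^"]* and [^<]* are negated character classes: they stop at the first " or <.
preg_match_all('/<a href="([^"]*)">([^<]*)<\/a>/', $source, $matches);

print_r($matches[1]); // only the href of the first anchor is captured
```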

Given the limitations above, I would seriously consider using an HTML parser:

// Parse the markup and query it with XPath instead of a regex.
$d = new DOMDocument;
$d->loadHTML($source);
$xp = new DOMXPath($d);
foreach ($xp->query('//a[@class="get-all"][@rel="nofollow"][@target="_blank"]') as $anchor) {
    $href = $anchor->getAttribute('href'); // the link target
    $text = $anchor->nodeValue;            // the anchor text
}

This properly handles attributes that appear in a different order and lets you query further inside each anchor.
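As a rough illustration of both points, the sketch below feeds the parser two anchors whose attributes are ordered differently, one of them with a nested `<b>` tag. The sample markup, the `$links` array and the narrowed XPath expression are assumptions made for this example, not part of the original answer.

```php
<?php
// Sample input: attribute order differs per anchor, and one anchor nests a <b>.
$source = '<div class="links">'
    . '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a class="get-all" href="http://url2.com" rel="nofollow">This is <b>Url-2</b></a>'
    . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// Select anchors by class inside the "links" div, regardless of attribute order.
foreach ($xpath->query('//div[@class="links"]/a[@class="get-all"]') as $anchor) {
    $links[] = [
        'href' => $anchor->getAttribute('href'),
        'text' => $anchor->textContent, // text of the anchor and its descendants
    ];
}

print_r($links);
```

Using `libxml_use_internal_errors(true)` keeps warnings about imperfect markup out of the output without resorting to the `@` operator.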

score 2 · Accepted Answer

Try:

preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source , $website_array);

It matches every link and returns an array with the information. Note:

[^"] - matches any character except "
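By default `preg_match_all` groups its results per capture group. If one entry per link is more convenient, the `PREG_SET_ORDER` flag can be passed, as in this sketch; the shortened sample markup and the `$links` array are assumptions for illustration.

```php
<?php
// Shortened sample markup modelled on the question (an assumption).
$source = '<div class="links">'
    . '<a rel="nofollow" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" href="http://url2.com" class="get-all">This is Url-2</a>'
    . '</div>';

// PREG_SET_ORDER returns one array per link instead of one array per group.
preg_match_all('/<a[^>]+href="([^"]+)"[^>]*>([^>]+)<\/a>/is', $source, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    // $m[0] is the whole tag, $m[1] the href, $m[2] the anchor text.
    $links[] = ['href' => $m[1], 'text' => $m[2]];
}

print_r($links);
```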

score 1 · Accepted Answer

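Parsing HTML with regular expressions is generally a bad idea (I'd recommend looking at the DOMDocument class for a better solution), but it can work when you have a very specific idea of what you are trying to extract and can guarantee that the variable text will never actually break your regex.

In your case, you might try: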

$pattern = '#<a rel="nofollow" target="_blank" href="(.*)" class="get-all">(.*)</a>#U';
preg_match_all($pattern, $source, $website_array);

Note the ungreedy modifier (U) at the end. It is essential here so that each match is as small as possible.
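An equivalent way to get the smallest possible match, offered here as an alternative sketch and not as part of the original answer, is to make each quantifier lazy with `?` and leave out the U modifier. The two-anchor input is a shortened assumption based on the question's markup.

```php
<?php
// Shortened sample input modelled on the question's markup (an assumption).
$source = '<a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a>'
    . '<a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a>';

// (.*?) is lazy on its own, so no U modifier is needed.
$pattern = '#<a rel="nofollow" target="_blank" href="(.*?)" class="get-all">(.*?)</a>#';
preg_match_all($pattern, $source, $website_array);

print_r($website_array[1]); // the href values
print_r($website_array[2]); // the anchor texts
```

Combining both would be a mistake: under the U modifier a trailing `?` flips the quantifier back to greedy.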

score 0 · Accepted Answer

Alternatively, you could do it like this:

<?php
$html = <<<HTML
<div class="links"><a rel="nofollow" target="_blank" href="http://url1.com" class="get-all">URL1</a><a rel="nofollow" target="_blank" href="http://url2.com" class="get-all">This is Url-2</a><a rel="nofollow" target="_blank" href="http://url3.com" class="get-all">This is Url-3</a><a rel="nofollow" target="_blank" href="http://url4.com" class="get-all">Sweet URL 4</a></div>
HTML;


$xml = new DOMDocument();
@$xml->loadHTML($html);

$links=array();
$i=0;
//Get all divs
foreach($xml->getElementsByTagName('div') as $divs) {
    //if this div has a class="links"
    if($divs->getAttribute('class')=='links'){
        //loop through the anchors inside this div only
        foreach($divs->getElementsByTagName('a') as $a){
            //if this a tag does not have a class="get-all" continue to next
            if($a->getAttribute('class')!='get-all')
            continue;

            //Assign values to the links array
            $links[$i]['href']=$a->getAttribute('href');
            $links[$i]['value']=$a->nodeValue;
            $i++;
        }

    }
}

print_r($links);
/*
Array
(
    [0] => Array
        (
            [href] => http://url1.com
            [value] => URL1
        )

    [1] => Array
        (
            [href] => http://url2.com
            [value] => This is Url-2
        )

    [2] => Array
        (
            [href] => http://url3.com
            [value] => This is Url-3
        )

    [3] => Array
        (
            [href] => http://url4.com
            [value] => Sweet URL 4
        )

)
*/
?>
