php - Extracting href descriptions based on criteria using a regex

Question

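Possible duplicate:
How to parse and process HTML with PHP?

I need to parse a block of HTML, replacing some links with their description text, depending on whether the description meets certain criteria.

The regex I use to identify the specific strings is also used elsewhere in the application.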

$regex  = "/\b[FfGg][\.][\s][0-9]{1,4}\b/";
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

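I am using the following SO question as a starting point for extracting the href descriptions:

Replace HTML link tags with their text description

The idea is to convert the links that carry an "FfGg.xxxx"-type identifier and leave the rest alone (i.e. the Google link).

Here is what I have so far: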

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD 
show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in 
severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.
</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case 
reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a 
href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a 
href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" 
target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a 
href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a 
href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" 
target="F.96">F.96</a>);';

This converts all of the links, including the Google one:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>(.*?)<\/a>/i", "$2", $html);

This returns a blank HTML string:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>[FfGg][\.][\s][0-9]{1,4}<\/a>/i", "$2", $html);

I think the problem lies in how I am embedding this regex in the second (non-working) example above:

[FfGg][\.][\s][0-9]{1,4}

What is the correct way to embed the FfGg expression in the preg_replace example above so that it matches the identifiers found in the HTML?

score 2 · Accepted Answer

Here is the DOM (correct) way to do this:

Edit: improved the regex

<?php

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" target="F.96">F.96</a>);';

    // Create a new DOMDocument and load the HTML string
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Create an XPath object for this DOMDocument
    $xpath = new DOMXPath($dom);

    // Loop over all <a> elements in the document
    // Ideally we would combine the regex into the XPath query, but XPath 1.0
    // doesn't support it
    foreach ($xpath->query('//a') as $anchor) {
        // See if the link matches the pattern
        if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
            // If it does, convert it to a text node (effectively, un-linkify it)
            $textNode = new DOMText($anchor->nodeValue);
            $anchor->parentNode->replaceChild($dom->importNode($textNode), $anchor);
        }
    }

    // Because you are working with partial HTML string, I extract just that
    // string. If you are actually working with a full document, you can
    // replace all the code below this comment with simply:
    // $result = $dom->saveHTML();

    // A string to hold the result
    $result = '';

    // Iterate all elements that are a direct child of the <body> and convert
    // them to strings
    foreach ($xpath->query('/html/body/*') as $node) {
        $result .= $node->C14N();
    }

    // $result now contains the modified HTML string

See it working (note: the error messages you see are there because the HTML string you supplied is invalid)
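If those parser messages get in the way, they can be collected rather than displayed. The following is only a minimal sketch: the $html fragment is a hypothetical, shortened stand-in for the string in the question, and whether you log or discard the collected errors is up to you. libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are standard PHP functions.

```php
<?php
// Hypothetical shortened fragment with a stray </li>, similar to the question's HTML
$html = '<p>See <a href="/docpage.php?obscure==100" target="F.100">F.100</a></li>'
      . ' and <a href="http://google.com">Google!</a></p>';

// Collect libxml parse errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument('1.0');
$dom->loadHTML($html);

// Inspect (or log) whatever the parser complained about, then empty the buffer
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode
libxml_use_internal_errors($previous);

printf("%d parse problem(s) collected\n", count($errors));
```

The rest of the code above (the XPath loop and the serialisation) can then run on $dom unchanged.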

score 2 · Accepted Answer

You should not parse HTML with regular expressions; they cannot handle every case correctly. Here are some examples of valid HTML that will break a link-searching regex:

<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>
<area>...</area>  ...  <a href="www.foo.com">F.100</a>
<a href="www.foo.com">F.100</a >

For a better approach, I recommend looking at this question: How do you parse and process HTML/XML in PHP?
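To make the first sample line concrete, here is a rough sketch comparing the two approaches. The pattern is the one from the question with the identifier part loosened so that it can match "F.100" at all (that adjustment is my assumption, not the asker's code); everything else uses standard PHP calls.

```php
<?php
// First sample line from above: a commented-out link followed by a real one
$html = '<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>';

// The question's pattern, with the identifier part adjusted to match "F.100"
$regex = '/<a.*?href="(.*?)".*?>[FG]\.\d{1,4}<\/a>/i';
preg_match($regex, $html, $m);
echo $m[1], "\n"; // www.blah.com (the href of the commented-out link)

// A DOM parser treats the comment as a comment and sees only the real link
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
echo implode(', ', $hrefs), "\n";
```

The regex starts its match inside the comment and reports the wrong href, while the DOM lookup only ever returns elements that are really part of the document.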

score 1 · Accepted Answer

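You shouldn't rely so heavily on reluctant quantifiers. They try to consume as few characters as possible, but they will consume as many as it takes to achieve an overall match. If the HTML is minified (specifically, if it has few or no line breaks), each of those .*? can end up trying to consume the entire rest of the document, and it may have to do so many times.

This is especially true when no match is possible: the regex has to try every possible path through the text before it admits defeat. Another problem is that reluctant quantifiers do nothing to prevent a match from starting too early. Given this string: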

<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>

...it will start matching at the first <a> tag and stop at the end of the second. In this regex:

'~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i'

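...I replaced every .*? with [^>]* or [^"]*, which confines each of those parts of the match to a single tag or attribute value. This regex works much better, but note that it is not foolproof. It is, however, about as good as you can reasonably get when matching HTML with a regex.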
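As a quick illustration, the tightened pattern can be dropped into preg_replace like this. It is a sketch only: the $html sample is a hypothetical shortened version of the question's string, and the $lazy pattern is the question's pattern with the identifier part adjusted and captured, for comparison.

```php
<?php
// Hypothetical sample: one ordinary link and two identifier links
$html = '<a href="http://google.com">Google!</a> ('
      . '<a href="/docpage.php?obscure==100" target="F.100">F.100</a>, '
      . '<a href="/docpage.php?obscure==68" target="F.68">F.68</a>)';

// The tightened pattern from above; group 1 is the link text
$regex = '~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i';

// Replace each matching link with its text; other links are left untouched
$result = preg_replace($regex, '$1', $html);
echo $result, "\n"; // <a href="http://google.com">Google!</a> (F.100, F.68)

// The two-link string from above, where a lazy pattern starts too early
$twoLinks = '<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>';
$lazy = '~<a.*?href="(.*?)".*?>([FG]\.\d{1,4})</a>~i';

$lazyResult  = preg_replace($lazy, '$2', $twoLinks);
$tightResult = preg_replace($regex, '$1', $twoLinks);
echo $lazyResult, "\n";  // F.100 (the first link was swallowed by the match)
echo $tightResult, "\n"; // <a href="www.blah.com">...</a> F.100
```

Because [^>]* cannot cross the end of a tag, a failed attempt at the first link is abandoned immediately instead of stretching on to the next one.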
