html - Perl regex to extract a value from nested HTML tags

Question

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if ($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/) {
    $title = $1;
} else {
    $title = "";
}
print "$title";

Output: Google</b></h1>

Expected: Google

I can't extract the value from a link with a regex in Perl. The input may have one more or one fewer level of nesting, for example:

<h1><b><i>Google</i></b></h1>

Try it on these inputs:

1) <td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>

2) <a href="http://www.hp.com"><h1><b>HP</b></h1></a>

3) <a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);

4) <a href="#cite_note-1"><span>[</span>1<span>]</span></a>

Expected output:

Unix shell

HP

generic

[1]

score 5 · Accepted Answer

As noted in the comments, don't use a regex for this. I particularly like the Mojo suite, which lets you use CSS selectors.

use Mojo::DOM;

my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->at('a[href="#google"]')->all_text, "\n";

Or HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;

my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->findvalue('//a[@href="#google"]'), "\n";
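To make the parser approach concrete for the four inputs in the question, here is a minimal sketch using Mojo::DOM. It assumes Mojolicious is installed and that the parser tolerates the unbalanced <td> fragments, which it is generally lenient about. The names link_texts and @samples are made up for this illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# The four sample inputs from the question (trailing ");" residue omitted).
my @samples = (
    q(<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>),
    q(<a href="http://www.hp.com"><h1><b>HP</b></h1></a>),
    q(<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>),
    q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>),
);

# link_texts is a hypothetical helper: it returns the text of every <a>
# element in a fragment, however deeply the text is nested inside it.
sub link_texts {
    my ($html) = @_;
    return Mojo::DOM->new($html)->find('a')->map('all_text')->each;
}

print "$_\n" for map { link_texts($_) } @samples;
```

Selecting every a element instead of one specific href means the same code handles any nesting depth without changing the selector.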

score 2 · Accepted Answer

Try this:

if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)

That should capture everything after the href that sits between the <b>...</b> tags.

Alternatively, to capture everything after the last > and before the first </, you can use:

<a.*?href.*?>([^>]*?)<\/
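If a regex has to be used anyway, one possible variation is to capture the whole body of the link and then delete any nested tags from it, so that the nesting depth no longer matters. This is only a sketch using core Perl. The helper name anchor_text is made up here, and it assumes that attribute values never contain a > character and that links are not nested inside each other.

```perl
use strict;
use warnings;

# anchor_text is a hypothetical helper, not part of any library.
# It captures everything between the first <a ...> and the next </a>,
# then removes whatever tags remain inside the captured text.
sub anchor_text {
    my ($html) = @_;
    my ($inner) = $html =~ m{<a\b[^>]*>(.*?)</a>}is or return '';
    $inner =~ s/<[^>]+>//g;    # strip <h1>, <b>, <span>, ...
    return $inner;
}

print anchor_text(q(<a href="#google"><h1><b>Google</b></h1></a>)), "\n";
print anchor_text(q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>)), "\n";
```

This is still fragile on real-world HTML (comments, scripts, unusual quoting), which is why the parser-based answers remain the safer choice.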

score 0 · Accepted Answer

I came up with this regex, which works on all of the sample inputs under PCRE. It is equivalent to a regular grammar with the tail-recursive pattern (?1)*.

(?<=>)((?:\w+)(?:\s*))(?1)*

Just take the first element of the returned array, i.e. array[0].

score 0 · Accepted Answer

~~In this simple case, you can use the following.~~ The requirements are no longer simple. See @amon's answer for how to use an HTML parser.

/<a.*?>([^<]+)</

This matches the opening a tag, then skips ahead until it finds some text between a > and a <.

As others have said, you should normally use an HTML parser.

echo '<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>
<a href="http://www.hp.com"><h1><b>HP</b></h1></a>
<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic
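The fourth input from the question shows where this pattern stops being enough. The short check below uses only core Perl, and the variable names are chosen just for this illustration.

```perl
use strict;
use warnings;

# Fourth sample input from the question: the link text is split across tags.
my $html = q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>);

# Same pattern as above: it captures only the first run of text it finds.
my ($first) = $html =~ /<a.*?>([^<]+)</;

print "$first\n";    # prints "[" rather than the desired "[1]"
```

Because the pattern captures only the first run of text, link text that is spread over several child elements comes back truncated.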
