regex - How to parse multi-line HTML with a regular expression in Perl

Question

I'm trying to parse a multi-line string with Perl, but all I get is the number of matches. Here is a sample of what I'm parsing:

<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>

I'm trying to store the content in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ",  @a+0,  ", elements: ";
print join (" ", @a);
print "\n";

However, it returns everything, not just the text inside the div. Can someone point out the error in my regex?

魔理沙

score 7 · Accepted Answer

Use a robust HTML parser:

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
$w->find('div.content')->text

The expression returns: Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

score 5 · Accepted Answer

Use something designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl

use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;

my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);

print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];

Notice how much you gain by not relying on a homemade regex pattern to cope with every variation in the input.

Output:

---
- '  Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
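
To make that point concrete, here is a minimal sketch (an illustration, not part of the original answer) that runs a non-greedy regex against the same input. The variable names $captured and $has_span are assumptions chosen for this example. Because one of the comments contains a literal </div>, the pattern stops there and never reaches the span:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same markup as in the answer above; the comments contain literal "<div>" and "</div>".
my $doc = <<'EO_HTML';
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

# Naive approach: capture everything up to the first "</div>" after class="content">.
my ($captured) = $doc =~ m/class="content">(.*?)<\/div>/s;

# The match ends at the "</div>" inside the comment, so the <span> text is lost.
my $has_span = index($captured, 'Here I am') >= 0 ? 'yes' : 'no';
print "captured the span text: $has_span\n";    # prints "captured the span text: no"
```

The parser-based version does not have this problem because it treats the comment as a comment node rather than as markup.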

score 4 · Accepted Answer

You are only matching the string, not extracting anything from it. If you want the text in the middle of the div, you have to say so with a capture group:

# Only read $1 when the match succeeded; otherwise it keeps a stale value.
if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1;
    print $text;
}

Whatever the parentheses matched is stored in the $1 variable. If there are several instances of such a div[class=content], you need a loop like this:

use strict; use warnings;
use Data::Dumper;

my $html = qq~<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
<div id="content-ZAJ9E" class="content">
        I cant get enough!
</div>
~;

my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
  my $group = $1;     
  $group =~ s/^\s+//; # strip whitespace at the beginning
  $group =~ s/\s+$//; # and the end

  push @matches, $group;
}
print Dumper \@matches;
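
As a side note, the same result can be had without an explicit loop: in list context, a match with /g returns every captured group at once. The sketch below is only an illustration under assumptions: the ids content-1 and content-2 and the array names @greedy and @lazy are made up for the example. It also shows what goes wrong with the greedy .* when there is more than one div:

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div id="content-1" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-2" class="content">
        I still love it.
</div>
HTML

# Greedy: .* runs to the LAST </div>, so both divs collapse into one capture.
my @greedy = $html =~ m/class="content">(.*)<\/div>/gs;

# Non-greedy: .*? stops at the first </div> after each opening tag.
# The \s* on either side of the group trims the surrounding whitespace.
my @lazy = $html =~ m/class="content">\s*(.*?)\s*<\/div>/gs;

print scalar(@greedy), " greedy capture(s)\n";
print scalar(@lazy), " non-greedy capture(s): ", join(' | ', @lazy), "\n";
```

This is still pattern matching on text, so it shares the limits of the loop above: nested divs or a </div> inside a comment will confuse it.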

I recommend taking a look at perlre and perlretut.

A few notes:

Always use strict and use warnings!
Try Data::Dumper. It is great for debugging variables.
Using regular expressions to parse HTML is not the best idea. If you do a lot of parsing, consider one of the modules available on CPAN, such as HTML::Parser, HTML::TreeBuilder::XPath, HTML::TokeParser::Simple, or Mojo::DOM, or search SO for one.
