regex - Extract all links of a specific form

Question

I have a page (e.g. http://www.stephenfry.com/) from which I want to extract all the links. I want to put every link of the form http://www.stephenfry.com/WHATEVER into an array. All I have so far is the following:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

# I ONLY WANT TO USE JUST THESE

my $url = 'http://www.stephenfry.com/';

my $doc = get( $url );

my $adt = HTML::Tree->new();
$adt->parse( $doc );

my @juice = $adt->look_down(
    _tag => 'a',
    href => 'REGEX?'
);

I don't know how to get only those links.

score 1 · Accepted Answer

You need to use the extract_links() method rather than look_down():

use strict;
use warnings;
use LWP::Simple;
use HTML::Tree;

my %seen;
my $url = 'http://www.stephenfry.com/';
my $doc = get($url);

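my $adt = HTML::Tree->new();
$adt->parse($doc);
$adt->eof();    # signal end of input so the tree is finalized

# extract_links('a') returns an array ref of [ $link, $element, $attr, $tag ]
my $links_array_ref = $adt->extract_links('a');

# Keep only links that mention www.stephenfry.com, dropping duplicates
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->[0],
  @$links_array_ref;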

print "$_\n" for @links;

Partial output:

http://www.stephenfry.com/
http://www.stephenfry.com/blog/
http://www.stephenfry.com/category/blessays/
http://www.stephenfry.com/category/features/
http://www.stephenfry.com/category/general/
...
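
If you would rather stay with look_down() as in the question, it also accepts a compiled regex (qr//) as an attribute criterion, so the 'REGEX?' placeholder can become a real pattern. Below is a minimal sketch under some assumptions: the inline HTML is made-up sample markup standing in for the downloaded page, and HTML::TreeBuilder is used directly to build the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup standing in for the fetched page
my $html = <<'HTML';
<html><body>
<a href="http://www.stephenfry.com/blog/">Blog</a>
<a href="http://www.stephenfry.com/blog/">Blog again</a>
<a href="http://example.com/elsewhere">Elsewhere</a>
<a name="top">No href</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down matches an attribute against a qr// pattern;
# elements without an href attribute are skipped
my @anchors = $tree->look_down(
    _tag => 'a',
    href => qr{^http://www\.stephenfry\.com/},
);

# Pull out the href values and drop duplicates
my %seen;
my @links = grep { !$seen{$_}++ } map { $_->attr('href') } @anchors;

print "$_\n" for @links;

$tree->delete;    # free the tree explicitly
```

Note that the pattern only sees the href exactly as written in the page, so relative links such as /blog/ would not match it.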

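Using WWW::Mechanize is simpler, and it returns more links, because links() is not limited to <a> tags (it also picks up tags such as <link>):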

use strict;
use warnings;
use WWW::Mechanize;

my %seen;
my $mech = WWW::Mechanize->new();
$mech->get('http://www.stephenfry.com/');
# Escape the dots so they match literally; %seen drops duplicates
my @links = grep { /www\.stephenfry\.com/ and !$seen{$_}++ } map $_->url,
  $mech->links();

print $_, "\n" for @links;

Partial output:

http://www.stephenfry.com/wp-content/themes/fry/images/favicon.png
http://www.stephenfry.com/wp-content/themes/fry/style.css
http://www.stephenfry.com/wordpress/xmlrpc.php
http://www.stephenfry.com/feed/
http://www.stephenfry.com/comments/feed/
...
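
One caveat with the regex filter in both snippets: it matches the hostname anywhere in the string, and it misses relative links. A stricter variant might resolve each href against the page URL and compare hostnames, sketched here with the URI module. The @raw list is hypothetical sample data, not real output from the site.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.stephenfry.com/';

# Hypothetical raw href values, as they might appear in the page
my @raw = (
    '/blog/',
    'http://www.stephenfry.com/feed/',
    'http://www.stephenfry.com/feed/',
    'http://example.com/?ref=www.stephenfry.com',
    'mailto:someone@example.com',
);

my %seen;
my @links;
for my $href (@raw) {
    # Resolve relative links against the page URL
    my $uri = URI->new_abs( $href, $base );

    # Schemes such as mailto: have no host method, so skip them
    next unless $uri->can('host');

    # Compare the hostname itself rather than searching the whole string
    next unless lc( $uri->host ) eq 'www.stephenfry.com';

    push @links, "$uri" unless $seen{$uri}++;
}

print "$_\n" for @links;
```

With this sample data, the relative /blog/ link is kept in its absolute form, while the example.com URL that merely mentions the hostname in its query string is dropped.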

Hope this helps!
