perl - Using WWW::Mechanize to download text from multiple links

Question

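For a whole week I have been trying to write code that downloads the links from a web page, then loops through each link and dumps the content written on each link's page. The original web page I downloaded has 500 links to separate web pages, each containing information that is important to me. I only want to go one level down. However, I'm having a few problems.

Recap: I want to download the links from a web page and have my program automatically print the text contained in those links. Preferably, I would like them printed to a file.

1) When I download the links from the original website, the useful ones are not written out fully (i.e. they say "/festevents.nsf/all?openform", which is not a usable web page).

2) I have been unable to print the text content of the page. I have been able to print the font details, but that is useless.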

#Download all the modules I used#
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;
use WWW::Mechanize;
use Data::Dumper;

#Download original webpage and acquire 500+ Links#

$url = "http://wx.toronto.ca/festevents.nsf/all?openform";

my $mechanize = WWW::Mechanize->new(autocheck => 1);

$mechanize->get($url);

my $title = $mechanize->title;

print "<b>$title</b><br />";

my @links = $mechanize->links;

foreach my $link (@links) {

    # Retrieve the link URL
    my $href = $link->url_abs;

    #
    # $URL1= get("$link");
    #
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get($href);
    unless ($response->is_success) {
        die $response->status_line;
    }
    my $URL1 = $response->decoded_content;
    die Dumper($URL1);

    #This part of the code is just to "clean up" the text
    $Format = HTML::FormatText->new;
    $TreeBuilder = HTML::TreeBuilder->new;
    $TreeBuilder->parse($URL1);
    $Parsed = $Format->format($TreeBuilder);

    open(FILE, ">TorontoParties.txt");
    print FILE "$Parsed";
    close (FILE);

}

Please help me! I'm desperate! If possible, please explain the logic behind each step. I have been racking my brain over this for a week and need help seeing other people's logic behind the problem.

score 1 · Accepted Answer

You are working too hard. Study the WWW::Mechanize API and you will see that almost all of that functionality is already built in. Untested:

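use strictures;                 # enables strict and fatal warnings
use WWW::Mechanize qw();
use autodie qw(:all);           # open/close die on failure, no manual checks

# Open the output file once, before the loop, so each page is appended
# to the same handle instead of the file being overwritten per link.
open my $h, '>:encoding(UTF-8)', 'TorontoParties.txt';
my $mechanize = WWW::Mechanize->new;
$mechanize->get('http://wx.toronto.ca/festevents.nsf/all?openform');
# Keep only the links that point to individual event documents.
foreach my $link (
    $mechanize->find_all_links(url_regex => qr'/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument')
) {
    # url_abs resolves the relative href against the page it was found on.
    $mechanize->get($link->url_abs);
    # format => 'text' returns the page content with the HTML stripped.
    print {$h} $mechanize->content(format => 'text');
}
close $h;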
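
To make the link handling above concrete, here is a small offline sketch of what happens to each href. It assumes the URI module is installed (WWW::Mechanize depends on it). The two hrefs are hypothetical, with made-up 32-digit IDs, and the lowercase hex digits are taken from the untested pattern above rather than from the real site. `url_regex` is matched against the link as it is written in the page, and `url_abs` then resolves it against the page URL, which is roughly what `URI->new_abs` does here:

```perl
use strict;
use warnings;
use URI;

# The index page the links were found on.
my $base = 'http://wx.toronto.ca/festevents.nsf/all?openform';

# Hypothetical hrefs for illustration only; the hex IDs are invented.
my @hrefs = (
    '/festevents.nsf/all?openform',
    '/festevents.nsf/' . ('0' x 32) . '/' . ('a' x 32) . '?OpenDocument',
);

# Same pattern as in the answer: two 32-digit hex IDs, then ?OpenDocument.
my $event_re = qr{/festevents[.]nsf/[0-9a-f]{32}/[0-9a-f]{32}[?]OpenDocument};

# Filter on the raw href first, then turn each match into an absolute URL.
my @event_urls = map  { URI->new_abs($_, $base)->as_string }
                 grep { $_ =~ $event_re } @hrefs;

print "$_\n" for @event_urls;
```

Only the second href survives the filter, and it comes out with the scheme and host of the index page prepended. This is why a link that looks like "/festevents.nsf/..." in the page source is still usable once it has gone through `url_abs`.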
