perl - How can I use Perl's LWP::UserAgent to fetch the same URL with different query strings?

Question

I have a working LWP::UserAgent script that needs to be applied to the following URL:

http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503

It has to run against many similar targets, whose URLs end like this:

html?show_school=5503
html?show_school=9002
html?show_school=5512

I want to do this with LWP::UserAgent:

for my $i (0..10000) {
  $ua->get(' [here the URL should be applied] ', id => 21, extern_uid => $i);
  # process reply
}

In any case, I assume a loop like this is the right way to do such a job. LWP's API is not meant to replace core Perl features, so I think a plain Perl loop can be used to query multiple URLs.
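One thing worth noting about the loop above: extra key/value arguments passed to $ua->get after the URL are sent as HTTP header fields, not as query parameters, so the changing value has to go into the URL string itself. A minimal sketch of building the URLs, assuming the base URL and the three ids from the question; the helper name school_url is made up for illustration:

```perl
use strict;
use warnings;

# Base URL and parameter name as shown in the question.
my $base = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html';

# Hypothetical helper: returns the full URL for one school id.
sub school_url {
    my ($id) = @_;
    return sprintf('%s?show_school=%d', $base, $id);
}

# The three example ids from the question.
my @urls = map { school_url($_) } (5503, 9002, 5512);
print "$_\n" for @urls;
```

Each resulting string can then be handed to $ua->get as its only argument.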

The code, which does not run yet because the loop still has to be worked in:

#use strict;

use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools
my ($url = '[here the url should be applied] =',id);

for my $id (0..10000) {
  $ua->get(' [here the url should be applied ] ', id => 21, extern_uid => $i);
  # process reply
}  

#my $request = POST $url,
#                 [
#         Schulsuche=> "Ergebnisse anzeigen",
#         order => "schule_ort",
#         schulname => undef, 
#         schulort => undef, 
#         typid => "11",
#         verbinder => "AND"
#                 ];

my $ua = LWP::UserAgent->new;
print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get("[url]=> &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='contentitem']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}

Update, Sunday, October 25: I applied the advice from OmnipotentEntity.

#!/usr/bin/perl -W

use strict;
use warnings;         # give out some warnings if something does not run well
use diagnostics;      # tell me when something is wrong 
use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools

my $ua = LWP::UserAgent->new;

$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); 

#pretending to be firefox on linux.

for my $i (0..10000) {
  my $request = HTTP::Request->new(GET => sprintf(" here to put the URL into =%d", $i));
  $request->header('Accept' => 'text/html');
  my $response = $ua->request($request);
  if ($response->is_success) {
    $pagecontent = $response -> content;
  }
# now we can do whatever with the $pagecontent

}
my $request = POST $url,
[
          order => "schule_ort",
          schulname => undef, 
          Basisdaten => undef,        
          Profil  => undef, 
          Schulort => undef, 
          typid => "11",
          Fax  => 
          Homepage  => undef, 
          verbinder => "AND"

];

print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get(" here to put the URL into => &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='floatbox']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}

I want to loop over the results, so I tried to apply the corresponding URL, but I got a bunch of errors:

suse-linux:/usr/perl # perl perl_mecha_example_two.pl
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors (#1)
    (F) You've said "use strict" or "use strict vars", which indicates
    that all variables must either be lexically scoped (using "my" or "state"),
    declared beforehand using "our", or explicitly qualified to say
    which package the global variable is in (using "::").

Uncaught exception from user code:
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors.
 at perl_mecha_example_two.pl line 86

Now for the debugging part. What do I need to change? How do I apply the URL the right way?

With strict enabled, you cannot use a variable before declaring it. The usual fix is to prepend my at the first occurrence, for example my $url and my $pagecontent.
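A small sketch of what that looks like in a loop, assuming the page content is needed after the success check. The fake_fetch helper is a made-up stand-in for the real $ua->request call so the example runs without network access:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real fetch; odd pages "succeed", even pages "fail".
sub fake_fetch {
    my ($i) = @_;
    return $i % 2 ? "content of page $i" : undef;
}

my @pages;
for my $i (0 .. 3) {
    # Declared with "my" at first use, so strict accepts it and it stays
    # visible for the rest of this loop iteration.
    my $pagecontent = fake_fetch($i);
    next unless defined $pagecontent;
    push @pages, $pagecontent;
}

print scalar(@pages), " pages kept\n";
```

The same applies to $url in the updated script: it needs something like my $url = '...'; before the POST request that uses it.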

score 4 · Accepted Answer

It's as simple as this:

#!/usr/bin/perl -W

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Pretend to be Firefox on Linux.
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");
for my $i (0..10000) {
  my $req = HTTP::Request->new(GET => sprintf("http://path/to/url?=%d", $i));
  $req->header('Accept' => 'text/html');
  my $res = $ua->request($req);
  next unless $res->is_success;   # skip pages that could not be fetched
  my $pagecontent = $res->content;
  # Do whatever with the $pagecontent
}

This assumes you want to fetch every page in that range. If you only want specific ones, put those numbers into an array and walk that array instead of 0..10000.
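Walking an array of specific ids could look roughly like the following. This is only a sketch: the function name fetch_schools and the show_school parameter layout are assumptions based on the URLs in the question, and the function is only defined here, not called, so nothing is fetched until you invoke it yourself.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetches one page per id and returns a hash reference
# mapping id => page content for the requests that succeeded.
sub fetch_schools {
    my ($ua, $base, @ids) = @_;
    my %content;
    for my $id (@ids) {
        my $res = $ua->get("$base?show_school=$id");
        if ($res->is_success) {
            $content{$id} = $res->decoded_content;
        }
        else {
            # Report the failure and carry on with the next id.
            warn "id $id failed: ", $res->status_line, "\n";
        }
    }
    return \%content;
}
```

It would be called as fetch_schools(LWP::UserAgent->new, $base_url, 5503, 9002, 5512), where $base_url is the part of the address before the question mark.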
