perl - Perl UTF8 encoding error. Neither LWP::UserAgent->decoded_content nor Encode::decode works. Any other ideas?

Question

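I'm running into an encoding problem in Perl when trying to pull international addresses back from web pages, using both LWP::UserAgent and Encode to handle the character encoding. I've tried the solutions I found by googling, but nothing seems to work. I'm using Strawberry Perl 5.12.3.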

As an example, take the address page of the US embassy in the Czech Republic (http://prague.usembassy.gov/contact.html). All I want is to pull back the address:

Address: Tržiště 15 118 01 Praha 1 - Malá Strana Czech Republic

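Firefox displays this correctly using the UTF-8 character encoding, which matches the charset in the web page's header. But when I try to pull it back with Perl and write it to a file, the encoding comes out garbled, even though I'm using decoded_content from LWP::UserAgent or Encode::decode.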

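I've tried running a regex over the data to check that the error isn't introduced when the data is printed (i.e. that the data is correct internally in Perl), but the error seems to lie in how Perl handles the encoding.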

Here is my code:

#!/usr/bin/perl

require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";

$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for #-digit character in the strings (to guard against the error coming in the print statement) 
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;

And the output to the terminal is:

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
Tr├à┬╛išt├ä┬¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
ENCODE::DECODED CONTENT::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DOUBLE-DECODED CONTENT::Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

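And to the file (note that this differs slightly from the terminal output, but is still not correct). OK WOW - this displays correctly on Stack Overflow, but not in Bluefish, LibreOffice, Excel, Word, or anything else on my computer. So the data is not being encoded correctly. I really don't know what's going on.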

CONTENT TYPE: UTF-8
UNDECODED CONTENT::
TrÅ¾ištÄ 15
118 01 Praha 1 - Malá Strana
Czech Republic
DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic
Malá Strana Czech Republic
DOUBLE-DECODED CONTENT::Tržiště 15
118 01 Praha 1 - Malá StranaCzech Republic
AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR
AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR

I'd really appreciate any pointers on how this can be done.

Thanks, Ian/Montecristo

score 5 · Accepted Answer

The mistake is using a regex to parse HTML. At the very least, you are missing the step of decoding HTML entities. You can do that by hand, or leave it to a robust parser.

use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
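For the do-it-by-hand route, a minimal offline sketch may help show what the entity-decoding step changes. The HTML snippet below is a hypothetical stand-in for the embassy page (the exact entities the real page used are an assumption), and the regex is the one from the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Encode characters as UTF-8 on the way out to the terminal or a redirect.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample markup; the real page may use different entities.
my $html = '<p><b>Address</b>:<br />Tr&#382;i&scaron;t&#283; 15<br />'
         . '118 01 Praha 1 - Mal&aacute; Strana<br />Czech Republic</p>';

# Same pattern as in the question: capture everything after "Address".
my ($raw) = $html =~ m{<p><b>Address(.*?)</p>};

# In non-void context decode_entities returns the decoded copy.
my $address = decode_entities($raw);

print "RAW:     $raw\n";      # still contains &#382; and friends
print "DECODED: $address\n";  # entities replaced by real characters
```

The ampersand check in the question fires on the raw capture because the entities are still present; after decoding, the ampersands from those entities are gone.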

score 2 · Accepted Answer

#!/usr/bin/env perl

use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open     qw(:std :utf8);

use LWP::Simple;
use HTML::Entities;

my $content = get 'http://prague.usembassy.gov/contact.html';

my ($address) = ($content =~  m{<p><b>Address(.*?)</p>});
decode_entities($address);

say $address;

From the command line:

C:\temp> uu > tt.txt

C:\temp> gvim tt.txt

The following text is displayed in GVim (in UTF-8 mode):

</b>:<br />Tržiště 15<br />118 01 Praha 1 - Malá Strana<br />Czech Republic

See also Tom Christiansen's standard preamble.
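The terminal output in the question looks like what you would get if correct UTF-8 bytes were rendered by a console using a legacy code page, which would explain why the same data looks fine in GVim. The sketch below is one plausible reading, assuming the console uses code page 437 (the question does not say which code page was active). It reproduces both garbled forms from the question using only the core Encode module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# U+017E (z with caron) and U+011B (e with caron), as in "Tržiště".
my $chars      = "\x{17E}\x{11B}";
my $utf8_bytes = encode('UTF-8', $chars);   # bytes C5 BE C4 9B

# Correct UTF-8 bytes, as a cp437 console would draw them.
# Assumption: the console code page is 437.
my $console_view = decode('cp437', $utf8_bytes);

# Undecoded bytes pushed through a :utf8 layer are treated as Latin-1
# characters and encoded a second time (double encoding).
my $double_bytes = encode('UTF-8', decode('ISO-8859-1', $utf8_bytes));
my $console_double = decode('cp437', $double_bytes);

print "decoded, then printed:   $console_view\n";
print "undecoded, then printed: $console_double\n";
```

If this reading is right, the decoded content was already correct when it left Perl, and only the console's interpretation of the bytes was off.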
