perl - Why can't I use the map function to create a good hash from a simple data file in Perl?

Question

The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!

Here's the minimized code to exhibit my problem:

The input data file for the test has been saved by Windows' built-in Notepad in UTF-8 encoding. It has the following three lines:

abacus  æbәkәs
abalone æbәlәuni
abandon әbændәn

The Perl script file has also been saved by Windows' built-in Notepad in UTF-8 encoding. It contains the following code:

#!perl -w

use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";

In the output, the hash table seems to be okay:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

But it is actually not, because I only get two values instead of three:

æbәlәuni
әbændәn

Perl gives the following warning message:

Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.

Where's the problem? Can someone kindly explain? Thanks.

The Solution

Millions of thanks to all of you guys :) Now the culprit is finally found and the problem is fixable :) As @Sinan insightfully pointed out, I'm now 100% sure that the culprit is the three-byte BOM (EF BB BF), which Notepad added to my data file when it was saved as UTF-8 and which Perl somehow does not treat specially. Although many suggested that I should use "<:utf8" and ">:utf8" to read and write the files, these UTF-8 layers do not solve the problem by themselves. Instead, they may cause some other problems.

To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:

#!perl -w

use Data::Dumper;
use strict;
use autodie;

open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

Now, the output is exactly what I expected:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };
æbәkәs
æbәlәuni
әbændәn

Please note the script is saved as UTF-8 encoding and the code does not have to include any utf-8 labels because the input file and the output file are both pre-saved as UTF-8 encoding.
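The seek above skips three bytes unconditionally, so it would also eat the first three bytes of a file that was saved without a BOM. Below is a minimal sketch of a more defensive variant, assuming the file is read in byte mode as in the script above. The name skip_utf8_bom is a hypothetical helper, not something from the original post, and the temporary file only stands in for hash_test.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original post): leave a byte-mode read
# handle positioned just past a UTF-8 BOM if one is present, otherwise at
# the start of the file. Returns 1 if a BOM was skipped, 0 if not.
sub skip_utf8_bom {
    my ($fh) = @_;
    seek $fh, 0, 0 or die "seek failed: $!";
    my $n = read $fh, my $head, 3;
    die "read failed: $!" unless defined $n;
    return 1 if $n == 3 && $head eq "\xEF\xBB\xBF";   # BOM bytes EF BB BF
    seek $fh, 0, 0 or die "seek failed: $!";          # no BOM: rewind
    return 0;
}

# Stand-in for hash_test.txt: a BOM plus two tab-separated lines, as raw bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF", "abacus\tfirst\n", "abalone\tsecond\n";
close $tmp or die "close failed: $!";

open my $in, '<', $path or die "Cannot open $path: $!";
my $had_bom = skip_utf8_bom($in);
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
close $in;

print "BOM skipped: $had_bom\n";
print "abacus => $hash{abacus}\n";
```

The same helper leaves a BOM-less file untouched, so the script does not need to know which editor produced the data.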

Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.

Note: To clarify a little more, if I use:

open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

The output is this:

$VAR1 = {
          'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
          'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
          "\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
        };
æbәlәuni
әbændәn

And the warning message:

Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.

score 7 · Accepted Answer

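I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3, when it should be at line 4 after you have read the last line.

When I tried your code, I saved the input file using GVim, which is configured to save as UTF-8 on my system, and I did not see the problem. Now that I have tried it with Notepad, looking at the output file, I see: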

"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"

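where \x{feff} is the BOM.

In your Dumper output, there is a spurious blank before abacus (in the run where you had not specified :utf8 for the output handle).

As I mentioned originally (lost among the many edits to this post; thanks to hobbs for the reminder), specify '<:utf8' when you open the input file.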
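With a decoding layer, the BOM no longer arrives as three bytes but as the single character U+FEFF at the start of the first line, which is why it ends up inside the first key. Below is a minimal sketch, assuming tab-separated input, that decodes with :encoding(UTF-8) and strips that character before building the hash. The temporary file is only a stand-in for hash_test.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for hash_test.txt: BOM + two tab-separated lines, written as raw
# UTF-8 bytes so the example does not depend on the script's own encoding.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF",
             "abacus\t\xC3\xA6b\xD3\x99k\xD3\x99s\n",
             "abandon\t\xD3\x99b\xC3\xA6nd\xD3\x99n\n";
close $tmp or die "close failed: $!";

open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in;

# After decoding, a leading BOM is the single character U+FEFF.
$lines[0] =~ s/\A\x{FEFF}// if @lines;

my %hash = map { chomp; split /\t/, $_, 2 } @lines;

binmode STDOUT, ':encoding(UTF-8)';
print "abacus => $hash{abacus}\n";
```

Reading all lines first keeps the BOM handling separate from the map, at the cost of holding the file in memory, which is fine for a small word list like this one.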

score 2 · Accepted Answer

If you want to read and write UTF-8 files, you should make sure that you are actually reading them in as UTF-8.

#! /usr/bin/env perl
use Data::Dumper;
open my $in,  '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";

my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";

If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8 for reading the file.

open my $in, '<:encoding(utf8)', "hash_test.txt";

For more information, read PerlIO.

score 1 · Accepted Answer

I think your answer may be right in front of you. The Data::Dumper output you posted is:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

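Notice the character between ' and abacus? You tried to access the third value via $hash{abacus}. This is not correct, because the Dumper() output shows that character in front of abacus in the hash. You can try plugging it into a loop, which should take care of it: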

foreach my $k (keys %hash) {
  print $out $hash{$k};
}
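Printing only the values hides the stray character just as well as the terminal does. Below is a minimal sketch of making it visible instead. show_invisible is a hypothetical helper that escapes everything outside printable ASCII, and the hash contents are made-up stand-ins for the data in the question (the first key carries the three BOM bytes, as it would after a byte-mode read).

```perl
use strict;
use warnings;

# Hypothetical helper: render every character outside printable ASCII
# as a \x{..} escape so that invisible characters show up in the output.
sub show_invisible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x{%x}', ord $1)/ge;
    return $s;
}

# Made-up stand-in data: the first key starts with the BOM bytes EF BB BF.
my %hash = (
    "\xEF\xBB\xBFabacus" => 'value 1',
    'abalone'            => 'value 2',
);

print show_invisible($_), "\n" for sort keys %hash;
# prints:
#   abalone
#   \x{ef}\x{bb}\x{bf}abacus
```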

score 0 · Accepted Answer

Use split/\s/ instead of split/\t/.

answered 2009-11-19T12:42:03.773

score -1 · Accepted Answer

Works for me. Are you sure your example matches your actual code and data?
