perl - Parsing a large file in Perl

Question

I need to compare a large file (2GB) containing 22 million lines with another file. Processing took too long with Tie::File, so I switched to a plain "while" loop, but the problem remains. See my code below...

use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';

# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);
open(RE,">>res.txt");

open(IN,"<input.txt");

while(my $data=<IN>){
    chomp($data);
    print"$data\n";
    my $occ=0;

    open(IT,"<title_Nov19.txt");    
    while(my $line2=<IT>){

        my $line=$line2;
        chomp($line);

        if($line=~m/\b$data\b/is){

            $occ++;

        }

    }
print RE"$data\t$occ\n";
}


close(IT);
close(IN);
close(RE);

So please help me reduce the processing time...

score 2 · Accepted Answer

There are a number of problems with this.

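Aside from the usual ones (lack of use strict and use warnings, use of 2-argument open(), not checking the result of open(), use of global filehandles), the specific problem in your case is that you open, read, and close the second file once for every single line of the first. This is going to be very slow.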
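For reference, the idioms listed above look roughly like this. It is only a sketch: the temporary file is a stand-in for title_Nov19.txt so the snippet runs on its own, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for title_Nov19.txt, so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "This is the foo title\nWow, we have bar too\n";
close $tmp_fh or die "close $path: $!";

# Three-argument open, a lexical filehandle, and a checked result.
open my $fh, '<', $path or die "Cannot open $path: $!";
chomp(my @titles = <$fh>);
close $fh;

printf "%d lines read\n", scalar @titles;
```

The lexical $fh is closed automatically when it goes out of scope, and the die message includes $! so a missing file is reported instead of silently producing empty results.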

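I suggest you open the file title_Nov19.txt once, read all of its lines into an array or hash or similar, and then close it. Then you can open the first file, input.txt, and walk through it once, comparing against what is in the array, so that you don't have to reopen the second file all the time.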
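One way to do the "read into a hash" part is sketched below. It assumes every search term is a single word made of \w characters, so each title line can be split into words and looked up directly instead of running one regex per term. Note the swap relative to the paragraph above: here the term list (the smaller file) is what gets cached and the titles are what gets walked, which I chose with memory in mind for a 2GB file. The sample data is inlined so the snippet runs without the real files.

```perl
use strict;
use warnings;

# Inlined sample data; in the real script these would come from
# input.txt and title_Nov19.txt respectively.
my @terms  = ('foo', 'bar');
my @titles = (
    'This is the foo title',
    'Wow, we have bar too',
    'Nothing special here but foo',
    'OMG, the last title! And Foo again!',
);

# Hash keyed by lower-cased term, so each lookup is constant time.
my %count;
$count{ lc $_ } = 0 for @terms;

for my $title (@titles) {
    my %seen;    # terms already counted on this line
    for my $word ($title =~ /\w+/g) {
        my $key = lc $word;
        next unless exists $count{$key};
        $count{$key}++ unless $seen{$key}++;
    }
}

printf "%s\t%d\n", $_, $count{ lc $_ } for @terms;
```

With 22 million lines, the work per line then depends on the number of words in the line rather than on the number of search terms. If the terms can contain spaces or punctuation, this shortcut no longer applies and a regex is needed.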

Further, I suggest you read some basic articles on style and the like, since your question is likely to get more attention if it is written to vaguely modern standards.

score 0 · Accepted Answer

Here's another option that uses memowe's data (thank you):

use strict;
use warnings;
use File::Slurp qw/read_file write_file/;

my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';

for ( read_file 'title_Nov19.txt' ) {
    my %seen;
    !$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}

write_file 'res.txt', map "$_\t$count{$_}\n",
  sort { $count{$b} <=> $count{$a} } keys %count;

Numerically sorted output in res.txt:

foo 3
bar 1

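An alternation regex is built from the input words, with metacharacters quoted (\Q$_\E), so only one pass over the lines of the large file is needed. The hash %seen is used to make sure that each input word is counted only once per line.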
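To make the two ideas above concrete, here is a small stand-alone sketch of the same technique on in-memory data. The word a.b and the last two sample lines are made up purely to show the quoting and the once-per-line rule; quotemeta is used in place of \Q...\E, which does the same escaping.

```perl
use strict;
use warnings;

my @words  = ('foo', 'bar', 'a.b');    # 'a.b' is a made-up term with a metacharacter
my @titles = (
    'This is the foo title',
    'Wow, we have bar too',
    'Foo and foo again',               # two hits on one line
    'axb is not a.b',                  # only the literal a.b should match
);

# One alternation for all words; each word is escaped first.
my $alternation = join '|', map { quotemeta($_) } @words;
my $regex       = qr/\b($alternation)\b/i;

my %count;
for my $title (@titles) {
    my %seen;    # words already counted on this line
    while ($title =~ /$regex/g) {
        my $word = lc $1;
        $count{$word}++ unless $seen{$word}++;
    }
}

print "$_\t$count{$_}\n" for sort keys %count;
```

Here foo ends up with a count of 2 rather than 3, because the third line contributes only once, and axb is not counted as a.b because the dot is escaped.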

Hope this helps!

score 0 · Accepted Answer

I tried to build a small example script with a better structure but I have to say, man, your problem description is really very unclear. It's important to not read the whole comparison file each time as @LeoNerd explained in his answer. Then I use a hash to keep track of the match count:

#!/usr/bin/env perl

use strict;
use warnings;

# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;

# prepare comparison
open my $input,  '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();

# compare each line
while (my $title = <$input>) {
    chomp $title;

    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}

# done
close $input;

# output (in the order of input.txt)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    # terms that never matched have no hash entry, so default to 0
    print $output "$comp\t", $count{$comp} // 0, "\n";
}
close $output;

Just to get you started... If someone wants to further work on this: these were my test files:

title_Nov19.txt

This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!

input.txt

foo
bar

And the result of the program was written to res.txt:

foo 3
bar 1
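A possible refinement of the script above, if anyone wants to pick it up: the inner loop interpolates $comp into the pattern for every title line, so Perl has to rebuild the regex each time the term changes. A sketch that compiles each pattern once with qr// follows. Note that it also quotes the terms with \Q...\E, which is my assumption that the search terms are literal strings and not regexes; the sample data is inlined from the test files above.

```perl
use strict;
use warnings;

my @comparison = ('foo', 'bar');
my @titles     = (
    'This is the foo title',
    'Wow, we have bar too',
    'Nothing special here but foo',
    'OMG, the last title! And Foo again!',
);

# Compile one case-insensitive, word-bounded pattern per term, up front.
my (%pattern, %count);
for my $comp (@comparison) {
    $pattern{$comp} = qr/\b\Q$comp\E\b/i;
    $count{$comp}   = 0;
}

for my $title (@titles) {
    for my $comp (@comparison) {
        $count{$comp}++ if $title =~ $pattern{$comp};
    }
}

printf "%s\t%d\n", $_, $count{$_} for @comparison;
```

Initialising every count to 0 also means terms that never match are printed with a 0 instead of triggering an "uninitialized value" warning.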

score 0 · Accepted Answer

Try this:

grep -i -c -w -f input.txt title_Nov19.txt > res.txt

answered 2018-03-13T01:21:01.797
