
I have a list of 1.6 million lines that looks like this:

N123HN  /var/foo/bar/baz/A/Alpha.file.1234.bin
N123HN  /var/foo/bar/baz/A/Alpha.file.1235.bin
N123KL  /var/foo/bar/baz/A/Alpha.file.1236.bin

I have a Perl script that basically greps this data on the second column as a way of looking up the value in the first column (it then does other magic with the "N123HN" value, etc.). As it stands, my app spends about 4 minutes ingesting the file and loading it into a huge hash. The grep-like lookups themselves are slow for obvious reasons, but the slowest part of running this script is this huge data ingest on every run.

Anyone have any clever ideas how to access this data more quickly? Since it is just a list of two columns, a relational database seems pretty heavyweight for this use case.

I'm re-editing the original question here since pasting source code into the comment boxes is pretty ugly.

The algorithm I'm using to ingest the huge file is this:

while (<HUGEFILE>) {
    # hugefile format:
    # nln N123HN ---- 1 0 1c44f5.4a6ee12 17671854355 /var/foo/bar/baz/A/Alpha.file.1234.bin 0

    next if /^\s*$/;                        # skip blank lines
    chomp;                                  # remove trailing newline
    my @auditrows = split;                  # one row, split on whitespace
    my $file_url  = $auditrows[7];          # /var/foo/bar/baz/A/Alpha.file.1234.bin
    my $tapenum   = $auditrows[1];          # N123HN
    $tapenumbers{$file_url} = $tapenum;     # key   = "/var/foo/bar/baz/A/Alpha.file.1234.bin"
}                                           # value = "N123HN"
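Since the ingest itself is the expensive part, one lighter-weight option than a relational database might be to build this hash once into an on-disk DBM file and simply tie it on later runs. Below is a rough, untested sketch using the DB_File module (it assumes Berkeley DB is available; the tapenumbers.db and hugefile names are placeholders):

#!/usr/bin/perl
use strict;
use warnings;
use DB_File;    # ties a hash to an on-disk Berkeley DB file
use Fcntl;      # O_CREAT, O_RDWR

# Build (or reuse) an on-disk hash keyed by file path.
tie my %tapenumbers, 'DB_File', 'tapenumbers.db', O_CREAT | O_RDWR, 0644, $DB_HASH
    or die "Cannot tie tapenumbers.db: $!";

open my $huge, '<', 'hugefile' or die "Cannot open hugefile: $!";
while (<$huge>) {
    next if /^\s*$/;
    chomp;
    my @auditrows = split;
    $tapenumbers{ $auditrows[7] } = $auditrows[1];    # path => tape number
}
close $huge;
untie %tapenumbers;

# Later runs can skip the ingest entirely: tie the existing file
# read-only and look paths up directly.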

2 Answers


4 minutes?!?! It takes 7 seconds!!!

$ perl -E'say "N${_}HN  /var/foo/bar/baz/A/Alpha.file.$_.bin" for 1..1_600_000;' >file

$ time perl -E'my %h; while (<>) { my ($v,$k) = split; $h{$k}=$v; }' file

real    0m7.620s
user    0m7.081s
sys     0m0.249s

Could it be that you don't have enough memory and are swapping?

answered 2013-01-29T03:27:30.793

Have you tried using a hash with the second column as the key and the first column as the value? Then you can iterate over your 200 or so file paths and look each one up directly in the hash. That will probably be much faster than using grep-like functions. Here's a quick script to load the data:

#!/usr/bin/perl
use strict;
use warnings;

my %data;
open(my $fh, '<', 'data') or die "Cannot open data: $!";
while (<$fh>) {
    my ($k, $path) = split;          # first column = value, second column = key
    push @{ $data{$path} }, $k;      # allow multiple tape numbers per path, just in case
}
print "loaded data: ", scalar(%data), "\n";

My Perl is pretty rusty, but it runs very quickly on my laptop with a 1.6-million-line input file:

pa-mac-w80475xjagw% head -5 data
N274YQ  /var/foo/bar/baz/GODEBSVT/Alpha.file.9824.bin
N602IX  /var/foo/bar/baz/UISACEXK/Alpha.file.5675.bin
N116CH  /var/foo/bar/baz/GKUQAYWF/Alpha.file.7146.bin
N620AK  /var/foo/bar/baz/DHYRCLUD/Alpha.file.2130.bin
N716YD  /var/foo/bar/baz/NYMSJLHU/Alpha.file.2343.bin
pa-mac-w80475xjagw% wc -l data
 1600000 data
pa-mac-w80475xjagw% /usr/bin/time -l ./parse.pl
loaded data: 1118898/2097152
        5.54 real         5.18 user         0.36 sys
 488919040  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
    119627  page reclaims
         1  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
        30  involuntary context switches
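Once the hash is loaded, looking up your couple hundred paths is a direct hash access rather than a grep. A minimal sketch of the lookup (the @paths list is illustrative, and the array dereference matches the push-based load above):

# Assuming %data has been loaded as above:
my @paths = (
    '/var/foo/bar/baz/A/Alpha.file.1234.bin',
    '/var/foo/bar/baz/A/Alpha.file.1235.bin',
);

for my $path (@paths) {
    if (my $tapes = $data{$path}) {
        print "$path -> @$tapes\n";    # usually a single tape number, e.g. N123HN
    } else {
        print "$path -> not found\n";
    }
}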
answered 2013-01-29T03:46:20.113