performance - Perl: Most efficient way to calculate a percentile

Question

I have a Perl script that goes through several gigabytes' worth of files and generates a report.

To calculate the percentile, I do the following:

my @values;
while (my $line = <INPUTFILE>){
    .....
    push(@values, $line);

}
# Sort
@values = sort {$a <=> $b} @values; 

# Print 95% percentile
print $values[sprintf("%.0f",(0.95*($#values)))];

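This obviously stores all the values in an array up front and only then calculates the percentile, which can be heavy on memory (assume millions of values). Is there a more memory-efficient way to do this?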

score 3 · Accepted Answer

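You can process the file twice. The first pass only counts the lines ($.). From that count you can work out the size of a sliding window that keeps only the largest values needed to find the percentile (for percentiles below 50, you should invert the logic and keep the smallest values instead).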

#!/usr/bin/perl
use warnings;
use strict;

my $percentile = 95;

my $file = shift;
open my $IN, '<', $file or die $!;

1 while <$IN>;             # Just count the number of lines.
my $line_count = $.;
seek $IN, 0, 0;            # Rewind.

# Calculate the size of the sliding window.
my $remember_count = 1 + (100 - $percentile) * $line_count / 100;

# Initialize the window with the first lines.
my @window = sort { $a <=> $b }
             map scalar <$IN>,
             1 .. $remember_count;
chomp @window;

while (<$IN>) {
    chomp;
    next if $_ < $window[0];
    shift @window;
    my $i = 0;
    $i++ while $i <= $#window and $window[$i] <= $_;
    splice @window, $i, 0, $_;
}
print "$window[0]\n";
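
The remark about inverting the logic for percentiles below 50 can be sketched as follows. This is only an illustration, not part of the answer above: the sub name percentile_two_pass is hypothetical, the handle is assumed to be seekable with one number per line, and the target index uses the same sprintf rounding as the code in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer).
# Assumes $fh is seekable and holds one number per line.
sub percentile_two_pass {
    my ($fh, $percentile) = @_;

    # Pass 1: count the lines, starting from the beginning.
    seek $fh, 0, 0 or die "Cannot rewind: $!";
    my $n = 0;
    $n++ while <$fh>;
    return unless $n;
    seek $fh, 0, 0 or die "Cannot rewind: $!";

    # Index of the wanted element in the fully sorted data,
    # rounded the same way as in the question's code.
    my $idx = sprintf '%.0f', $percentile / 100 * ($n - 1);

    # Keep whichever tail of the sorted data is shorter.
    my $keep_largest = $idx >= $n / 2;
    my $size = $keep_largest ? $n - $idx : $idx + 1;

    # Pass 2: @window stays sorted ascending, at most $size elements.
    my @window;
    while (my $v = <$fh>) {
        chomp $v;
        if (@window == $size) {
            if ($keep_largest) {
                next if $v <= $window[0];     # too small to matter
                shift @window;                # drop current smallest
            }
            else {
                next if $v >= $window[-1];    # too large to matter
                pop @window;                  # drop current largest
            }
        }
        # Linear scan for the insertion point, as in the answer's loop.
        my $i = 0;
        $i++ while $i < @window and $window[$i] <= $v;
        splice @window, $i, 0, $v;
    }

    return $keep_largest ? $window[0] : $window[-1];
}
```

With a handle opened as in the script above, print percentile_two_pass($IN, 5), "\n"; would report the 5th percentile while holding only about 5% of the values. The worst case is the median, where the window still holds roughly half of the input.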
