perl - Find common keys across multiple files, store the differing values in an array, and compute the difference

Question

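I am very new to Perl, and there is a task I would like to accomplish with it:

I have many files like the following (space-delimited, each with 6 columns and thousands of rows; all of the files end in *.hgt):

For example, one .hgt file: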

ID     NAMES           Test1       Test2       Percentage       Height
1      abc100123        A            B          0.21            165
1      abc400123        A            B          0.99            162
1      abc300123        C            B          0.107           165
1      abc200123        A            E          0.31            167
1      abc500123        A            B          0.7             165
....

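Each NAMES value is unique within each .hgt file. I want to find the names common to all of the .hgt files, extract all of their percentages, and find the largest difference, i.e. the maximum minus the minimum.

For example, if there are 5 .hgt files, all of which contain NAMES = abc300123, with corresponding percentages of 0.107, 0.1, 0.4, 0.9 and 0.8, then the largest difference for abc300123 is 0.9 - 0.1 = 0.8.

Then I would like to print each name computed across all the files together with the largest difference associated with it. The output should be sorted by largest difference, and each line should be preceded by an integer (0, 1, 2, 3, ...). An example would look like this:

Output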

0. abc500123 0.1
1. abc900123 0.3
2. abc100123 0.7
3. abc300123 0.8
4. abc110123 0.9
....

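I tried reading in each file and storing key = NAMES and value = Percentage in an array. I would like to sort the Percentage array, store the maximum and minimum in a new array, and do the subtraction. At some point I got stuck and could not put things together.

Here is what I have written so far: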

open(PIPEFROM, "ls *.hgt |") or die "no \.hgt files founded\!\n";  ## find the files that are ended with hgt
$i=0;
@filenames = "";

while($temp = <PIPEFROM>){

    $temp =~ m/\.hgt/;
    print out "$temp";
    $pre = $`; #gives file name without the dot and the hgt extension
    $filenames[$i] = $pre;
    $i++;
} 


%hash = ();
$j=0;
## read in files ended with .hgt
for ($i = 0; $i<=$filenames; $i++) {
$temp = $filenames[$i];

open(PIPETO, "cat $temp.hgt |") or die "no \.hgt files founded\!\n";

<PIPETO>;
while ($temp2 = <PIPETO> ){
    chomp $temp2;
    $temp2 = ~ s/^\s+//;
    @lst = split(/\s+/, $temp2);
    $NAMES = $lst[1];
    $Percentage = $lst[4];
    $hash{$NAMES} .= $Percentage . " ";
}
}
### manipulate the values
foreach $key (sort keys %hash){

    @values = split(/\s+/, $hash{$key});
    if ($#values == $#filenames){
    print "$j" . "\." . " " . "$key" . "\n";
    $j++;
                         ### got stuck
}
}

I was thinking of working this into the solution, but I don't know where it should go:

my ($smallest, $largest) = (sort {$a <=> $b} @array)[0,-1];

This is very frustrating. Any kind help is greatly appreciated!

score 2 · Accepted Answer

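Building on Joseph Myers's reply, I made a few changes to answer your questions about how to keep only the data that occurs in every file, how to skip the header line (line #1 of each input file), and how to sort the output. It sorts from the largest percentage difference to the smallest, and by name where the differences are equal. The command line to run the program would be:

perl output.pl *.hgt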

my $file_count = @ARGV or die "invoke program as:\nperl $0 *.hgt\n";

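The shell expands *.hgt, so all of the file names land in the @ARGV array (rather than the files being piped through cat, as in his program). $file_count then records the number of files named in @ARGV. The while loop reads the files named in @ARGV, much as piping them through cat would.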
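To see the @ARGV and while (<>) mechanics in isolation, here is a minimal sketch. It assumes two tiny sample files created in a temporary directory; the file names a.hgt and b.hgt, their contents, and the @rows array are made up for illustration and are not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample data: two tiny files standing in for real *.hgt files.
my $dir    = tempdir(CLEANUP => 1);
my $header = "ID NAMES Test1 Test2 Percentage Height\n";
my %sample = (
    'a.hgt' => $header . "1 abc100123 A B 0.21 165\n",
    'b.hgt' => $header . "1 abc100123 A B 0.91 160\n",
);
for my $file (sort keys %sample) {
    open my $fh, '>', "$dir/$file" or die "cannot write $dir/$file: $!";
    print {$fh} $sample{$file};
    close $fh;
}

# Simulate "perl output.pl *.hgt": the shell would place the file names in @ARGV.
@ARGV = map { "$dir/$_" } sort keys %sample;
my $file_count = @ARGV;    # an array in scalar context yields its element count

my @rows;
while (<>) {               # reads every file named in @ARGV, one line at a time
    if ($. > 1) {          # line 1 of each file is the header, so skip it
        my ($name, $perc) = (split ' ')[1, 4];
        push @rows, "$name $perc";
    }
    close ARGV if eof;     # resets $. so the next file starts at line 1 again
}

print "files: $file_count\n";
print "$_\n" for @rows;
```

Without the close ARGV if eof line, $. would keep counting across files and only the very first header would be skipped.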

In the first for loop, a check is made to see whether the name was read in every file (if ($names{$name}{count} == $file_count)). If so, the percentage difference is computed; otherwise the name is deleted from the %names hash.
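The count check can be tried on its own with a small hand-built hash. This is only a sketch: the three entries and the value of $file_count are invented, and the records merely mimic the count/minmax shape that the program builds.

```perl
use strict;
use warnings;

my $file_count = 3;    # hypothetical: pretend three files were read

# Hypothetical per-name records with the same shape the program builds.
my %names = (
    abc100123 => { count => 3, minmax => [0.21, 0.91] },
    abc400123 => { count => 2, minmax => [0.50, 0.99] },   # absent from one file
    abc300123 => { count => 3, minmax => [0.10, 0.90] },
);

# keys() returns its list before the loop starts, so deleting inside is safe.
for my $name (keys %names) {
    if ($names{$name}{count} == $file_count) {
        my ($min, $max) = @{ $names{$name}{minmax} };
        $names{$name} = $max - $min;    # replace the record with the difference
    }
    else {
        delete $names{$name};           # name did not occur in every file
    }
}

printf "%s %.2f\n", $_, $names{$_} for sort keys %names;
```

After the loop, %names maps each surviving name straight to a number, which is what lets the later sort compare $names{$a} and $names{$b} directly.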

The last for loop prints the results using the custom sort by_percent_name.

#!/usr/bin/perl
use strict;
use warnings;

my $file_count = @ARGV or die "invoke program as:\nperl $0 *.hgt\n";

my %names;
while (<>) {
    next if $. == 1; # throw header out
    my ($name, $perc) = (split ' ')[1,4];
    $names{$name}{count}++;
    my $t = $names{$name}{minmax} ||= [1,0];
    $t->[0] = $perc if $perc < $t->[0];
    $t->[1] = $perc if $perc > $t->[1];
    close ARGV if eof; # reset line counter, '$.',  to 1 for next file
}

for my $name (keys %names) {
    if ($names{$name}{count} == $file_count) {
        $names{$name} = $names{$name}{minmax}[1] - $names{$name}{minmax}[0];
    }
    else {
        delete $names{$name};   
    }
}

my $i;
my $total = keys %names;
for my $name (sort by_percent_name keys %names) {
    printf "%*d. %s %.6f\n", length($total), ++$i, $name, $names{$name};
}

sub by_percent_name {
    $names{$b} <=> $names{$a}   || $a cmp $b
}
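Note that by_percent_name sorts from the largest difference down and the numbering starts at 1, whereas the sample output in the question runs from the smallest difference up, starting at 0. If that ordering is what is wanted, a comparator along these lines might do; the %diff data and the name by_diff_then_name are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical name => difference data, standing in for the final %names hash.
my %diff = (
    abc500123 => 0.1,
    abc900123 => 0.3,
    abc100123 => 0.7,
    abc110123 => 0.3,
);

# Ascending by difference; equal differences fall back to name order.
sub by_diff_then_name { $diff{$a} <=> $diff{$b} || $a cmp $b }

my @ordered = sort by_diff_then_name keys %diff;

my $n = 0;    # number the lines from 0, as in the question's sample output
printf "%d. %s %s\n", $n++, $_, $diff{$_} for @ordered;
```

Swapping $a and $b in the first comparison is all that separates ascending from descending order.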

score 1 · Accepted Answer

This program does exactly what you specified.

# output.pl
# save this entire script as output.pl
# obtain output by running this command:
#
#   cat *.hgt | perl output.pl | more
# (in order to scroll the results--press "q" in order to quit)
#
#   cat *.hgt | perl output.pl > results-largest-differences-output-$$.txt
# in order to create a temporary results file
#
# BE CAREFUL because the second command overwrites whatever is in
# the output file using the ">" operator!
my %names;
my $maxcount = `ls *.hgt | wc -l`;
my %counts;
while (<>) {
    my @fields = (m/(\S+)/g);
    my $name = $fields[1];
    my $perc = $fields[4];
    next if $perc =~ m/[^.\d]/;
    next unless $perc;
    my $t = ($names{$name} ||= [1, 0]);
    # initialize min to as high as possible and max to as low as possible
    $t->[0] = $perc if $perc < $t->[0];
    $t->[1] = $perc if $perc > $t->[1];
    $counts{$name}++; # n.b. undef is auto-initialized to 0 before ++
}

for (keys %names) {
    $names{$_} = $names{$_}->[1] - $names{$_}->[0];
}

my $n = 0;
for (sort { $names{$a} <=> $names{$b} || $a cmp $b } keys %names) {
    next unless $counts{$_} == $maxcount;
    $n++;
    printf("%6s %20s %.2f\n", $n, $_, $names{$_});
}
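Both programs seed the running minimum and maximum with [1, 0], which is only safe while every percentage lies between 0 and 1. A minimal sketch that drops that assumption collects the values per name and uses min and max from the core List::Util module. The %seen hash and its data are hypothetical, and the last line shows that the sort slice from the question gives the same pair.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical percentages collected for one name across five files.
my %seen = (abc300123 => [0.107, 0.1, 0.4, 0.9, 0.8]);

my %diff;
for my $name (keys %seen) {
    my @values = @{ $seen{$name} };
    $diff{$name} = max(@values) - min(@values);    # no range assumption needed
}

# The slice from the question yields the same pair, at the cost of a full sort.
my ($smallest, $largest) = (sort { $a <=> $b } @{ $seen{abc300123} })[0, -1];

printf "%s %.1f\n", 'abc300123', $diff{abc300123};
printf "smallest=%s largest=%s\n", $smallest, $largest;
```

Holding every value in memory costs more than tracking a running pair, so for very large inputs the running approach with an undefined initial state may be preferable.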
