csv - How do I count instances of a string in a tab-separated values file?

Question

How can I count the instances of a string in a tab-separated values (tsv) file?

The tsv file has hundreds of millions of lines, each of the form:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

How can I count the instances of each unique integer across the entire second column of the file and, ideally, append that count as a fifth value on each line?

foobar1  1  xxx   yyy  1
foobar1  2  xxx   yyy  2
foobar2  2  xxx   yyy  2
foobar2  3  xxx   yyy  2
foobar1  3  xxx   zzz  2

I would prefer a solution that uses only UNIX command-line stream-processing programs.

score 0 · Accepted Answer

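One solution using perl, assuming that the values of the second column are sorted; that is, once the value 2 is found, all lines with that same value are consecutive. The script holds lines until it finds a different value in the second column, then takes the count, prints them, and frees the memory. Memory use is therefore bounded by the largest run of equal values rather than by the size of the input file.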
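The contiguity assumption can be checked before running the script. A minimal sketch, assuming real tab separators and a POSIX awk; `is_grouped` is a hypothetical helper name, not part of the script in this answer:

```shell
# Hypothetical helper: succeeds (exit status 0) only if every value of
# column 2 forms a single contiguous run in the file given as $1.
# Assumes tab-separated input; keeps one array entry per distinct value.
is_grouped() {
    awk -F '\t' '
        NR == 1 || $2 != prev {
            if ($2 in seen) { bad = 1; exit }   # value came back after a gap
            seen[$2] = 1
            prev = $2
        }
        END { exit bad + 0 }
    ' "$1"
}
```

It could be used as `is_grouped infile && perl script.pl infile`, so the script only runs when its assumption holds.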

Content of script.pl:

use warnings;
use strict;

my (%lines, $count);

while ( <> ) { 

    ## Remove the trailing '\n'.
    chomp;

    ## Split the line on whitespace.
    my @f = split;

    ## Treat a line without exactly four fields as malformed and skip it.
    next unless @f == 4;

    ## Buffer lines in a hash until the second column changes. The hash
    ## only ever holds one key at a time: the value of the current group.
    ## When a line with a different value arrives, print the buffered lines
    ## with their count appended as last field, then clear the hash.
    if ( %lines and ! exists $lines{ $f[1] } ) {
        my ($key) = keys %lines;
        printf qq[%s\t%d\n], $_, $count for @{ $lines{ $key } };
        %lines = ();
        $count = 0;
    }

    ## Save the current line in its group and count it.
    push @{ $lines{ $f[1] } }, $_;
    ++$count;
}

## Print the last group, which no following line has flushed. Doing this
## after the loop keeps a final line with a new value from being lost.
if ( %lines ) {
    my ($key) = keys %lines;
    printf qq[%s\t%d\n], $_, $count for @{ $lines{ $key } };
}

Content of infile:

foobar1  1  xxx   yyy
foobar1  2  xxx   yyy
foobar2  2  xxx   yyy
foobar2  3  xxx   yyy
foobar1  3  xxx   zzz

Run it like this:

perl script.pl infile

With the following output:

foobar1  1  xxx   yyy   1
foobar1  2  xxx   yyy   2
foobar2  2  xxx   yyy   2
foobar2  3  xxx   yyy   2
foobar1  3  xxx   zzz   2
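If the second column is not grouped, a two-pass awk command is one alternative that stays within standard UNIX tools. This is a sketch, not part of the script above: it assumes real tab separators and a regular file that can be read twice (not a pipe), and `append_counts` is a hypothetical name. It keeps one counter per distinct value instead of buffering lines, and it preserves the original line order:

```shell
# Hypothetical helper: append to every line the number of times its
# column-2 value occurs in the whole file given as $1.
# The file is named twice so awk reads it in two passes.
append_counts() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { count[$2]++; next }   # pass 1: tally column 2
        { print $0, count[$2] }           # pass 2: emit line plus its count
    ' "$1" "$1"
}
```

It could be called as `append_counts infile > outfile`. The trade-off is reading the input twice in exchange for not requiring sorted input.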
