perl - How do I replace a specific word that occurs only once in a particular subset of the data?

Question

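Consider the dataset below. Each chunk that starts with a number is a "case". The real dataset has hundreds of thousands of cases. When a case contains the word Exclusion only once (e.g. case 10001), I want to replace the word "Exclusion" with "0".

By looping over the lines I can count how many "Exclusion"s each case has. But when only one line in a case contains the word "Exclusion", I don't know how to go back to that line and replace the word.

How can I do this?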

10001
M1|F1|SP1;12;12;12;11;13;10;Exclusion;D16S539
M1|F1|SP1;12;10;12;9;11;9;3.60;D16S
M1|F1|SP1;12;10;10;7;11;7;20.00;D7S
M1|F1|SP1;13;12;12;12;12;12;3.91;D13S
M1|F1|SP1;11;11;13;11;13;11;3.27;D5S
M1|F1|SP1;14;12;14;10;12;10;1.99;CSF
10002
M1|F1|SP1;8;13;13;8;8;12;2.91;D16S
M1|F1|SP1;13;11;13;10;10;10;4.13;D7S
M1|F1|SP1;12;9;12;10;11;16;Exclusion;D13S
M1|F1|SP1;12;10;12;10;14;15;Exclusion;D5S
M1|F1|SP1;13;10;10;10;17;18;Exclusion;CSF

score 2 · Accepted Answer

While reading the file, buffer all the lines of a case and count the exclusions.

my ($case,$buf,$count) = (undef,"",0);
while(my $ln = <>) {

Use a regular expression to detect the start of a case,

    if( $ln =~ /^\d+$/ ) {
        #new case, process/print old case
        $buf =~ s/;Exclusion;/;0;/ if($count==1);
        print $buf;
        ($case,$buf,$count) = ($ln,"",0);
    }

then use a regular expression to detect "Exclusion":

    elsif( $ln =~ /;Exclusion;/ ) { $count++; }
    $buf .= $ln;   # append the current line to the buffer
}

After the loop finishes, there may still be a case left to process.

if( length($buf)>0 ) {
    $buf =~ s/;Exclusion;/;0;/ if($count==1);
    print $buf;
}
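
Put together, the buffering approach above can be sketched as a single function and tried on a shortened copy of the sample data. This is only an illustration: the name replace_single_exclusion and the $flush helper are hypothetical, and the function takes the lines as a list instead of reading from <> so that it can be run without an input file.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the file, returns the rewritten text.
sub replace_single_exclusion {
    my @lines = @_;
    my ($buf, $count, $out) = ("", 0, "");

    # Emit the buffered case, replacing "Exclusion" only if it occurred once.
    my $flush = sub {
        $buf =~ s/;Exclusion;/;0;/ if $count == 1;
        $out .= $buf;
        ($buf, $count) = ("", 0);
    };

    for my $ln (@lines) {
        if ($ln =~ /^\d+$/) {
            $flush->();        # a case header ends the previous case
        }
        elsif ($ln =~ /;Exclusion;/) {
            $count++;
        }
        $buf .= $ln;
    }
    $flush->();                # the last case has no header after it
    return $out;
}

# Shortened copy of the sample data from the question.
my @sample = map { "$_\n" } (
    '10001',
    'M1|F1|SP1;12;12;12;11;13;10;Exclusion;D16S539',
    'M1|F1|SP1;12;10;12;9;11;9;3.60;D16S',
    '10002',
    'M1|F1|SP1;12;9;12;10;11;16;Exclusion;D13S',
    'M1|F1|SP1;12;10;12;10;14;15;Exclusion;D5S',
);

print replace_single_exclusion(@sample);
```

With this input, only the line in case 10001 should change; the two "Exclusion" lines in case 10002 are left as they are.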

score 1 · Accepted Answer

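There are already many correct answers here that use a buffer to store the contents of a "case".

Here is another solution that uses tell and seek to rewind the file, so no buffer is needed. This can be useful when a "case" is very large and you are sensitive to performance or memory usage.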

use strict;
use warnings;

open FILE, '<', "text.txt" or die "Cannot open text.txt: $!";
open REPLACE, '>', "replace.txt" or die "Cannot open replace.txt: $!";

my $count = 0;      # count of 'Exclusion' in the current case
my $position = 0;
my $prev_position = 0;
my $first_occur_position = 0;   # first occurrence of 'Exclusion' in the current case
my $visited = 0;    # whether the current line is visited before

while (<FILE>) {
    # keep track of the position before reading
    # the current line
    $prev_position = $position;
    $position = tell FILE;

    if ($visited == 0) {
        if (/^\d+/) {
            # new case
            if ($count == 1) {
                # rewind to the first occurrence
                # of 'Exclusion' in the previous case
                seek FILE, $first_occur_position, 0;
                $visited = 1;
            }
            else {
                print REPLACE $_;
            }
        }
        elsif (/Exclusion/) {
            $count++;
            if ($count > 1) {
                seek FILE, $first_occur_position, 0;
                $visited = 1;
            }
            elsif ($count == 1) {
                $first_occur_position = $prev_position;
            }
        }
        else {
            print REPLACE $_ if ($count == 0);
        }

        if (eof FILE && $count == 1) {
            seek FILE, $first_occur_position, 0;
            $visited = 1;
        }
    }
    else {
        if ($count == 1) {
            s/Exclusion/0/;
        }
        if (/^\d+/) {
            $position = tell FILE;
            $visited = 0;
            $count = 0;
        }
        print REPLACE $_;
    }
}

close REPLACE;
close FILE;
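
The tell/seek pattern the script relies on can be seen in isolation in the following minimal sketch. It assumes an in-memory filehandle opened on a scalar (so no file on disk is needed); the variable names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Sample text held in a scalar so the sketch needs no file on disk.
my $data = "10001\nfirst line\nsecond line\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $header = <$fh>;       # reads "10001\n"
my $mark   = tell $fh;    # byte offset just after the header line
my $first  = <$fh>;       # reads "first line\n"
my $second = <$fh>;       # reads "second line\n"

# Rewind to the remembered offset; whence 0 means "from the start of the file".
seek $fh, $mark, 0 or die "seek failed: $!";
my $again = <$fh>;        # reads "first line\n" a second time

close $fh;
print "offset=$mark reread=$again";
```

Remembering an offset with tell and returning to it with seek is what lets the script re-read a case from its first "Exclusion" line instead of keeping the lines in memory.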

score 1 · Accepted Answer

This is the best I can come up with. Assume you have read the file into @lines.

# separate into blocks
my (%block, $key);   # case number => array of that case's lines
foreach my $line (@lines) {
    chomp($line);
    if ($line =~ m/^(\d+)/) {
        $key = $1;
    }
    else {
        push (@{$block{$key}}, $line);
    }
}

# go through each block
# (keys returns hash keys in no particular order, so sort the case numbers)
foreach my $key (sort { $a <=> $b } keys %block) {
    print "$key\n";
    my @matched = grep { /exclusion/i } @{$block{$key}};
    if (@matched == 1) {
        foreach my $line (@{$block{$key}}) {
            $line =~ s/Exclusion/0/i;
            print "$line\n";
        }
    }
    else {
        foreach my $line (@{$block{$key}}) {
            print "$line\n";
        }
    }
}
