perl - Read a tab-delimited file, count occurrences, and delete rows

Question

I'm fairly new to programming and am trying to solve this problem. I have a file like this:

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    77  T   C   T   T   T   T           T
tg93    79  C   -   C       C   C   -   -   
tg93    79  C   G   C   C   C   C   G       C
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    105 A   G   A   A   A   A   A   G   A
tg93    108 A   G   A   A   A   A   G   A   A
tg93    114 T   C   T   T   T   T   T   C   T
tg93    131 A   C   A   A   A   A   A   A   A
tg93    136 G   C   C   G   C   C   G   G   G
tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC

The column headers in this file are:

CHROM - name, POS - position, REF - reference, ALT - alternate, 10 - 16_sample.bam - sample IDs

I want to count how many times the values in the REF and ALT columns occur among the samples. If either of them occurs fewer than 2 times, the row must be deleted.

For example, in the first row, REF is 'T' and ALT is 'C'. Across the 7 samples there are 5 T's, 2 blanks, and no C, so this row must be deleted.

In the second row, REF is 'C' and ALT is '-'. The 7 samples contain 3 C's, 2 '-', and 2 blanks, so this row is kept because both C and - occur at least 2 times. Blanks are always ignored when counting.

The final file after filtering would be:

#CHROM   POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

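I can read the columns into an array and print them in my code, but I don't know how to start a loop that reads the bases, counts their occurrences, and decides whether to keep the row. Can anyone tell me how to proceed? Alternatively, sample code that I could modify would help.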

score 2 · Accepted Answer

#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                   # Read and output the header.

while (<>) {                        # Read a line.
   chomp;                           # Remove the newline from the line.
   my ($chrom, $pos, $ref, $alt, @samples) =
      split /\t/;                   # Parse the remainder of the line.

   my %counts;                      # Count the occurrences of sample values.
   ++$counts{$_} for @samples;      # e.g. Might end up with $counts{"G"} = 3.

   print "$_\n"                     # Print line if we want to keep it.
      if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
      && ($counts{$alt} || 0) >= 2;
}

Output:

CHROM    POS     REF     ALT    10_sample.bam   11_sample.bam   12_sample.bam   13_sample.bam   14_sample.bam   15_sample.bam   16_sample.bam 
tg93    79  C   -   C       C   C   -   -   
tg93    80  G   A   G   G   G   G   A   A   G
tg93    81  A   C   A   A   A   A   C   C   C
tg93    86  C   A   C   C   A   A   A   A   C
tg93    136 G   C   C   G   C   C   G   G   G

You included 108 in your desired output, but that row has only one instance of ALT among the 7 samples.
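As a quick check of that count, here is a minimal sketch that tallies the samples for the row at position 108. The row is copied from the question and joined with tabs here, which assumes the real file is tab-delimited; the `grep { length }` step is an addition that skips blank samples explicitly.

```perl
use strict;
use warnings;

# Row at position 108 from the question, rebuilt with tab separators (assumption).
my $line = join "\t", qw(tg93 108 A G A A A A G A A);
my ($chrom, $pos, $ref, $alt, @samples) = split /\t/, $line;

my %counts;
++$counts{$_} for grep { length } @samples;   # Skip blank samples.

# Prints: REF A: 6, ALT G: 1
printf "REF %s: %d, ALT %s: %d\n",
    $ref, $counts{$ref} || 0, $alt, $counts{$alt} || 0;
```

Since G occurs only once, the row fails the "at least 2" rule and is dropped.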

Usage:

perl script.pl file.in >file.out

Or in-place:

perl -i script.pl file

score 0 · Accepted Answer

Here is an approach that does not assume tab-separated fields:

use IO::All;
my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach(@lines) {
    my %letters;    # per-line counts for the REF and ALT values

    # use a regex with capture groups to extract the data - this does not depend on tab separated fields,
    # but it does assume single-character values separated by exactly three whitespace characters
    if(/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {

        # initialize hash counts
        $letters{$1} = 0;
        $letters{$2} = 0;

        # loop through the samples and increment the counter when matches are found
        foreach($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                ++$letters{$1};
            }
            if ($_ eq $2) {
                ++$letters{$2};
            }
        } 

        # if the counts for both REF and ALT are greater than or equal to 2, print the line
        if($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}
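If the separators really are arbitrary whitespace, another option might be to split on whitespace instead of matching fixed-width captures. The sketch below is only an illustration under stated assumptions: `keep_line` is a hypothetical helper name, the first four columns are assumed never to be blank, and blank samples simply disappear in the split, which is harmless here because blanks are ignored when counting. Unlike the fixed-width regex, it also copes with multi-character values such as CTCTC.

```perl
use strict;
use warnings;

# Hypothetical helper: true if both REF and ALT occur at least $min times
# among the whitespace-separated sample values of a row.
sub keep_line {
    my ($line, $min) = @_;
    my ($chrom, $pos, $ref, $alt, @samples) = split ' ', $line;
    return 0 unless defined $alt;      # Skip empty or malformed lines.
    my %counts;
    ++$counts{$_} for @samples;        # Blank samples never appear here.
    return ( ($counts{$ref} || 0) >= $min
          && ($counts{$alt} || 0) >= $min ) ? 1 : 0;
}

# Three rows copied from the question.
my @rows = (
    "tg93    77  T   C   T   T   T   T           T",
    "tg93    79  C   -   C       C   C   -   -   ",
    "tg93    150 CTCTC   -       CTCTC       -   CTCTC       CTCTC",
);
for my $row (@rows) {
    printf "%s: %s\n", (keep_line($row, 2) ? "keep" : "drop"), $row;
}
```

Only the second row (position 79, REF C, ALT -) is reported as "keep"; the other two are dropped because ALT occurs fewer than 2 times.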
