perl - Using Perl to merge data from one tab-delimited file into another

Question

I have a tab-delimited file like this (DIVERGE in my script):

contig04730 contigK02622 0.3515
contig04733 contigK02622 0.3636
contig14757 contigK03055 0.4

And I have a second tab-delimited file like this (DATA):

contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap
contig04730 F GO:0005528 reproduction GO:0001113 eggs
contig14757 P GO:0123456 immune GO:0003456 cells
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding
contig14757 C GO:0000001 immune GO:00066669 more_cells

I am trying to append the second and third columns of the first file to the matching rows of the second file, to get this (OUT):

contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4

This is the Perl script I am trying to use (I am adapting something I found here, and I am very new to Perl):

#!/usr/bin/env/perl

use strict;
use warnings;

#open the ortholog contig list
open (DIVERGE, "$ARGV[0]") or die "Error opening the input file with contig pairs";

#hash to store contig IDs
my ($espr, $liya, $divergence) = split("\t", $_);

#read through the ortho contig list and read into memory
while(<DIVERGE>){
    chomp $_;   #get rid of ending whitepace
    ($espr, $liya, $divergence)->{$_} = 1;
}
close(DIVERGE);

#open output file
open(OUT, ">$ARGV[2]") or die "Error opening the output file";

#open data file
open(DATA, "$ARGV[1]") or die "Error opening the sequence pairs file\n";

while(<DATA>){
    chomp $_;

    my ($contigs, $FPC, $GOslim, $slimdesc, $GOterm, $GOdesc) = split("\t", $_);
    if (defined $espr->{$contigs}) {
        print OUT "$_", "\t$liya\t$divergence", "\n";
    }
}
close(DATA);
close(OUT);

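However, I am getting the errors "Useless use of private variable in void context" at line 15 and "Use of uninitialized value $_ in split" at line 10. I have only a very basic grasp of Perl terminology and variables, so if anyone could point out where I am going wrong and how to fix it, I would greatly appreciate it.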

score 3 · Accepted Answer

This is an opportunity to use the Text::CSV module. The benefit of using a proper parser for CSV data is, of course, that edge cases will not break your data.

use strict;
use warnings;
use Text::CSV;

my $div     = "diverge.txt";   # file names could also come from the
my $data    = "data.txt";      # command line: my ($div, $data) = @ARGV
my $csv     = Text::CSV->new({
            binary      => 1,
            eol     => $/,
            sep_char    => "\t",
        });
my %div;

open my $fh, "<", $div or die $!;

while (my $row = $csv->getline($fh)) {
    my $key = shift @$row;              # first col is key
    $div{$key} = $row;                  # store row entries 
}
close $fh;

open $fh, "<", $data or die $!;

while (my $row = $csv->getline($fh)) {
    my $key = $row->[0];                # first col is key (again)
    push @$row, @{ $div{$key} || [] };  # add stored values, if any
    $csv->print(*STDOUT, $row);         # print using Text::CSV's method
}
close $fh;

Output:

contig04730     F       GO:0000228      nuclear GO:0000783      telomere_cap contigK02622    0.3515
contig04730     F       GO:0005528      reproduction    GO:0001113      eggs    contigK02622    0.3515
contig14757     P       GO:0123456      immune  GO:0003456      cells   contigK03055    0.4
contig14757     P       GO:0000782      nuclear GO:0001891      DNA_binding    contigK03055    0.4
contig14757     C       GO:0000001      immune  GO:00066669     more_cells    contigK03055    0.4

Note that the output looks different because it is tab-delimited, whereas the sample in the question is space-delimited.
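
The edge cases mentioned at the top of this answer are easy to hit with a plain split. The following is a minimal sketch in core Perl only (no Text::CSV), using a made-up row that is not from the question's data: its fourth field contains a space and its last field is empty. It only illustrates why the choice of splitter matters; it is not a replacement for a real parser.

```perl
use strict;
use warnings;

# Hypothetical row (not from the question): field 4 contains a space,
# and field 6, the last one, is empty.
my $line = join "\t", 'contig04730', 'F', 'GO:0000228',
                      'nuclear part', 'GO:0000783', '';

my @by_space = split ' ',  $line;       # splits on any run of whitespace
my @by_tab   = split /\t/, $line;       # tab only, but drops trailing empty fields
my @by_tab_k = split /\t/, $line, -1;   # a negative limit keeps trailing empty fields

printf "whitespace split: %d fields, field 4 = '%s'\n", scalar @by_space, $by_space[3];
printf "tab split:        %d fields, field 4 = '%s'\n", scalar @by_tab,   $by_tab[3];
printf "tab split, -1:    %d fields, field 4 = '%s'\n", scalar @by_tab_k, $by_tab_k[3];
```

The whitespace split cuts "nuclear part" in two, and the plain tab split reports five fields instead of six. Only the last form round-trips the row as written.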

score 2 · Accepted Answer

What I would do:

#!/usr/bin/env perl

use strict; use warnings;

open my $fh1, "<", "file1" or die $!;
open my $fh2, "<", "file2" or die $!;

my %hash;

while (<$fh1>) {
    chomp;
    my @F = split;
    $hash{$F[0]} = join "\t", @F[1..2];
}

while (<$fh2>) {
    chomp;
    my @F = split;
    print join("\t", $_, $hash{$F[0]}), "\n";
}

close $fh1;
close $fh2;

Output:

contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap        contigK02622    0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs   contigK02622    0.3515
contig14757 P GO:0123456 immune GO:0003456 cells        contigK03055    0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055    0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells  contigK03055    0.4
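
For comparison, here is a hedged sketch of the same hash approach mapped back onto the question's script. In that script, the split on line 10 runs before any line has been read, so $_ is still uninitialized. The expression ($espr, $liya, $divergence)->{$_} on line 15 evaluates and discards the first two variables instead of filling a lookup keyed by contig ID. The sketch below moves the split inside the loop and stores the two extra columns in a hash. It assumes the real files are truly tab-separated. The names %diverge, $diverge_txt and $data_txt are mine, and the last DATA row is invented to show an unmatched key. In-memory filehandles stand in for the real files so that it runs on its own.

```perl
use strict;
use warnings;

# Sample rows, tab-separated (assumption: the real files use tabs).
my $diverge_txt = "contig04730\tcontigK02622\t0.3515\n"
                . "contig04733\tcontigK02622\t0.3636\n"
                . "contig14757\tcontigK03055\t0.4\n";
my $data_txt    = "contig04730\tF\tGO:0000228\tnuclear\tGO:0000783\ttelomere_cap\n"
                . "contig14757\tC\tGO:0000001\timmune\tGO:00066669\tmore_cells\n"
                . "contig99999\tP\tGO:0000002\tunknown\tGO:0000003\tno_partner\n"; # hypothetical

# Pass 1: build the lookup, keyed on the first column.
my %diverge;
open my $div_fh, '<', \$diverge_txt or die "Cannot open DIVERGE: $!";
while (my $line = <$div_fh>) {
    chomp $line;
    my ($espr, $liya, $divergence) = split /\t/, $line;   # split inside the loop
    $diverge{$espr} = [ $liya, $divergence ];
}
close $div_fh;

# Pass 2: append the stored columns to each matching DATA row.
my @out;
open my $data_fh, '<', \$data_txt or die "Cannot open DATA: $!";
while (my $line = <$data_fh>) {
    chomp $line;
    my ($contig) = split /\t/, $line;
    next unless exists $diverge{$contig};                 # skip rows with no partner
    push @out, join "\t", $line, @{ $diverge{$contig} };
}
close $data_fh;

print "$_\n" for @out;
```

Skipping unmatched rows mirrors the if (defined ...) test in the question's script. Dropping the next line and appending conditionally would keep them instead.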

score 2 · Accepted Answer

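If I have understood your intent correctly, this can be done in one line (at least on Linux) with the join command, provided both files are sorted on the first column, as they are here: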

 $ cat DATA
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap
contig04730 F GO:0005528 reproduction GO:0001113 eggs
contig14757 P GO:0123456 immune GO:0003456 cells
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding
contig14757 C GO:0000001 immune GO:00066669 more_cells

 $ cat DIVERGE
contig04730 contigK02622 0.3515
contig04733 contigK02622 0.3636
contig14757 contigK03055 0.4

 $ join DATA DIVERGE
contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells contigK03055 0.4

score 1 · Accepted Answer

Here is another option:

use strict;
use warnings;

my $data = pop;
my %diverge = map { /(\S+)\t+(.+)/; $1 => $2 } <>;
push @ARGV, $data;

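while (<>) {
    chomp;
    # append the stored columns when the first field has a match
    $_ .= "\t$diverge{$1}" if /(\S+)/ and $diverge{$1};
    print "$_\n";   # always restore the newline, matched or not
}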

Usage: perl script.pl DIVERGE_File DATA_File [>outFile] (script.pl stands for whatever name you save the script under)

Output on your dataset:

contig04730 F GO:0000228 nuclear GO:0000783 telomere_cap    contigK02622 0.3515
contig04730 F GO:0005528 reproduction GO:0001113 eggs   contigK02622 0.3515
contig14757 P GO:0123456 immune GO:0003456 cells    contigK03055 0.4
contig14757 P GO:0000782 nuclear GO:0001891 DNA_binding contigK03055 0.4
contig14757 C GO:0000001 immune GO:00066669 more_cells  contigK03055 0.4
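
The script above is terse, so here is an unrolled sketch of what it appears to do. It relies on documented behaviour of <>: once the files in @ARGV have been read to the end, putting another filename into @ARGV lets the next <> loop read that file. The temporary files, the invented unmatched row, and the variable names are stand-ins of mine so that the sketch runs without real input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in input files (assumption: real names would come from the command line).
my ($div_fh, $div_file) = tempfile(UNLINK => 1);
print {$div_fh} "contig04730\tcontigK02622\t0.3515\n",
                "contig14757\tcontigK03055\t0.4\n";
close $div_fh;

my ($data_fh, $data_file) = tempfile(UNLINK => 1);
print {$data_fh} "contig04730\tF\tGO:0000228\tnuclear\tGO:0000783\ttelomere_cap\n",
                 "contig00000\tP\tGO:0000002\tunknown\tGO:0000003\tno_partner\n"; # hypothetical
close $data_fh;

@ARGV = ($div_file, $data_file);   # simulates: perl script.pl DIVERGE DATA

# Step 1: take the last argument (DATA) off @ARGV so <> only sees DIVERGE.
my $data = pop @ARGV;

# Step 2: <> in list context reads every DIVERGE line.
# Key is column 1, value is the rest of the line.
my %diverge;
for my $line (<>) {
    chomp $line;
    my ($key, $rest) = split /\t/, $line, 2;
    $diverge{$key} = $rest;
}

# Step 3: @ARGV is empty now; put DATA back so the next <> loop reads it.
push @ARGV, $data;

my @out;
while (my $line = <>) {
    chomp $line;
    my ($key) = split /\t/, $line;
    $line .= "\t$diverge{$key}" if exists $diverge{$key};
    push @out, $line;              # unmatched rows pass through unchanged
}

print "$_\n" for @out;
```

Using exists rather than a truth test is a small design choice: it also matches keys whose stored value happens to be "0" or empty.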
