perl - 大きなファイルを Perl のハッシュにロードする (BLAST テーブル)

Question

私は perl の初心者です。クエリを手伝ってください...ブラストテーブルから情報を抽出しようとしています (以下はそのように見えます): これは標準のブラストテーブル入力です...基本的に、読み取りのリストに関する情報を抽出したい (下の 2 番目のスクリプトを見て、私が何をしたいのかを理解してください)... とにかく、これはまさに 2 番目のスクリプトで行ったことです:

入力:

1) ブラストテーブル:

38.1    0.53    59544   GH8NFLV01A02ED  GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345  1   YP_003242370    Dynamin family protein [Paenibacillus sp. Y412MC10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590
34.3    7.5 123828  GH8NFLV01A03QJ  GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1   XP_002639994    Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3   0   52.1739130434783    32.6086956521739    46  452 367 10.1769911504425    12.5340599455041    111 248 79  124
37.7    0.70    62716   GH8NFLV01A09B8  GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306
37.7    0.98    33114   GH8NFLV01A0H5C  GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   -3  0   73.5294117647059    52.9411764705882    34  573 703 5.93368237347295    4.83641536273115    131 232 273 306
103 1e-020  65742   GH8NFLV01A0MXI  GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475  1   ABZ08973    hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14]  2   0   77.9411764705882    77.9411764705882    68  475 151 14.3157894736842    45.0331125827815    2   205 1   68
41.6    0.053   36083   GH8NFLV01A0QKX  GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1   XP_766153   hypothetical protein [Theileria parva strain Muguga]    -1  0   66.6666666666667    56.6666666666667    30  526 304 5.70342205323194    9.86842105263158    392 481 31  60
45.4    0.003   78246   GH8NFLV01A0Z29  GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1   ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1]   3   0   51.8518518518518    38.8888888888889    54  432 193 12.5    27.979274611399 48  209 97  150
71.6    4e-011  97250   GH8NFLV01A14MR  GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314  1   ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244]  1   0   92.5    92.5    40  314 311 12.7388535031847    12.8617363344051    193 312 13  52
58.2    5e-007  154555  GH8NFLV01A1KCH  GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1   ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244]  1   0   82.051282051282 82.051282051282 39  267 311 14.6067415730337    12.540192926045 142 258 1   39

2) 読み取りリスト:

GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc

3）私が望む出力：

37.7    0.70    62716   GH8NFLV01A09B8  GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1   XP_002756773    PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus]   1   0   73.5294117647059    52.9411764705882    34  512 703 6.640625    4.83641536273115    43  144 273 306
38.1    0.53    59544   GH8NFLV01A02ED  GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345  1   YP_003242370    Dynamin family protein [Paenibacillus sp. Y412MC10] -1  0   48.936170212766 40.4255319148936    47  345 1213    13.6231884057971    3.87469084913438    31  171 544 590

抽出したい読み取り名のリスト (4 番目の列にあります) が与えられた場合、最初のリストの情報のサブセットが必要です。読み取りリストをハッシュする代わりに (のみ?)、ブラストテーブル自体をハッシュしたいのですが、列 4 の情報を (ブラストテーブルの) 列 4 の情報をキーとして使用して、各キーの値を抽出します。これは、そのキーが複数の値を持つ場合でも (つまり、各読み取り名が実際には複数のヒットを持っているか、関連付けられている可能性があります)。 blast 結果はテーブルに表示されます)、値にはそのキー (readname) を含む WHOLE 行が含まれることに注意してください。

私のgreplist.plスクリプトはこれを行いますが、非常に遅いと思います（間違っていれば訂正してください）、テーブル全体をハッシュにロードすることで、これは非常に高速になるはずです...

ご協力ありがとうございました。

私のスクリプト: 壊れたもの (mambo5.pl)

#!/usr/bin/perl -w
# purpose:  extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");

my $readName;

while ( <LIST> )
{
    #########;
    if(/^(.*?)$/)#
    {
        $readName=$1;#
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "bingo!";
            my $output =$hash{$readName};
            print OUT "$output\n";
        }
        else 
        {
            print "it aint workin\n";
            #print %hash;
        }           
    }
}
close (LIST);

スローアンドクイックチート (これは機能します) は非常に低速です (私のブラストテーブルは約 400MB から 2GB になる可能性があります。なぜそんなに遅いのかわかるはずです)。

#!/usr/bin/perl -w
## 
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script

use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
    if ($line=~/^(.*)$/) 
    {
        $read = $1;
        print "$read\n";
        system("grep \"$read\" $db >$read\_.out\n");
    }


    #system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");

その 4 番目の列をキーとして定義する方法がわかりません。分割機能を使用できるかもしれませんが、2 列を超えるテーブルに対してこれを行う方法を見つけようとしましたが、役に立ちませんでした...お願いしますヘルプ！簡単に外れる方法があれば教えてください

ありがとう！

score 4 · Accepted Answer

私は反対のことをします。つまり、readlist ファイルをハッシュに読み込み、大きなブラストファイルを調べて、目的の行を出力します。

#!/usr/bin/perl 
use strict;
use warnings;
use 5.010;

# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;

open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
    chomp;
    # retrieve the key (4th column)
    my ($key) = (split/\s+/)[3];
    # print the line if the key exists in the hash
    say $_ if exists $readslist{$key};
}
close $fh_blast;

score 0 · Accepted Answer

出来上がり、これを行う2つの方法、1つはperlとは何の関係もありません：

awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table

Mambo6.pl

#!/usr/bin/perl -w
# purpose:  extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time. 
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
    #chomp;
    if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
    {
        #print "$1\n";
        $key= $5;#read
        $val= $1;#whole row; notice the brackets around the whole match.
        $hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
    }
    else {
        print "something wrong with format";
    }
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");

my $readName;

while ( <LIST> )
{
    #########;
    if(/^(.*?)$/)#
    {
        $readName=$1;#
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "$readName\n";
            my $output =$hash{$readName};
            print OUT "$output";
        }
        else 
        {
            #print "it aint workin\n";
        }           
    }
}
close (LIST);
close (OUT);

ワンライナーはより速く、おそらく私のスクリプトよりも優れているので、もっと簡単な方法を見つけることができる人もいると思います...私が望むことを実行するので、これを立てると思いました。

score 0 · Accepted Answer

ブラストファイルを一時的にインデックス付きシーケンシャルファイルに変換するインデックスを作成することをお勧めします。それを読み、各キーのすべてのレコードが始まるファイル内のアドレスのハッシュを作成します。

その後、seek必要なレコードを取得するために、ファイル内の正しい場所に移動するだけです。これは、大きなファイルを一度だけ読み取る必要があるため、ほとんどの単純なソリューションよりも確実に高速です。このコード例は示しています。

use strict;
use warnings;

use Fcntl qw/SEEK_SET/;

my %index;

open my $blast, '<', 'blast.txt' or die $!;

until (eof $blast) {
  my $place = tell $blast;
  my $line = <$blast>;
  my $key = (split ' ', $line, 5)[3];
  push @{$index{$key}}, $place;
}

open my $reads, '<', 'reads.txt' or die $!;

while (<$reads>) {

  next unless my ($key) = /(\S+)/;
  next unless my $places = $index{$key};

  foreach my $place (@$places) {
    seek $blast, $place, SEEK_SET;
    my $line = <$blast>;
    print $line;
  }
}

perl - 大きなファイルを Perl のハッシュにロードする (BLAST テーブル)

3 に答える 3

Related

Reference