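I have just started using Perl and have a question. I have a PHYLIP file that I need to convert to FASTA. I have started writing a script. First, I removed the spaces from the lines. Next, I need to lay out the sequences so that every line holds 60 amino acids and each sequence identifier is printed on its own line. Could someone give me some advice?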

2 Answers

The BioPerl Bio::AlignIO module may help. It supports the PHYLIP sequence format.

phylip2fasta.pl

use strict;
use warnings;
use Bio::AlignIO; 

# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format

my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;
my $in  = Bio::AlignIO->new(-file   => $inputfilename ,
                         -format => 'phylip',
                         -interleaved => 1);
my $out = Bio::AlignIO->new(-fh   => \*STDOUT ,
                         -format => 'fasta');

while ( my $aln = $in->next_aln() ) {
    $out->write_aln($aln);
}

$ perl phylip2fasta.pl test.phylip

>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

test.phylip (example data from http://evolution.genetics.washington.edu/phylip/doc/sequence.html)

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
answered 2013-03-14T02:22:20.893

If you have access to BioPerl, I recommend using it (see the other answer). If not, here is a simple script I used years ago for an old assignment. It may work for you.

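One caveat: it prints each FASTA sequence in full on a single line, so you will need to edit the final print statement to wrap the output at a fixed width (the question asks for 60 residues per line).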

#!/usr/bin/perl

use warnings;
use strict;

<DATA> =~ /(\d+)/; # first number is number of species

my $num_species = $1;
my $i = 0;
my @species;
my @acids;

# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {   
    my @line = split /\s+/, <DATA>;
    chomp @line;

    push @species, shift (@line);
    push @acids, join ("", @line);

}

# Get the rest of the AAs
$i = 0;
while (<DATA>) {
    s/\s+//g;            # remove all whitespace, including \r and the newline

    next unless length;  # skip the blank lines between blocks

    $acids[$i] .= $_;
    $i = ($i + 1) % $num_species;   # cycle through the species in order

}

# Print them
for ($i = 0; $i < $num_species; $i++) {
    print "> ", $species[$i], "\n";

    # comment out the next line if you want to keep the gaps ("-")
    $acids[$i] =~ s/-//g;
    print $acids[$i], "\n\n";
}

# Simple PHYLIP Amino Acid file
__DATA__
 10 234
Cow          MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp         MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken      MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human        MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach        MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse        MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat          MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal         MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale        MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog         MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL

             THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
             TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
             S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
             TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
             TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
             THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
             THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
             THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
             THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
             TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI

             GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
             GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
             GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
             GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
             GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
             GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
             GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
             GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
             GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
             GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT

             RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
             RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
             RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
             RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
             RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
             RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
             RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
             RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
             RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
             RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC

             GSNHSFMPIV LELVPLKYFE KWSASML--- ----
             GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
             GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
             GANHSFMPIV LELIPLKIFE M-------GP VFTL
             GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
             GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
             GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
             GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
             GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
             GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--

Output:

> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML

> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS

> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS

> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL

> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS

> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML

> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML

> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
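
The output above keeps each sequence on a single line. Wrapping it at 60 residues per line, as the question asks, could be done along these lines. This is only a minimal sketch: `wrap_sequence` is a hypothetical helper name that is not part of the original script, and the 60-character default is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): break a sequence into
# lines of at most $width characters and join them with newlines.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;                          # assumed default: 60 residues per line
    my @chunks = $seq =~ /(.{1,$width})/g;  # greedy chunks; the last one may be shorter
    return join "\n", @chunks;
}

# Small demonstration with a made-up 140-residue sequence.
my $demo = "ACDEFGHIKLMNPQRSTVWY" x 7;
print ">Example\n", wrap_sequence($demo, 60), "\n";
```

In the script above, the final `print $acids[$i], "\n\n";` could then become `print wrap_sequence($acids[$i], 60), "\n\n";`.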
answered 2013-03-14T15:29:30.877