perl - Reading .fasta sequences to extract nucleotide data, then writing to a TabDelimited file

Question

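Before going any further, I would like to point readers to my previous Perl questions, as I am a beginner at all of this.

These are my posts from the past few days, in chronological order.

As mentioned above, thanks to help from several of you I managed to figure out the first two queries, and I learned a lot from them. I am truly grateful. For someone who knew nothing about this, and who still feels that way, the help was practically a godsend.

The last query remains unsolved, and this is a continuation of it. I did look at some of the recommended texts, but since I am trying to finish this before Monday, I am not sure whether I have completely overlooked something. Either way, I have attempted the task.

As you know, the task is to open and read a .fasta file (I think I finally got that part mostly right, hallelujah!), read each sequence, compute its relative G+C nucleotide content, and then write the gene names and their respective G+C content to a TAB-delimited file.

I have had a go at this, but I know the program is not ready to run and give the desired results, so I would appreciate advice, or an example of how to do this. As with my previously solved queries, I would like to keep a style similar to what I have already done, even though it may not be the most convenient or efficient way. That way I know what I am doing at each step, even if it looks like I am spamming!

Anyway, the .fasta file looks like this: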

>label
sequence
>label
sequence
>label
sequence

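I don't know how to open a .fasta file, so I can't tell which label applies to which sequence, but I do know that the genes should be labeled gag, pol, or env. Do I need to open the .fasta file to know what I am doing, or can I do it "blind" using the format above?

It may be completely obvious, but I am still struggling with all of this. I feel I should have caught on by now!

Anyway, this is the code I currently have: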

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    foreach my $line ($infile) {
        if($line = ~/^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
            next;
        } elsif($line = ~/^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
            next; 
        } elsif($line = ~/^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
            next;
        } else {
            $sequence = $line;
        }
    }
    {
        $sequence =~ s/\s//g;               # Whitespace characters are removed
        return $sequence;
    }

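I don't know whether anything here is correct, but when I ran it I got a syntax error at line 35 (which is past the last line, so there is nothing there!). It said "at EOF". That is about all I can point out. Beyond that, I am trying to work out how to calculate the amount of G+C nucleotides in each sequence and then tabulate it properly in the output .txt file. I believe that is what a TabDelimited file means?

Either way, I apologize if this query seems too long, "dumb", or repetitive, but that said, I could not find any information directly relevant to it. I would be grateful for any help, ideally with each step explained!!

Kind regards.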

score 2 · Accepted Answer

You have an extra brace very near the end. This should work:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

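I also edited your return line. return exits a subroutine rather than a loop, and since this code is not inside a subroutine, Perl would reject it. I assume what you want is to print the sequence to the file, so that is what I did. Note that the version above also drops the inner foreach loop and writes the match operator as =~ (without a space), both of which differ from your code. To get a tab-delimited format, you may need to do some further transformation first.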
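As a rough illustration of that further transformation, here is a minimal sketch of one way to collect each label with its sequence, compute the G+C content, and write one tab-separated line per gene. The subroutine names gc_content and fasta_to_tsv are made up for this example, the demo data is invented, and reporting G+C as a fraction of the sequence length with four decimals is an assumption (your assignment may want a percentage instead).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of G and C bases in a sequence (case-insensitive); 0 for an empty sequence.
# Hypothetical helper, not part of the original script.
sub gc_content {
    my ($seq) = @_;
    my $length = length $seq;
    return 0 if $length == 0;
    my $gc = ($seq =~ tr/GCgc//);    # tr with an empty replacement list only counts
    return $gc / $length;
}

# Read FASTA records from the filehandle $in and write "label<TAB>GC" lines to $out.
# Hypothetical helper, not part of the original script.
sub fasta_to_tsv {
    my ($in, $out) = @_;
    my ($label, $sequence) = (undef, '');

    # Write out the record collected so far, if there is one.
    my $flush = sub {
        return unless defined $label;
        printf {$out} "%s\t%.4f\n", $label, gc_content($sequence);
    };

    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        next if $line =~ /^\s*#/;    # skip '#' comment lines, as in the question
        if ($line =~ /^>\s*(.*?)\s*$/) {
            my $next_label = $1;     # header line: a new record starts here
            $flush->();
            ($label, $sequence) = ($next_label, '');
        } else {
            $line =~ s/\s+//g;       # remove stray whitespace
            $sequence .= $line;      # a sequence may span several lines
        }
    }
    $flush->();                      # the last record has no header after it
}

# Demo on an in-memory FASTA string (made-up data, not the real Lab1_seq.fasta).
my $fasta = ">gag\nATGC\nGGCC\n>pol\nAATT\n";
open my $in, '<', \$fasta or die "Cannot open in-memory input: $!";
my $tsv = '';
open my $out, '>', \$tsv or die "Cannot open in-memory output: $!";
fasta_to_tsv($in, $out);
close $out;
close $in;
print $tsv;
```

To run this on real files, the two in-memory open calls could be replaced with open my $in, '<', 'Lab1_seq.fasta' and open my $out, '>', 'Lab1_SeqOutput.txt', keeping the or die checks. The record is only written when the next header (or the end of the file) is reached, because a FASTA sequence can be split over several lines.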
