python - Adding multiple sequences from a FASTA file to a list in Python

Question

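I am trying to organize a file that contains multiple sequences. To do this, I want to add the names to one list and the sequences to a separate list that runs parallel to the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow them to a separate list. I tried appending the sequence lines to an empty string, but that appended every line of every sequence to one single string.

All of the names start with a '>'.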

def Name_Organizer(FASTA,output):

    import os
    import re

    in_file=open(FASTA,'r')
    dir,file=os.path.split(FASTA)
    temp = os.path.join(dir,output)
    out_file=open(temp,'w')

    data=''
    name_list=[]

    for line in in_file:

        line=line.strip()
        for i in line:
            if i=='>':
                name_list.append(line)
                break
            else:
                line=line.upper()
        if all([k==k.upper() for k in line]):
            data=data+line

    print(data)

How can I add the sequences to a list as a set of strings?

The input file looks like this:

[Image: sample of the input FASTA file]

score 4 · Accepted Answer

If you are working with Python and FASTA files, you might want to look into installing BioPython. It already includes this parsing functionality, and a whole lot more.

Parsing a FASTA file is as simple as this:

from Bio import SeqIO
for record in SeqIO.parse('filename.fasta', 'fasta'):
    print(record.id, record.seq)

score 1 · Accepted Answer

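You need to reset the string when you hit a marker line, and store the last sequence once the loop has finished, like this:

def Name_Organizer(FASTA, output):

    import os

    in_file = open(FASTA, 'r')
    dir, file = os.path.split(FASTA)
    temp = os.path.join(dir, output)
    out_file = open(temp, 'w')  # opened as in the question; nothing is written here

    data = ''
    name_list = []
    seq_list = []

    for line in in_file:

        line = line.strip()
        if line.startswith('>'):
            # marker line: store the sequence collected so far, then reset
            name_list.append(line)
            if data:
                seq_list.append(data)
                data = ''
        else:
            data = data + line.upper()

    # the last sequence is not followed by a marker line, so store it here
    if data:
        seq_list.append(data)

    in_file.close()
    out_file.close()

    print(seq_list)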

Of course, depending on the size of the file, it may be faster to collect the lines in a list and join them than to keep concatenating strings:

data = []

# ...

data.append(line) # repeatedly

# ...

seq_list.append(''.join(data)) # each time you get to a new marker line
data = []
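
Put together, the list-and-join idea above might look like the following. This is only a minimal sketch: the function name read_fasta, the variable chunks, and the sample lines are made up for illustration, and it assumes Python 3 and that every sequence line follows a '>' line.

```python
def read_fasta(lines):
    """Return (names, seqs) as two parallel lists from an iterable of FASTA lines."""
    names, seqs = [], []
    chunks = None  # pieces of the current sequence; None until the first '>' line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # New marker line: close the previous record, if there was one.
            if chunks is not None:
                seqs.append(''.join(chunks))
            names.append(line[1:])  # drop the leading '>'
            chunks = []
        elif chunks is not None:
            chunks.append(line.upper())
    # The last record is not followed by a marker line, so close it here.
    if chunks is not None:
        seqs.append(''.join(chunks))
    return names, seqs


# Hypothetical sample input; a file opened with open(path) works the same way.
example = [">seq1 first", "acgt", "ttga", "", ">seq2", "GGCC"]
names, seqs = read_fasta(example)
print(names)  # ['seq1 first', 'seq2']
print(seqs)   # ['ACGTTTGA', 'GGCC']
```

Because a record is closed whenever the next marker line appears, a name with no sequence lines gets an empty string, so the two lists stay the same length.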

score 0 · Accepted Answer

I collect them into a dictionary first:

import sys

# read the file and strip surrounding whitespace from each line
with open(sys.argv[1]) as handle:
    lines = [x.strip() for x in handle]
fasta = {}
for line in lines:
    if not line:
        continue
    # create the sequence name in the dict and a variable
    if line.startswith('>'):
        sname = line
        if line not in fasta:
            fasta[line] = ''
        continue
    # add the sequence to the last sequence name variable
    fasta[sname] += line
# just to facilitate the input for my function
lst = list(fasta.values())
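
If the two parallel lists from the question are still wanted, they can be read back out of such a dictionary. The snippet below is a small sketch with a hand-written fasta dictionary standing in for the parsed file (the names and sequences are invented). It relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 on.

```python
# Hypothetical result of the dictionary-building loop above.
fasta = {">seq1": "ACGTTTGA", ">seq2": "GGCC"}

# Keys and values come out in the same (insertion) order,
# so the two lists line up index by index.
names = list(fasta.keys())
seqs = list(fasta.values())

for name, seq in zip(names, seqs):
    print(name, len(seq))
```

One thing to keep in mind with the dictionary approach: two records that share the same name line end up merged under a single key, which the two-list approach does not do.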
