python - getting a gene sequence from entrez using biopython

Question

This is what I want to do. I have a list of gene names, for example: [ITGB1, RELA, NFKBIA]

Looking through the Biopython help and the tutorial for the Entrez API, I came up with this:

x = ['ITGB1', 'RELA', 'NFKBIA']
for item in x:
    handle = Entrez.efetch(db="nucleotide", id=item ,rettype="gb")
    record = handle.read()
    out_handle = open('genes/'+item+'.xml', 'w') #to create a file with gene name
    out_handle.write(record)
    out_handle.close()

But this keeps erroring out. I have discovered that it works if the id is a numeric ID (although it has to be passed as a string, e.g. '186972394'), like so:

handle = Entrez.efetch(db="nucleotide", id='186972394' ,rettype="gb")

This gets me the info I want, which includes the sequence.

So now to the question: how can I search by gene name (since I do not have ID numbers), or easily convert my gene names to IDs, to get the sequences for my gene list?

Thank you,

score 4 · Accepted Answer

First, start with the gene name, for example ATK1:

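from Bio import Entrez
Entrez.email = 'you@example.com'  # placeholder: NCBI asks for a contact address

item = 'ATK1'
animal = 'Homo sapiens'
search_string = item+"[Gene] AND "+animal+"[Organism] AND mRNA[Filter] AND RefSeq[Filter]"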

This gives a search string that can be used to look up IDs.
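Since the question starts from a list of gene names, the same search string can be produced for each name. A minimal sketch, assuming a helper called build_search_term (a hypothetical name, not part of Biopython) and the organism spelled 'Homo sapiens':

```python
def build_search_term(gene, organism="Homo sapiens"):
    """Return an Entrez query restricted to RefSeq mRNA records for one gene."""
    return (
        f"{gene}[Gene] AND {organism}[Organism] "
        "AND mRNA[Filter] AND RefSeq[Filter]"
    )


# One query string per gene in the question's list.
genes = ["ITGB1", "RELA", "NFKBIA"]
terms = {gene: build_search_term(gene) for gene in genes}
print(terms["RELA"])
```

Each value in terms can then be passed as the term argument of Entrez.esearch, one gene at a time, so the returned IDs stay associated with the gene they came from.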

handle = Entrez.esearch(db="nucleotide", term=search_string)
record = Entrez.read(handle)  # parse the XML search result
ids = record['IdList']

This returns the IDs as a list, which is [] if no IDs were found. Now let's assume it returns one item in the list.

seq_id = ids[0] #you must implement an if to deal with the 0 or >1 cases
handle = Entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
record = handle.read()

This gives you a FASTA string that you can save to a file:

with open('myfasta.fasta', 'w') as out_handle:  # closes the file when done
    out_handle.write(record.rstrip('\n'))
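The steps in this answer can be put together, with the empty and multiple-hit cases handled explicitly. This is only a sketch: pick_single_id, fetch_fasta and save_fasta_files are hypothetical helper names, the email address is a placeholder, and raising on multiple hits is just one possible policy (a gene with several transcript variants may well return more than one ID).

```python
import os


def pick_single_id(ids, gene):
    """Return the only ID in ids, or raise if the search was empty or ambiguous."""
    if not ids:
        raise LookupError(f"no nucleotide record found for {gene}")
    if len(ids) > 1:
        raise LookupError(f"{len(ids)} records matched {gene}: {', '.join(ids)}")
    return ids[0]


def fetch_fasta(gene, organism="Homo sapiens", email="you@example.com"):
    """Search the nucleotide database for one gene and return its FASTA text."""
    # Imported here so pick_single_id can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    term = (
        f"{gene}[Gene] AND {organism}[Organism] "
        "AND mRNA[Filter] AND RefSeq[Filter]"
    )
    handle = Entrez.esearch(db="nucleotide", term=term)
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    seq_id = pick_single_id(ids, gene)
    handle = Entrez.efetch(db="nucleotide", id=seq_id,
                           rettype="fasta", retmode="text")
    fasta = handle.read()
    handle.close()
    return fasta


def save_fasta_files(genes, out_dir="genes"):
    """Write one <gene>.fasta file per gene, skipping genes without a unique hit."""
    os.makedirs(out_dir, exist_ok=True)
    for gene in genes:
        try:
            fasta = fetch_fasta(gene)
        except LookupError as err:
            print(f"skipping {gene}: {err}")
            continue
        with open(os.path.join(out_dir, f"{gene}.fasta"), "w") as out_handle:
            out_handle.write(fasta)
```

Calling save_fasta_files(['ITGB1', 'RELA', 'NFKBIA']) would then write genes/ITGB1.fasta and so on, printing a message for any gene that has no unique hit.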

score 0 · Accepted Answer

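Looking at section 8.3 of the tutorial, there seems to be a function that lets you search for terms and get the corresponding IDs (I know nothing about this library and even less about biology, so this may be completely wrong :)).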

>>> handle = Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["Count"]
'25'
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']

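As far as I can tell, id refers to the actual ID number as returned by the esearch function (in the IdList attribute of the response). However, if you use the term keyword, you can instead run a search and get the IDs of the matched items. This is completely untested, but assuming the search supports boolean operators (AND appears to work), you could try a query like: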

>>> handle = Entrez.esearch(db="nucleotide",term="ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]")
>>> record = Entrez.read(handle)
>>> record["IdList"]
# Hopefully your ids here...

To generate the term to insert, you could do something like this:

In [1]: l = ['ITGB1', 'RELA', 'NFKBIA']

In [2]: ' OR '.join('%s[Gene]' % i for i in l)
Out[2]: 'ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]'

Then you can convert record["IdList"] into a comma-separated string and pass it to the id argument of your original query, using something like:

In [3]: r = ['1234', '5678', '91011']

In [4]: ids = ','.join(r)

In [5]: ids
Out[5]: '1234,5678,91011'
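When several IDs are passed in one efetch call with rettype="fasta", the records come back together as one block of text. A rough sketch for splitting that text so each record can be saved separately; split_fasta and fetch_fasta_batch are hypothetical helpers, the email address is a placeholder, and Biopython's SeqIO.parse(handle, "fasta") is an alternative if sequence objects are wanted instead of plain strings:

```python
def split_fasta(text):
    """Map each record's identifier (first word after '>') to its FASTA entry."""
    records = {}
    current_id = None
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = [line]
        elif current_id is not None and line.strip():
            # Sequence line belonging to the most recent header.
            records[current_id].append(line)
    return {rec_id: "\n".join(lines) + "\n" for rec_id, lines in records.items()}


def fetch_fasta_batch(id_list, email="you@example.com"):
    """Fetch several nucleotide IDs in one request and split the result."""
    # Imported here so split_fasta can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder; NCBI asks for a real contact address
    handle = Entrez.efetch(db="nucleotide", id=",".join(id_list),
                           rettype="fasta", retmode="text")
    text = handle.read()
    handle.close()
    return split_fasta(text)
```

Note that the result is keyed by the identifier in each FASTA header, not by gene name, so a combined OR search does not by itself say which sequence belongs to which gene.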
