python - Parsing IDs from text in Python

Question

I have this text:

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

From this text I want to parse the IDs that come right after |gb| and collect them in a list.

I tried to use a regular expression, but could not get it to work.

score 3 · Accepted Answer

Split on the pipes, then skip everything up to the first 'gb'; the next element is the ID:

from itertools import dropwhile

text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)

Demo:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'

So there is no need for a regular expression.

Make this a generator function to get all the IDs:

from itertools import dropwhile

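def extract_ids(text):
    parts = iter(text.split('|'))
    # The None defaults keep StopIteration from escaping the generator,
    # which Python 3.7+ would turn into a RuntimeError (PEP 479).
    while next(dropwhile(lambda s: s != 'gb', parts), None) is not None:
        id_ = next(parts, None)  # the element right after 'gb'
        if id_ is not None:
            yield id_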

This gives:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']

Or you can use it in a simple loop:

for id in extract_ids(text):
    print(id)

score 2 · Accepted Answer

A regular expression should work:

import re
re.findall(r'gb\|([^|]*)\|', 'gb|AB1234|')

answered 2013-02-13T20:59:40.487

score 1 · Accepted Answer

In this case you can get by without a regular expression: split on '|gb|', then split the second part on '|' and take the first item:

s = 'the string from the question'
r = s.split('|gb|')
r[1].split('|')[0]

Of course, you should add a check that the first split returns a list with two or more items, but I think this will be faster than using a regular expression.
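
One way that check could look is sketched below. The function name ids_after_gb is made up for illustration and is not part of the answer above. It drops the chunk before the first '|gb|', so it returns an empty list when the marker never appears, and it collects every ID rather than only the first one.

```python
def ids_after_gb(s):
    """Return every ID that directly follows a '|gb|' marker in s."""
    # Everything before the first '|gb|' is not an ID, so drop chunk 0.
    chunks = s.split('|gb|')[1:]
    # Each remaining chunk starts with the ID, which ends at the next '|'.
    return [chunk.split('|')[0] for chunk in chunks]


s = '>gi|148694536|gb|EDL26483.1| mCG128548 >gi|223460980|gb|AAI37799.1| Ibtk protein'
print(ids_after_gb(s))                 # ['EDL26483.1', 'AAI37799.1']
print(ids_after_gb('no marker here'))  # []
```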

score 1 · Accepted Answer

>>> import re
>>> match_object = re.findall(r"\|gb\|(.*?)\|", ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]")
>>> print(match_object)
['EDL26483.1', 'AAI37799.1']

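The regular expression says: "match any character (.) repeatedly (*), but as few times as possible (?), and save only that group (the parentheses). The match must come right after a '|gb|' and right before another '|'."

I used "\|" because the "|" character on its own denotes alternation in a regular expression.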
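
If it helps, the same pattern can be written out piece by piece with re.VERBOSE, which lets you put comments inside the pattern. This is only a sketch; the names GB_ID and sample are made up for illustration, and the sample is a shortened piece of the text from the question.

```python
import re

# Same pattern as above, spelled out one piece per line.
GB_ID = re.compile(r"""
    \|gb\|    # a literal '|gb|' (each pipe escaped, so it is not alternation)
    (.*?)     # capture as few characters as possible: this is the ID
    \|        # stop at the next literal pipe
""", re.VERBOSE)

sample = ">gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus]"
print(GB_ID.findall(sample))  # ['EDL26483.1']
```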

score 0 · Accepted Answer

Assuming a is the variable holding the string...

>>> import re
>>> a = ">gi|124486857|ref|NP_001074751.1| ..."
>>> re.findall(r"(?:\|gb\|)([a-zA-Z0-9.]+)(?:\|)", a)
['EDL26483.1', 'AAI37799.1']

score 0 · Accepted Answer

In [1]: import re

In [2]: text = ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]"

In [3]: re.findall(r'gb\|([^\|]+)', text)[0]
Out[3]: 'EDL26483.1'
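
Note that findall(...)[0] raises IndexError when the text contains no 'gb|' entry at all. If only the first ID is wanted, a variant with re.search might be safer. The sketch below assumes returning None is acceptable when nothing matches; the function name first_gb_id is made up for illustration.

```python
import re

def first_gb_id(text):
    """Return the first ID that follows 'gb|', or None if there is none."""
    match = re.search(r'gb\|([^|]+)', text)
    return match.group(1) if match else None


print(first_gb_id('>gi|148694536|gb|EDL26483.1| mCG128548'))  # EDL26483.1
print(first_gb_id('>gi|124486857|ref|NP_001074751.1|'))       # None
```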

score 0 · Accepted Answer

re.findall(r'gi\|([0-9]+)\|', '''>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]''')

works for me:

['124486857', '341941060', '148694536', '223460980']
