python - Using Python to distinguish lines with one dot from lines with two dots

Question

I have a large file that I want to format in a specific way. Example of the input file:

DVL1    03220   NP_004412.2 VANGL2  02758   Q9ULK5  in vitro    12490194
PAX3    09421   NP_852124.1 MEOX2   02760   NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254.1  in vitro;in vivo    15195140

And this is how I want it to look:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

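To summarize:

If a line has one dot, the dot is removed together with the number that follows it, and a \t is added, so that the output line contains only 6 tab-separated values.
If a line has two dots, those dots are removed together with the numbers that follow them, and a \t is added, so that the output line contains only 6 tab-separated values.
If a line has no dots, keep the first 6 tab-separated values.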

My current idea is something like this:

for line in infile:
    if "." in line: # thought about this and a line.count('.') might be better, just wasn't capable to make it work
        transformed_line = line.replace('.', '\t', 2) # only replaces the dot; want to replace dot plus next first character
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n') # if i had a way to know the position of the dot(s), i could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n') # this is fine

I hope I have explained myself well. Thank you all for your efforts.

score 3 · Accepted Answer

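import re

with open(filename, 'r') as f:
    # First pass: strip every ".<digits>" suffix from each line.
    newlines = (re.sub(r'\.\d+', '', old_line) for old_line in f)
    # Second pass: keep only the first six whitespace-separated fields.
    newlines = ['\t'.join(line.split()[:6]) for line in newlines]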

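Now you have a list of lines with the ".number" part removed. As far as I can tell, your problem is not constrained enough to do all of this in a single pass with a regular expression, but it works in two.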
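As a quick check of the two passes, here is a minimal sketch that runs them on the three sample lines from the question. It assumes the input is tab-separated, as the question states, and the helper name `strip_versions` is hypothetical, introduced only for illustration.

```python
import re

# Sample lines from the question, assumed to be tab-separated.
SAMPLE = [
    "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n",
    "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2\tin vitro;yeast 2-hybrid\t11423130\n",
    "VANGL2\t02758\tQ9ULK5\tMAGI3\t11290\tNP_001136254.1\tin vitro;in vivo\t15195140\n",
]

def strip_versions(lines):
    """Hypothetical helper: drop ".<digits>" suffixes, keep the first six fields."""
    # Pass 1: remove every dot that is followed by one or more digits.
    stripped = (re.sub(r'\.\d+', '', line) for line in lines)
    # Pass 2: split on any whitespace and keep the first six fields.
    return ['\t'.join(line.split()[:6]) for line in stripped]

for row in strip_versions(SAMPLE):
    print(row)
```

Because `split()` with no argument splits on any whitespace, this relies on the first six fields containing no spaces, which holds for the sample data.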

score 2 · Accepted Answer

You can try something like this:

    with open('data1.txt') as f:
        for line in f:
            line=line.split()[:6]
            # If a field contains '.', keep only the part before the first dot;
            # otherwise keep the field as it is.
            line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)
            print('\t'.join(line))

Output:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

Edit:

As @mgilson suggested, the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line) can simply be replaced with line=map(lambda x:x.split('.')[0],line)
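The two forms give the same result because `str.split('.')[0]` returns the whole string when it contains no dot. A small sketch comparing them on the fields of the first sample line (the function names `cut_with_index` and `cut_with_split` are hypothetical, used only for this comparison):

```python
def cut_with_index(field):
    # Original form: slice up to the first dot, if there is one.
    return field[:field.index('.')] if '.' in field else field

def cut_with_split(field):
    # Suggested form: split on dots and keep the first piece.
    return field.split('.')[0]

fields = ["DVL1", "03220", "NP_004412.2", "VANGL2", "02758", "Q9ULK5"]

# Both forms agree on every field, with or without a dot.
assert [cut_with_index(f) for f in fields] == [cut_with_split(f) for f in fields]
print('\t'.join(cut_with_split(f) for f in fields))
```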

score 1 · Accepted Answer

I figured someone ought to do this with a single regular expression, so...

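import re

# One pattern for the whole line: six fields, with an optional ".<digits>"
# suffix dropped from the third and sixth. Assumes the seventh field starts with "in".
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')
with open('data.txt') as infile:
    for line in infile:
        match = beast_regex.match(line)
        if match:  # skip lines that do not fit the pattern
            print('\t'.join(match.groups()))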

score 0 · Accepted Answer

You can do this with a simple regular expression:

import re

for line in infile:
    line = re.sub(r'\.\d+', '\t', line)
    # These two statements must run once per line, so they belong inside the loop.
    columns = line.split('\t')
    outfile.write('\t'.join(columns[:5]) + '\n')

This replaces each "." followed by one or more digits with a tab character.
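One thing worth checking with this approach: if the input is already tab-separated, as the question states, replacing the ".number" suffix with a tab leaves two consecutive tabs and therefore an empty field. The sketch below compares replacing with a tab against replacing with an empty string on the first sample line. The variable names are illustrative, and the tab-separated sample string is an assumption based on the question's description.

```python
import re

# First sample line from the question, assumed to be tab-separated.
line = "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"

# Replacing ".2" with a tab leaves an empty field before VANGL2.
with_tab = re.sub(r'\.\d+', '\t', line).split('\t')
# Replacing ".2" with nothing keeps the original six fields aligned.
with_empty = re.sub(r'\.\d+', '', line).split('\t')

print(with_tab[:6])
print(with_empty[:6])
```

With an empty-string replacement, slicing `[:6]` yields the six values the question asks for.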
