python - Parsing a PDB file when the format changes

Question

I have a file like the following.

ATOM   7748  CG2 ILE A 999      53.647  54.338  82.768  1.00 82.10           C  
ATOM   7749  CD1 ILE A 999      51.224  54.016  84.367  1.00 83.16           C  
ATOM   7750  N   ASN A1000      55.338  57.542  83.643  1.00 80.67           N  
ATOM   7751  CA  ASN A1000      56.604  58.163  83.297  1.00 80.45           C  
ATOM   7752  C   ASN A1000      57.517  58.266  84.501  1.00 80.30           C

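As you can see, the " " between columns 4 and 5 (counting from 0) disappears once the residue number reaches four digits, so the code below fails. I'm new to Python (three days in total!) and was wondering what the best way to handle this is. line.split() works as long as the space is there. Do I need to count characters and then parse the string by absolute position?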

import string
visited = {}
outputfile = open(file_output_location, "w")
for line in open(file_input_location, "r"):
    list = line.split()
    id = list[0]
    if id == "ATOM":
        type = list[2]
        if type == "CA":
            residue = list[3]
            if len(residue) == 4:
                residue = residue[1:]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:9]
            if(atom_count >= 1):
                if atom_count not in visited and type_of_chain == chain_required:
                    visited[atom_count] = 1
                    result_line = " ".join([residue,str(atom_count),type_of_chain," ".join(position)])
                    print result_line
                    print >>outputfile, result_line
outputfile.close()

score 1 · Accepted Answer

Use string slicing.

print '0123456789'[3:6]
345

Note the asymmetry: the first number is the 0-based index of the first character you want, while the second number is the 0-based index of the first character you no longer want.
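As a rough illustration of how this applies to the file in the question, the sketch below takes one of the sample lines and compares split() with slicing. It is written for Python 3, and the slice positions [21:22] and [22:26] are assumed from the standard PDB ATOM record layout (chain ID in column 22, residue number in columns 23-26, counting from 1); the variable names are only for illustration.

```python
# Sample ATOM record from the question, where chain ID and residue number run together.
line = "ATOM   7751  CA  ASN A1000      56.604  58.163  83.297  1.00 80.45           C  "

# split() yields the fused token "A1000", so every later field shifts by one.
fields = line.split()
print(fields[4])  # A1000

# Slicing by character position is unaffected by the missing space.
chain_id = line[21:22]             # index 21 only (end index is excluded)
residue_number = int(line[22:26])  # indices 22, 23, 24 and 25
print(chain_id, residue_number)    # A 1000
```

Because the end index is excluded, adjacent fields can be cut with slices that share a boundary, such as [21:22] followed by [22:26].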

score 1 · Accepted Answer

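PDB files appear to be fixed-column-width files rather than space-delimited ones. So if you have to parse them by hand (rather than using an existing tool such as pdb-tools), you'll need to chop up each line with something more along these lines: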

id = line[0:6].strip()     # record name, e.g. "ATOM" (columns 1-6)
type = line[12:16].strip() # atom name, e.g. "CA" (columns 13-16)
# ad nauseam
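Carrying that idea through the remaining fields, a minimal sketch of the question's loop using fixed-width slices might look like the following. It is written for Python 3, the column ranges are taken from the standard PDB ATOM record layout rather than from the question, and the function names parse_atom_line and ca_lines are hypothetical helpers introduced only for this example.

```python
def parse_atom_line(line):
    """Return the fields of one ATOM record as a dict, or None for other records."""
    if line[0:6].strip() != "ATOM":
        return None
    return {
        "atom_name": line[12:16].strip(),    # columns 13-16
        "residue": line[17:20].strip(),      # columns 18-20
        "chain": line[21:22],                # column 22
        "residue_number": int(line[22:26]),  # columns 23-26
        # x, y, z coordinates occupy columns 31-38, 39-46 and 47-54
        "position": [line[30:38].strip(), line[38:46].strip(), line[46:54].strip()],
    }


def ca_lines(lines, chain_required):
    """Build one output line per residue of the requested chain, from its CA atom."""
    visited = set()
    results = []
    for line in lines:
        atom = parse_atom_line(line)
        if atom is None or atom["atom_name"] != "CA":
            continue
        number = atom["residue_number"]
        if number >= 1 and number not in visited and atom["chain"] == chain_required:
            visited.add(number)
            results.append(" ".join(
                [atom["residue"], str(number), atom["chain"]] + atom["position"]))
    return results


sample = [
    "ATOM   7750  N   ASN A1000      55.338  57.542  83.643  1.00 80.67           N  ",
    "ATOM   7751  CA  ASN A1000      56.604  58.163  83.297  1.00 80.45           C  ",
]
for result_line in ca_lines(sample, "A"):
    print(result_line)  # ASN 1000 A 56.604 58.163 83.297
```

Slicing the residue name from its own three columns also removes the need for the question's len(residue) == 4 check, since a character in the neighbouring column can no longer fuse with it.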

score 0 · Accepted Answer

It may be worth installing Biopython, since it includes a module for parsing PDB files.

I used the following code on your sample data.

from Bio.PDB.PDBParser import PDBParser

pdb_reader = PDBParser(PERMISSIVE=1)
structure_id="Test"
filename="Test.pdb" # Enter file name here or path to file.
structure = pdb_reader.get_structure(structure_id, filename)

model = structure[0]

for chain in model: # This will loop over every chain in Model
    for residue in chain:
        for atom in residue:
            if atom.get_name() == 'CA': # get_name strips spaces, use this over get_fullname() or get_id()
                print atom.get_id(), residue.get_resname(), residue.get_id()[1], chain.get_id(), atom.get_coord() 
                # Prints Atom Name, Residue Name, Residue number, Chain Name, Atom Co-Ordinates

This prints:

CA ASN 1000 A [ 56.60400009  58.1629982   83.29699707]

I then tried it on a larger protein with 14 chains (1aon.pdb), and it worked fine.
