python - Multiple dictionaries from one file in Python

Question

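I'm trying to write a Python script that takes a special kind of file as input.
The file contains information about multiple genes. The information for one gene spans several lines, and the number of lines is not the same for every gene. Here is an example: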

 gene            join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /db_xref="GeneID:5685236"
 CDS             join(373616..374161,1..174)
                 /locus_tag="AM1_A0001"
                 /codon_start=1
                 /transl_table=11
                 /product="glutathione S-transferase, putative"
                 /protein_id="YP_001520660.1"
                 /db_xref="GI:158339653"
                 /db_xref="GeneID:5685236"
                 /translation="MKIVSFKICPFVQRVTALLEAKGIDYDIEYIDLSHKPQWFLDLS
                 PNAQVPILITDDDDVLFESDAIVEFLDEVVGTPLSSDNAVKKAQDRAWSYLATKHYLV
                 QCSAQRSPDAKTLEERSKKLSKAFGKIKVQLGESRYINGDDLSMVDIAWLPLLHRAAI
                 IEQYSGYDFLEEFPKVKQWQQHLLSTGIAEKSVPEDFEERFTAFYLAESTCLGQLAKS
                 KNGEACCGTAECTVDDLGCCA"
 gene            241..381
                 /locus_tag="AM1_A0002"
                 /db_xref="GeneID:5685411"
 CDS             241..381
                 /locus_tag="AM1_A0002"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520661.1"
                 /db_xref="GI:158339654"
                 /db_xref="GeneID:5685411"
                 /translation="MLINPEDKQVEIYRPGQDVELLQSPSTISGADVLPEFSLNLEWI
                 WR"
 gene            388..525
                 /locus_tag="AM1_A0003"
                 /db_xref="GeneID:5685412"
 CDS             388..525
                 /locus_tag="AM1_A0003"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_001520662.1"
                 /db_xref="GI:158339655"
                 /db_xref="GeneID:5685412"
                 /translation="MKEAGFSENSRSREGQPKLAKDAAIAKPYLVAMTAELQIMATET
                 L"

What I want is a list of dictionaries in which each dictionary holds the information about one gene, like this:

gene_1 = {"locus": /locus_tag, "product": /product, ...}
gene_2 = {"locus": /locus_tag, "product": /product, ...}

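I have no idea how to tell Python when one gene/dictionary ends and the next one should begin.
Can anyone help? Is there a way to do this?

To clarify: I know how to extract the information I need, store it in variables, and put it into a dictionary. What I don't know is how to tell Python to create one dictionary per gene.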

score 1 · Accepted Answer

This may not be very good, but I put together a working parser in pure Python. Maybe it can at least serve as a basic idea:

import re
import pprint
printer = pprint.PrettyPrinter(indent=4)

with open("entities.txt", "r") as file_obj:
    entities = list()

    for line in file_obj:
        line = line.rstrip('\n')

        if re.match(r'\s*(gene|CDS)\s+[\w(\.,)]+', line):
            parts = line.split()
            entity = {parts[0]: parts[1]}
            entities.append(entity)
        else:
            try:
                (attr_name,) = re.findall(r'/\w+=', line)
                attr_name = attr_name.strip('/=')
            except ValueError:
                # No "/name=" on this line: it continues the previous multi-line value
                addition = line.strip().strip('"')
                entity[last_key] = ''.join([entity[last_key], addition])
            else:
                try:
                    # First line of a multi-line value: opening quote, no closing quote
                    (attr_value,) = re.findall(r'="\w+$', line)
                    last_key = attr_name
                except ValueError:
                    try:
                        (attr_value,) = re.findall(r'="[\w\s\.:,-]+"', line)
                    except ValueError:
                        (attr_value,) = re.findall(r'=\d+$', line)

                # Drop the leading '=' and any surrounding quotes in every case
                attr_value = attr_value.strip('"=')

                if attr_name in entity:
                    entity[attr_name] = [entity[attr_name], attr_value]
                else:
                    entity[attr_name] = attr_value

printer.pprint(entities)
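
The parser above yields one dictionary per feature line, so the gene and CDS entries of the same locus end up in separate dictionaries. Below is a minimal sketch of the one-dictionary-per-gene variant the question asks for. It assumes that every record begins with a gene line and that the qualifiers of the following CDS should be merged into that gene's dictionary. SAMPLE, parse_genes, and the "location" key are names made up for this illustration, and repeated qualifiers such as /db_xref simply keep the last value seen.

```python
import re

# Hypothetical sample in the same layout as the file in the question.
SAMPLE = '''\
 gene            241..381
                 /locus_tag="AM1_A0002"
                 /db_xref="GeneID:5685411"
 CDS             241..381
                 /locus_tag="AM1_A0002"
                 /product="hypothetical protein"
                 /translation="MLINPEDKQVEIYRPGQDVELLQSPSTISGADVLPEFSLNLEWI
                 WR"
 gene            388..525
                 /locus_tag="AM1_A0003"
 CDS             388..525
                 /locus_tag="AM1_A0003"
                 /product="hypothetical protein"
'''

FEATURE = re.compile(r'^\s*(gene|CDS)\s+(\S+)$')   # e.g. " gene   241..381"
QUALIFIER = re.compile(r'^\s*/(\w+)=(.*)$')        # e.g. '   /locus_tag="X"'


def parse_genes(lines):
    """Return one dict per gene; a new dict starts at every 'gene' line."""
    genes = []
    current = None    # dictionary currently being filled
    last_key = None   # qualifier that a continuation line belongs to
    for raw in lines:
        line = raw.rstrip('\n')
        feature = FEATURE.match(line)
        if feature:
            kind, location = feature.groups()
            if kind == 'gene':
                # A 'gene' line closes the previous record and opens a new one.
                current = {'location': location}
                genes.append(current)
            last_key = None
            continue
        if current is None:
            continue  # ignore anything before the first 'gene' line
        qualifier = QUALIFIER.match(line)
        if qualifier:
            last_key, value = qualifier.groups()
            current[last_key] = value.strip('"')
        elif last_key is not None:
            # Continuation of a multi-line value such as /translation
            current[last_key] += line.strip().strip('"')
    return genes


genes = parse_genes(SAMPLE.splitlines())
```

With a real file, the same function could be called as parse_genes(open(path)), since it only iterates over lines. The key point is that the gene line itself acts as the record separator, so no look-ahead is needed.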

score 0 · Accepted Answer

In case anyone is interested in the beginner-level solution I came up with thanks to the comments I received, here it is:

# Read the whole annotation file into one string
# (file() exists only in Python 2; open() works in both Python 2 and 3)
with open("example.embl", "r") as annot:
    embl = annot.read()

annotation = []

embl_list = embl.split("FT   gen")

for item in embl_list:
    if "e            " in item:
        split_item = item.split("\n")
        for l in split_item:
            if "e            " in l:
                if not "complement" in l:
                    coordinates = l[13:len(l)]
                    C = coordinates.split("..")
                    genestart = C[0]
                    geneend = C[1]
                    strand = "+"
                if "complement" in l:
                    coordinates = l[24:len(l)-1]
                    C = coordinates.split("..")
                    genestart = C[0]
                    geneend = C[1]
                    strand = "-"

            if "/locus_tag" in l:
                L = l.split('"')
                locus = L[1]

            if "/product" in l:
                P = l.split('"')
                product = P[1]

        annotation.append({
            "locus": locus,
            "genestart": genestart,
            "geneend": geneend,
            "product": product,
        })
    else:
        print("Finished!")
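
The slicing by fixed column offsets above works for plain and complement(...) locations, but not for join(...) locations like the first gene in the question. As a rough alternative, here is a sketch that pulls the coordinates out with a regular expression. parse_location is a hypothetical helper, not part of the code above, and it assumes only plain start..end ranges, optionally wrapped in complement(...) or join(...). Partial locations marked with < or > and single-base locations are not handled.

```python
import re


def parse_location(location):
    """Split a location string into its start..end spans and a strand."""
    # Assumption: any 'complement(' wrapper means the minus strand.
    strand = '-' if 'complement(' in location else '+'
    # Collect every 'start..end' pair; join(...) locations contain several.
    spans = [(int(start), int(end))
             for start, end in re.findall(r'(\d+)\.\.(\d+)', location)]
    if not spans:
        raise ValueError('no start..end range in %r' % location)
    return {'spans': spans, 'strand': strand}


plain = parse_location('241..381')
reverse = parse_location('complement(241..381)')
joined = parse_location('join(373616..374161,1..174)')
```

Returning a list of spans instead of a single genestart/geneend pair leaves it to the caller to decide how to treat genes that wrap around the origin.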
