python - Python - Reading data from a file with variable attributes and line lengths

Question

I'm trying to find the best way to parse a file in Python and create a list of namedtuples, where each tuple represents a single data entity and its attributes. The data looks like this:

UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality

UI: T145   
RL: exhibits   
ABR: EX   
RIN: exhibited_by   
RTN: R3.3.2   
DEF: Shows or demonstrates.   
HL: {isa} performs   
STL: [Animal|Behavior]; [Group|Behavior]   

UI: etc...

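Some attributes are shared (such as UI), but others are not (such as STY). However, I can hardcode a complete list of the ones I need.
Each group is separated by a blank line, so I used split to break the data into chunks that can be processed individually.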

input = file.read().split("\n\n")
for chunk in input:
     process(chunk)

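I've seen several approaches that use string find/splice, itertools.groupby, and even regular expressions. I was thinking of running a regex like "[A-Z]*:" to find where the headers are, but I'm not sure how to pull out the multiple lines that follow until another header is reached (such as the multi-line data after DEF in the first example entity).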

Any suggestions would be appreciated.

score 2 · Accepted Answer

source = """
UI: T020  
STY: Acquired Abnormality  
ABR: acab   
STN: A1.2.2.2  
DEF: An abnormal structure, or one that is abnormal in size or location, found   
in or deriving from a previously normal structure.  Acquired abnormalities are  
distinguished from diseases even though they may result in pathological   
functioning (e.g., "hernias incarcerate").   
HL: {isa} Anatomical Abnormality
"""

inpt = source.split("\n")  #just emulating file

import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")
output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line) #check if we hit the CODE: Content line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current #if so - update the current_key with contents
        current_key = line_match.group(1)   
        current = line_match.group(2)
    else:
        current = current + line   #if it's not - it should be the continuation of previous key line

output[current_key] = current #don't forget the last guy
print(output)
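
The loop above handles a single entity. Below is a minimal sketch of how the same idea might be extended to a whole file, assuming entities are separated by blank lines as described in the question. The name `parse_entities` is hypothetical, and joining continuation lines with a single space is an assumption about the desired output, not something the question specifies.

```python
import re

# Same header pattern as above: two or three capital letters, then a colon.
HEADER = re.compile(r"^([A-Z]{2,3}):(.*)$")

def parse_entities(text):
    """Return a list of dicts, one per blank-line-separated entity."""
    entities = []
    # Split on blank lines (lines that are empty or whitespace-only).
    for chunk in re.split(r"\n\s*\n", text.strip()):
        entity = {}
        key = None
        for line in chunk.splitlines():
            match = HEADER.match(line)
            if match:
                key = match.group(1)
                entity[key] = match.group(2).strip()
            elif key is not None:
                # Continuation line: append to the most recent key.
                entity[key] += " " + line.strip()
        if entity:
            entities.append(entity)
    return entities

sample = """UI: T020
DEF: An abnormal structure, found
in or deriving from a previously normal structure.

UI: T145
RL: exhibits
"""

for entity in parse_entities(sample):
    print(entity)
```

Each dict keeps only the keys that actually appear in its chunk, so entities with different attribute sets can share one list.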

score 2 · Accepted Answer

I assumed that when a string spans multiple lines, you want to replace the newlines with spaces (and remove the extra spaces).

import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s') # Matches line header
    tmp = '' # Stored/cached data for multiline string
    key = None # Current key
    data = {}

    with open(filename,'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)

            # Matches header or is end, put string to list:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''

            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}

                continue

            # We do have header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue

            # No header, just append string -> here goes assumption that you want to
            # remove newlines, trailing spaces and replace them with one single space
            tmp += ' ' + row

    # Missed row?
    if key:
        data[key] = tmp

    # Missed group?
    if data:
        yield data

On each iteration, this generator yields a dict of pairs such as UI: T020 (always at least one item).

Because it uses a generator and reads the file sequentially, it should be efficient even for large files, and it never reads the whole file into memory at once.

Here's a small demo:

for data in process_file('data.txt'):
    print('-'*20)
    for i in data:
        print('%s:'%(i), data[i])

    print()

And the actual output:

--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure.  Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab

--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
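
The claim that the generator never reads the whole file at once can be checked with a small experiment. The sketch below is a simplified stand-in for the generator in this answer, not a copy of it: `iter_records` is a hypothetical name, it accepts any iterable of lines, and an `io.StringIO` object plays the role of the file. After taking only two records, the third record should still be unread.

```python
import io
import itertools
import re

HEADER = re.compile(r"(\w{2,3}):\s")

def iter_records(lines):
    """Yield one dict per blank-line-separated record from an iterable of lines."""
    record, key = {}, None
    for row in lines:
        row = row.rstrip()
        if not row:
            # Blank line: emit the finished record, if any, and start over.
            if record:
                yield record
            record, key = {}, None
            continue
        match = HEADER.match(row)
        if match:
            key = match.group(1)
            record[key] = row[match.end():]
        elif key is not None:
            # Continuation line: join with a single space.
            record[key] += " " + row
    if record:
        yield record

fake_file = io.StringIO("UI: T020\nABR: acab\n\nUI: T145\nABR: EX\n\nUI: etc...\n")

# Pull only the first two records; the generator stops reading after that.
first_two = list(itertools.islice(iter_records(fake_file), 2))
print(first_two)

# The third record has not been consumed yet.
remaining = fake_file.readline()
print(repr(remaining))
```

Taking an iterable of lines rather than a filename also makes the parser easy to exercise without touching the disk.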

score 0 · Accepted Answer

import re
from collections import namedtuple

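def process(chunk):
    # re.split with a capturing group returns ['', key1, value1, key2, value2, ...]
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    # Walk the (key, value) pairs, skipping the leading empty string
    for i in range(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i+1].strip()
    # The tuple type is named after the first header in the chunk (e.g. 'UI')
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)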

should do it. I think I'd just go with the dict, though. Why are you so attached to a namedtuple?
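
If a namedtuple is still wanted, the question notes that the full list of attributes can be hardcoded. A minimal sketch under that assumption: one tuple type with every field defaulting to None, so an entity missing some attributes can still be built from a dict. `FIELDS`, `Entity`, and `to_entity` are hypothetical names, the field list covers only the attributes visible in the question's two sample entities, and the `defaults` argument requires Python 3.7 or later.

```python
from collections import namedtuple

# Assumed field list, taken from the two sample entities in the question.
FIELDS = ["UI", "STY", "ABR", "STN", "DEF", "HL", "RL", "RIN", "RTN", "STL"]

# Every field defaults to None, so missing attributes are allowed.
Entity = namedtuple("Entity", FIELDS, defaults=[None] * len(FIELDS))

def to_entity(record):
    """Build an Entity from a dict of parsed attributes."""
    return Entity(**record)

acquired = to_entity({"UI": "T020", "STY": "Acquired Abnormality", "ABR": "acab"})
exhibits = to_entity({"UI": "T145", "RL": "exhibits", "ABR": "EX"})

print(acquired.STY, acquired.RL)
print(exhibits.RL, exhibits.STY)
```

A key that is not in `FIELDS` raises a TypeError, which surfaces unexpected attributes early instead of silently dropping them.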
