
I'm stuck in a script I have to write and can't find a way out...

I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files. The first is simply a table with IDs and group information (which is used for the splitting). The other contains the same IDs, but each twice with slightly different information.

What I'm doing: I create a list of lists with ID and group information, like this:

table = [[ID, group], [ID, group], [ID, group], ...]

Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as an index. In this index, I would like to save each ID and where it can be found inside the file, so I can quickly jump there later. The problem, of course, is that every ID appears twice. My simple solution (which I have doubts about) is to add an -a or -b to the ID:

index = {"ID1-a": [FPos, length], "ID1-b": [FPos, length], "ID2-a": [FPos, length], ...}

The code for this:

for line in file:
    read = (line.split("\t"))[0]
    if not (read+"-a") in indices:
        index = read + "-a"
        length = len(line)
        indices[index] = [FPos, length]
    else:
        index = read + "-b"
        length = len(line)
        indices[index] =  [FPos, length]
    FPos += length

What I am wondering now is if the next step is actually valid (I don't get errors, but I have some doubts about the output files).

for name in table:
    head = name[0]
    ## first round
    (FPos,length) = indices[head+"-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["@" + head +" "+ "1:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
    name.append(output)
    ##second round
    (FPos,length) = indices[head+"-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["@" + head +" "+ "2:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
    name.append(output)

Is it ok to use a for loop like that?

Is there a better, cleaner way to do this?

1 Answer

Use a defaultdict(list) to store all the file offsets by ID.

from collections import defaultdict

index = defaultdict(list)

for line in file:
    # ...code that loops through file finding ID lines...
    index[id_value].append((fileposn,length))

The defaultdict takes care of initializing an empty list the first time a given id_value appears; the (fileposn, length) tuple is then appended to it.
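As a small illustration of that behaviour, here is a sketch with made-up IDs and offsets (the `records` list is hypothetical and only stands in for values you would compute while scanning the file):

```python
from collections import defaultdict

index = defaultdict(list)

# Hypothetical (id, file position, length) triples, just for illustration.
records = [("read1", 0, 40), ("read2", 40, 38), ("read1", 78, 41)]

for id_value, fileposn, length in records:
    # Accessing a missing key creates an empty list automatically,
    # so no "if key not in index" check is needed.
    index[id_value].append((fileposn, length))

print(dict(index))
```

Each ID ends up with a list of positions in the order they were seen, so the first entry plays the role of "-a" and the second the role of "-b".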

This way, every reference to each ID accumulates in the index, whether there are 1, 2, or 20 of them. You can then seek to the stored file positions to find the related data.
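Putting the two steps together, a minimal sketch might look like the following. It assumes the ID is the first tab-separated field and that the file is ASCII; the helper names `build_index` and `read_records` and the tiny demo file are invented for this example. The file is opened in binary mode so that `len(line)` is a byte count that matches what `seek()` expects; in text mode, newline translation or multi-byte characters can make the offsets drift.

```python
import os
import tempfile
from collections import defaultdict


def build_index(path):
    """Map each ID (first tab-separated field) to a list of (offset, length)."""
    index = defaultdict(list)
    fileposn = 0
    # Binary mode, so len(line) is a byte count that matches seek() offsets.
    with open(path, "rb") as handle:
        for line in handle:
            id_value = line.split(b"\t", 1)[0].decode("ascii")
            index[id_value].append((fileposn, len(line)))
            fileposn += len(line)
    return index


def read_records(path, index, id_value):
    """Return every line stored for id_value, in file order, split into fields."""
    records = []
    with open(path, "rb") as handle:
        for fileposn, length in index.get(id_value, []):
            handle.seek(fileposn)
            line = handle.read(length).decode("ascii").rstrip("\r\n")
            records.append(line.split("\t"))
    return records


# Tiny made-up file: each ID appears twice, as in the question.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(b"read1\tACGT\tIIII\n")
    tmp.write(b"read2\tTTGA\tHHHH\n")
    tmp.write(b"read1\tGGCC\tJJJJ\n")
    tmp.write(b"read2\tCATG\tGGGG\n")
    demo_path = tmp.name

demo_index = build_index(demo_path)
first, second = read_records(demo_path, demo_index, "read1")
print(first, second)
os.remove(demo_path)
```

With this layout the loop over your table only needs `index[head]`, and the "1:N:0:" and "2:N:0:" records come from the first and second entries of that list, so the -a/-b suffixes are no longer necessary.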

answered 2013-01-04T15:06:49.390