python - Barcode splitting of a large text file and the corresponding lines of another text file (Python)

Question

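I'm trying to get better at Python. There are several tools that already do this, but I want to do it myself for two reasons:

To learn better ways of doing it
Flexibility in how the data is handled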

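I have two text files of exactly the same size, with the same number of lines. I need to check lines 2, 6, ... (every fourth line after that) of the first file, look at the start of that line, and see whether it is similar to a predefined text. If it is, I write that line together with its block of 4 lines to a corresponding output file, and write the same lines from the other file to another corresponding output file. (For those to whom this sounds familiar: I'm trying to demultiplex barcode data from Illumina paired-end sequencing data.)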

I already have working code, but the problem is that it takes days to complete. It takes about 10 minutes for 100,000 lines, and I have 200 million lines.

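I'm posting the code here along with my reasoning. I have 100 keys, say ATCCGG, ACCTGG, and so on. However, if a sequence has exactly one mismatch with a key, I still treat it as a match. For example, DOG would also accept AOG, BOG, DIG, DAG, DOF, DOH, etc.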

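def makehamming2(text,dist):
    # Map every sequence that differs from `text` at exactly one position
    # back to `text`. Only dist==1 is handled; `text` itself is not added.
    dicthamming=dict()
    rep=["A","T","C","G"]

    if dist==1:
        for i in range(len(text)):
            for j in range(len(rep)):
                chars=list(text)
                if rep[j]!=chars[i]:
                    chars[i]=rep[j]
                    word="".join(chars)
                    dicthamming[word]=text
    return dicthamming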

I'm using dist=1.

I call this function for each of the 100 barcodes, so the dictionary has about 100*18 items.
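As a rough illustration of how such a lookup table could be assembled for the whole barcode set, here is a minimal Python 3 sketch. The names `hamming1_variants` and `build_search_dict` are hypothetical and not part of the original script. Two choices here are assumptions rather than anything stated in the question: the exact barcode is stored alongside its variants, and a variant that is one mismatch away from two different barcodes is dropped as ambiguous.

```python
def hamming1_variants(barcode, alphabet="ACGT"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in alphabet:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def build_search_dict(barcodes):
    """Map each barcode and each unambiguous one-mismatch variant to its barcode."""
    exact = set(barcodes)
    lookup = {barcode: barcode for barcode in exact}
    ambiguous = set()
    for barcode in exact:
        for variant in hamming1_variants(barcode):
            if variant in exact:
                continue  # an exact barcode always wins over a mismatch
            owner = lookup.setdefault(variant, barcode)
            if owner != barcode:
                ambiguous.add(variant)  # within one mismatch of two barcodes
    for variant in ambiguous:
        del lookup[variant]
    return lookup
```

A 6-base barcode has 6 * 3 = 18 one-mismatch variants, which matches the 100*18 estimate above. Whether ambiguous variants should be discarded or assigned to one barcode is a design decision for the pipeline.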

count=0
eachline=1      # position (1-4) of the current line within its 4-line record
writeflag=0     # set to 1 when the barcode of the current record is recognised
seqlen=int(seqlen)
cutlen=len(cutsite)
infile=open(inf, "r")
for line in infile:
    count+=1
    if eachline==1:
        # line 1 of the record: header
        writeflag=0
        header=line
        eachline=2
    elif eachline==2:
        # line 2 of the record: sequence, whose first 6 bases are the barcode
        eachline=3
        line=line.strip()
        if line[0:6] in searchdict.keys():
            barcode=searchdict[line[0:6]]
            towritefile=outfile+"/"+barcode+".fastq"
            seq=line[6:seqlen+6]
            qualstart=6
            writeflag=1
            seqeach[barcode]=seqeach.get(barcode,0)+1
    elif eachline==3:
        # line 3 of the record: separator
        eachline=4
        third=line
    elif eachline==4:
        # line 4 of the record: quality string
        eachline=1
        line=line.strip()
        if writeflag==1:
            qualline=line[qualstart:qualstart+seqlen]
            addToBuffer=header+seq+"\n"+third+qualline+"\n"
            bufferdict[towritefile]=bufferdict.get(towritefile,"")+addToBuffer

            # fetch the matching 4-line record from the second file
            Fourlinesofpair=getfrompair(inf2,count, seqlen)
            bufferdictpair[towritefile[:-6:]+"_2.fastq"]=\
                bufferdictpair.get(towritefile[:-6:]+"_2.fastq","")+Fourlinesofpair

            if (count/4)%10000==0:
                # flush both buffers to disk
                print "writing" , str((count/4))
                for key, val in bufferdict.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    writefile1.close()
                bufferdict=dict()

                for key, val in bufferdictpair.items():
                    writefile1=open(key,"a")
                    writefile1.write(val)
                    writefile1.close()
                bufferdictpair=dict()

                end=(time.time()-start)/60.0
                print "finished writing", str(end) , "minutes"

# write whatever is still buffered once the input is exhausted
print "writing" , str(count/4)
for key, val in bufferdict.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    writefile1.close()
bufferdict=dict()
for key, val in bufferdictpair.items():
    writefile1=open(key,"a")
    writefile1.write(val)
    writefile1.close()
bufferdictpair=dict()

end=(time.time()-start)/60.0
print "finished writing", str(end) , "minutes"

where getfrompair is the following function:

def getfrompair(inf2, linenum, length):
    # Re-opens the second file and scans from the top until the 4-line
    # record ending at line `linenum` has been collected.
    info=open(inf2,"r")
    content=""
    for count, line in enumerate(info):
        #print str(count)
        if count == linenum-4:
            content=line
        if count == linenum-3:
            content=content+line.strip()[:length]+"\n"
        if count == linenum-2:
            content=content+line
        if count == linenum-1:
            content=content+line.strip()[:length]+"\n"
            #print str(count), content
            return content

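So my main question is: how can I optimize this? Most of the time this code will run on machines with at least 8 GB of memory and more than 4 processor cores. Can I use multiprocessing? I used the buffers here, following a suggestion in another thread, because that was faster than writing to disk after every line.

Thanks for any advice.

Edit 1: Following Ignacio's suggestion, I profiled the code, and the getfrompair function takes more than half of the run time. Is there a better way to get specific lines from a file without going through every line each time?

Profile results from a fraction of the data (10,000 lines rather than the original 800 million lines):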

     68719 function calls in 2.902 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       66    0.000    0.000    0.000    0.000 :0(append)
       32    0.003    0.000    0.003    0.000 :0(close)
     2199    0.007    0.000    0.007    0.000 :0(get)
        8    0.002    0.000    0.002    0.000 :0(items)
        3    0.000    0.000    0.000    0.000 :0(iteritems)
      750    0.001    0.000    0.001    0.000 :0(join)
     7193    0.349    0.000    0.349    0.000 :0(keys)
    39977    0.028    0.000    0.028    0.000 :0(len)
        1    0.000    0.000    0.000    0.000 :0(mkdir)
      767    0.045    0.000    0.045    0.000 :0(open)
      300    0.000    0.000    0.000    0.000 :0(range)
        1    0.005    0.005    0.005    0.005 :0(setprofile)
       96    0.000    0.000    0.000    0.000 :0(split)
        1    0.000    0.000    0.000    0.000 :0(startswith)
        1    0.000    0.000    0.000    0.000 :0(stat)
     6562    0.016    0.000    0.016    0.000 :0(strip)
        4    0.000    0.000    0.000    0.000 :0(time)
       48    0.000    0.000    0.000    0.000 :0(update)
       46    0.004    0.000    0.004    0.000 :0(write)
      733    1.735    0.002    1.776    0.002 RC14100~.PY:273(getfrompair)
        1    0.653    0.653    2.889    2.889 RC14100~.PY:31(split)
        1    0.000    0.000    0.000    0.000 RC14100~.PY:313(makehamming)
        1    0.000    0.000    0.005    0.005 RC14100~.PY:329(processbc2)
       48    0.003    0.000    0.005    0.000 RC14100~.PY:344(makehamming2)
        1    0.006    0.006    2.896    2.896 RC14100~.PY:4(<module>)
     4553    0.015    0.000    0.025    0.000 RC14100~.PY:74(<genexpr>)
     2659    0.014    0.000    0.023    0.000 RC14100~.PY:75(<genexpr>)
     2659    0.013    0.000    0.023    0.000 RC14100~.PY:76(<genexpr>)
        1    0.001    0.001    2.890    2.890 RC14100~.PY:8(main)
        1    0.000    0.000    0.000    0.000 cProfile.py:5(<module>)
        1    0.000    0.000    0.000    0.000 cProfile.py:66(Profile)
        1    0.000    0.000    0.000    0.000 genericpath.py:15(exists)
        1    0.000    0.000    0.000    0.000 ntpath.py:122(splitdrive)
        1    0.000    0.000    0.000    0.000 ntpath.py:164(split)
        1    0.000    0.000    0.000    0.000 os.py:136(makedirs)
        1    0.000    0.000    2.902    2.902 profile:0(<code object <module> at 000000000211A9B0, file "RC14100~.PY", line 4>)
        0    0.000             0.000          profile:0(profiler)



Process "Profile" terminated, ExitCode: 00000000
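Besides getfrompair, the profile above lists 7193 calls to keys costing 0.349 s. That would be consistent with the test `line[0:6] in searchdict.keys()`: in Python 2, `keys()` builds a new list on every call and the membership test then scans that list, whereas testing against the dictionary itself is a hash lookup. The Python 3 sketch below illustrates the difference; `list(table)` stands in for the Python 2 behaviour, and `barcode_for` and `compare_lookups` are hypothetical helper names.

```python
import timeit


def barcode_for(sequence, searchdict, barcode_len=6):
    """Return the barcode for a read, or None, with a single hash lookup."""
    return searchdict.get(sequence[:barcode_len])


def compare_lookups(n_keys=1900, repeats=2000):
    """Time list-scan membership against direct dict membership."""
    table = {format(i, "06d"): "bc" for i in range(n_keys)}
    probe = format(n_keys - 1, "06d")
    # Emulates Python 2's `probe in table.keys()`: build a list, then scan it.
    list_scan = timeit.timeit(lambda: probe in list(table), number=repeats)
    # Direct membership test on the dict: one hash lookup.
    hash_lookup = timeit.timeit(lambda: probe in table, number=repeats)
    return list_scan, hash_lookup
```

Calling `compare_lookups()` returns both timings in seconds. The absolute numbers depend on the machine, so none are quoted here.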

score 1 · Accepted Answer

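The getfrompair function makes this a classic O(n^2) problem, because it reads the second file from the start every time there is a match. What you want instead is to read from both files at the same time, so that each file is read only once. izip is the way to do that.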

from itertools import izip

for line,line2 in izip(infile, infile2):
    # line and line2 are the corresponding lines of the two files
    pass
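To make the lockstep idea concrete, here is a minimal Python 3 sketch, where the built-in `zip` is lazy and plays the role of `izip`. It is an illustration under stated assumptions, not the asker's script: `read_records`, `trim` and `demultiplex` are hypothetical names, the barcode is assumed to be the first 6 bases of each read in the first file, and one output handle per barcode is kept open instead of buffering strings.

```python
import os
from itertools import islice


def read_records(handle):
    """Yield 4-line records (header, sequence, separator, quality)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def trim(record, start, length):
    """Keep `length` bases and quality values, beginning at offset `start`."""
    header, seq, plus, qual = record
    end = start + length
    return header + seq.strip()[start:end] + "\n" + plus + qual.strip()[start:end] + "\n"


def demultiplex(path1, path2, searchdict, outdir, seqlen, barcode_len=6):
    """Split two paired files by barcode, reading each input exactly once."""
    handles = {}  # barcode -> (handle for read 1, handle for read 2)
    counts = {}
    try:
        with open(path1) as reads1, open(path2) as reads2:
            # zip pairs the n-th record of file 1 with the n-th record of file 2
            for rec1, rec2 in zip(read_records(reads1), read_records(reads2)):
                barcode = searchdict.get(rec1[1][:barcode_len])
                if barcode is None:
                    continue  # unrecognised barcode: skip both records
                if barcode not in handles:
                    handles[barcode] = (
                        open(os.path.join(outdir, barcode + ".fastq"), "a"),
                        open(os.path.join(outdir, barcode + "_2.fastq"), "a"),
                    )
                out1, out2 = handles[barcode]
                out1.write(trim(rec1, barcode_len, seqlen))
                out2.write(trim(rec2, 0, seqlen))
                counts[barcode] = counts.get(barcode, 0) + 1
    finally:
        for out1, out2 in handles.values():
            out1.close()
            out2.close()
    return counts
```

With 100 barcodes this keeps about 200 files open at once, which may exceed the open-file limit on some systems. In that case the buffered approach from the question could be kept while still reading the two inputs in lockstep.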

score 0 · Accepted Answer

>>> def dist(w1,w2):
...     return len(w1)-sum(map(lambda x:int(x[0]==x[1]),zip(w1,w2)))
...
>>> dist("DOG","FOG")
1
>>> dist("DOG","FOF")
2
>>> words = ["DOG","FOG","DAG","CAT","RAT","AOG","AAG"]
>>> target = "DOG"
>>> print filter(lambda x:dist(target,x)<2,words)
['DOG', 'FOG', 'DAG', 'AOG']

Then, to do what you want:

>>> import itertools
>>> my_alphabet = ["A","T","C","G"]
>>> target = "ATG"
>>> print filter(lambda x:dist(x,target)<2,itertools.permutations(my_alphabet,len(target)))
[('A', 'T', 'C'), ('A', 'T', 'G'), ('A', 'C', 'G'), ('C', 'T', 'G')]
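One caveat with the snippet above: `itertools.permutations` never repeats an element, so candidates such as AAG or TTG are not generated even though they are one mismatch away from ATG. A minimal Python 3 sketch using `itertools.product`, which does allow repeated letters, could look like this. `hamming` and `neighbours` are hypothetical names chosen for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def neighbours(target, alphabet="ATCG", max_dist=1):
    """All strings over `alphabet` within `max_dist` mismatches of `target`."""
    return [
        "".join(candidate)
        for candidate in product(alphabet, repeat=len(target))
        if hamming(candidate, target) <= max_dist
    ]
```

For the target ATG this yields ten sequences: the target itself plus nine one-mismatch variants. Enumerating every candidate is fine for 6-base barcodes (4^6 = 4096 strings), although generating the variants directly, as makehamming2 in the question does, does less work.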
