python - Python is inexplicably shortening the step size with each iteration of a sliding window analysis

Question

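I am working on a program that estimates the statistic Tajima's D in a series of sliding windows across a chromosome. The chromosome itself is also divided into a number of distinct regions that (hopefully) have functional significance. My script performs the sliding window analysis on each region.

At the start of the program, I define the size of the sliding windows and the size of the steps that move from one window to the next. I import a file containing the coordinates of each chromosomal region, and another file containing all the SNP data I am working with (this one is read line by line, because it is a large file). The program loops through the list of chromosomal locations. For each location, it generates an index of the steps and windows for the analysis, partitions the SNP data into output files (one per step), calculates key statistics for each step file, and combines these statistics to estimate Tajima's D for each window.

The program works well for small files of SNP data. It also works well for the first iteration, over the first chromosomal breakpoint. For large files of SNP data, however, the step size of the analysis inexplicably decreases as the program iterates through the chromosomal regions. For the first chromosomal region, the step size is 2500 nucleotides (which is what it is supposed to be). For the second chromosomal segment, however, the step size is 1966, and for the third it is 732.

If anyone has any suggestions as to why this might be happening, please let me know. I am especially baffled because the program seems to work for small files but not for large ones.

My code is below: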

import sys
import math
import fileinput
import shlex
import string
windowSize = int(500)
stepSize = int(250)
n = int(50)     #number of individuals in the analysis
SNP_file = open("SNPs-1.txt",'r')
SNP_file.readline()
breakpoints = open("C:/Users/gwilymh/Desktop/Python/Breakpoint coordinates.txt", 'r')
breakpoints = list(breakpoints)
numSegments = len(breakpoints)
# Open a file to store the Tajima's D results:
outputFile = open("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/Tajima's D estimates.txt", 'a')
outputFile.write(str("segmentNumber\tchrSegmentName\tsegmentStart\tsegmentStop\twindowNumber\twindowStart\twindowStop\tWindowSize\tnSNPs\tS\tD\n"))

#Calculating parameters a1, a2, b1, b2, c1 and c2
numPairwiseComparisons=n*((n-1)/2)
b1=(n+1)/(3*(n-1))
b2=(2*(n**2+n+3))/(9*n*(n-1))
num=list(range(1,n))                # n-1 values as a list
i=0
a1=0
for i in num:
   a1=a1+(1/i)
   i=i+1
j=0
a2=0
for j in num:
    a2=a2+(1/j**2)
    j=j+1
c1=(b1/a1)-(1/a1**2)
c2=(1/(a1**2+a2))*(b2 - ((n+2)/(a1*n))+ (a2/a1**2) )

counter6=0
#For each segment, assign a number and identify the start and stop coordinates and the segment name
for counter6 in range(counter6,numSegments):
    segment = shlex.shlex(breakpoints[counter6],posix = True)
    segment.whitespace += '\t'
    segment.whitespace_split = True
    segment = list(segment)
    segmentName = segment[0]
    segmentNumber = int(counter6+1)
    segmentStartPos = int(segment[1])
    segmentStopPos = int(segment[2])
    outputFile1 = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'a')

#Make output files to index the locations of each window within each segment
    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
    k = segmentStartPos - 1
    windowNumber = 0
    while (k+1) <=segmentStopPos:
        windowStart = k+1
        windowNumber = windowNumber+1
        windowStop = k + windowSize 
        if windowStop > segmentStopPos:
            windowStop = segmentStopPos
        windowFileIndex.write(("%s\t%s\t%s\n")%(str(windowNumber),str(windowStart),str(windowStop)))
        k=k+stepSize
    windowFileIndex.close()

# Make output files for each step to export the corresponding SNP data into + an index of these output files
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'a')
    i = segmentStartPos-1
    stepNumber = 0
    while (i+1) <= segmentStopPos:
        stepStart = i+1
        stepNumber = stepNumber+1
        stepStop = i+stepSize 
        if stepStop > segmentStopPos:
            stepStop = segmentStopPos
        stepFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepNumber))), 'a')
        stepFileIndex.write(("%s\t%s\t%s\n")%(str(stepNumber),str(stepStart),str(stepStop)))
        i=i+stepSize
    stepFile.close()
    stepFileIndex.close()

# Open the index file for each step in current chromosomal segment
    stepFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_stepFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
    stepFileIndex = list(stepFileIndex)
    numSteps = len(stepFileIndex)

    while 1:
        currentSNP = SNP_file.readline()
        if not currentSNP: break
        currentSNP = shlex.shlex(currentSNP,posix=True)
        currentSNP.whitespace += '\t'
        currentSNP.whitespace_split = True
        currentSNP = list(currentSNP)
        SNPlocation = int(currentSNP[0])
        if SNPlocation > segmentStopPos:break
        stepIndexBin = int(((SNPlocation-segmentStartPos-1)/stepSize)+1)
        #print(SNPlocation, stepIndexBin)
        writeFile = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(stepIndexBin))), 'a')
        writeFile.write((("%s\n")%(str(currentSNP[:]))))
        writeFile.close()

    counter3=0
    for counter3 in range(counter3,numSteps):
# open up each step in the list of steps across the chromosomal segment:
        L=shlex.shlex(stepFileIndex[counter3],posix=True)
        L.whitespace += '\t'
        L.whitespace_split = True
        L=list(L)
        #print(L)
        stepNumber = int(L[0])
        stepStart = int(L[1])
        stepStop = int(L[2])
        stepSize = int(stepStop-(stepStart-1))
#Now open the file of SNPs corresponding with the window in question and convert it into a list:
        currentStepFile = open(("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_step_%s.txt")%(str(segmentNumber),str(segmentName),str(counter3+1)),'r')
        currentStepFile = list(currentStepFile)
        nSNPsInCurrentStepFile = len(currentStepFile)
        print("number of SNPs in this step is:", nSNPsInCurrentStepFile)
        #print(currentStepFile)
        if nSNPsInCurrentStepFile == 0:
            mismatchesPerSiteList = [0] 
        else:
# For each line of the file, estimate the per site parameters relevant to Tajima's D
            mismatchesPerSiteList = list()
            counter4=0
            for counter4 in range(counter4,nSNPsInCurrentStepFile):
                CountA=0
                CountG=0
                CountC=0
                CountT=0
                x = counter4
                lineOfData = currentStepFile[x]
                counter5=0
                for counter5 in range(0,len(lineOfData)):
                    if lineOfData[counter5]==("A" or "a"): CountA=CountA+1
                    elif lineOfData[counter5]==("G" or "g"): CountG=CountG+1
                    elif lineOfData[counter5]==("C" or "c"): CountC=CountC+1
                    elif lineOfData[counter5]==("T" or "t"): CountT=CountT+1
                    else: continue
                AxG=CountA*CountG
                AxC=CountA*CountC
                AxT=CountA*CountT
                GxC=CountG*CountC
                GxT=CountG*CountT
                CxT=CountC*CountT
                NumberMismatches = AxG+AxC+AxT+GxC+GxT+CxT
                mismatchesPerSiteList=mismatchesPerSiteList+[NumberMismatches]
        outputFile1.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber, segmentName,stepNumber,stepStart,stepStop,stepSize,nSNPsInCurrentStepFile,sum(mismatchesPerSiteList))))
    outputFile1.close()

    windowFileIndex = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_windowFileIndex.txt")%(str(segmentNumber),str(segmentName))), 'r')
    windowFileIndex = list(windowFileIndex)
    numberOfWindows = len(windowFileIndex)
    stepData = open((("C:/Users/gwilymh/Desktop/Python/Sliding Window Analyses-2/%s_%s_Count of SNPs and mismatches per step.txt")%(str(segmentNumber),str(segmentName))), 'r')
    stepData = list(stepData)
    numberOfSteps = len(stepData)

    counter = 0
    for counter in range(counter, numberOfWindows):
        window = shlex.shlex(windowFileIndex[counter], posix = True)
        window.whitespace += "\t"
        window.whitespace_split = True
        window = list(window)
        windowNumber = int(window[0])
        firstCoordinateInCurrentWindow = int(window[1])
        lastCoordinateInCurrentWindow = int(window[2])
        currentWindowSize = lastCoordinateInCurrentWindow - firstCoordinateInCurrentWindow +1
        nSNPsInThisWindow = 0
        nMismatchesInThisWindow = 0

        counter2 = 0
        for counter2 in range(counter2,numberOfSteps):
            step = shlex.shlex(stepData[counter2], posix=True)
            step.whitespace += "\t"
            step.whitespace_split = True
            step = list(step)
            lastCoordinateInCurrentStep = int(step[4])
            if lastCoordinateInCurrentStep < firstCoordinateInCurrentWindow: continue
            elif lastCoordinateInCurrentStep <= lastCoordinateInCurrentWindow:
                nSNPsInThisStep = int(step[6])
                nMismatchesInThisStep = int(step[7])
                nSNPsInThisWindow = nSNPsInThisWindow + nSNPsInThisStep
                nMismatchesInThisWindow = nMismatchesInThisWindow + nMismatchesInThisStep
            elif lastCoordinateInCurrentStep > lastCoordinateInCurrentWindow: break
        if nSNPsInThisWindow ==0 :
            S = 0
            D = 0
        else:
            S = nSNPsInThisWindow/currentWindowSize
            pi = nMismatchesInThisWindow/(currentWindowSize*numPairwiseComparisons)
            print(nSNPsInThisWindow,nMismatchesInThisWindow,currentWindowSize,S,pi)
            D = (pi-(S/a1))/math.sqrt(c1*S + c2*S*(S-1/currentWindowSize))
        outputFile.write(str(("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n")%(segmentNumber,segmentName,segmentStartPos,segmentStopPos,windowNumber,firstCoordinateInCurrentWindow,lastCoordinateInCurrentWindow,currentWindowSize,nSNPsInThisWindow,S,D)))

score 5 · Accepted Answer

A quick search shows that you are changing stepSize on line 110:

    stepStart = int(L[1])
    stepStop = int(L[2])
    stepSize = int(stepStop-(stepStart-1))

I can't debug any further than that, because stepStop and stepStart appear to depend on the contents of your files.
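One plausible reading of the posted code, offered as a hedged sketch rather than a confirmed diagnosis: the last step of each segment is clipped to segmentStopPos, so when the per-step loop finishes, stepSize holds the length of that clipped final step, and the next segment then builds its windows and steps with the smaller value. The example below reproduces that pattern in isolation. The helper names (`step_bounds`, `step_size_per_segment`) and the segment coordinates are hypothetical and chosen only for illustration; they are not taken from the asker's files.

```python
def step_bounds(segment_start, segment_stop, step_size):
    """Yield (start, stop) for consecutive steps, clipping the last one."""
    k = segment_start - 1
    while k + 1 <= segment_stop:
        yield k + 1, min(k + step_size, segment_stop)
        k += step_size


def step_size_per_segment(segments, step_size, reuse_name):
    """Return the step size in effect when each segment starts.

    reuse_name=True mimics the question's script, where the length of the
    current step is stored back into the configured step-size variable.
    """
    in_effect = []
    for start, stop in segments:
        in_effect.append(step_size)
        for step_start, step_stop in step_bounds(start, stop, step_size):
            current_step_length = step_stop - (step_start - 1)
            if reuse_name:
                # The suspected bug: this overwrites the global setting.
                step_size = current_step_length
    return in_effect


# Hypothetical segment coordinates (assumption, for illustration only).
segments = [(1, 1100), (1101, 2050), (2051, 3000)]

buggy = step_size_per_segment(segments, 250, reuse_name=True)
fixed = step_size_per_segment(segments, 250, reuse_name=False)

print("reusing the name:", buggy)
print("separate name:   ", fixed)
```

With these made-up coordinates the first variant shrinks from 250 to 100 to 50, while the second stays at 250 for every segment. If this is indeed the cause, storing the per-step length under its own name (for example `currentStepSize`, a hypothetical name) instead of assigning to stepSize should be enough. It would also be consistent with the script appearing to work when segment lengths happen to be exact multiples of the step size, or when only one segment is processed.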
