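I do bioinformatics research, but I am new to Python. I wrote this code to parse a file of protein sequences. The file "bulk_sequences.txt" contains 71,423 lines. Each protein sequence takes up 3 lines, and the first of those lines carries information that includes the year the protein was discovered (that is what "/1945" refers to). The code works fine on a small sample of 1,000 lines, but on the large file it seems to take a very long time. Is there anything I can do to simplify this and make it faster?
The goal is to go through the file, sort it by year of discovery, and store all 3 lines of each protein's sequence data as one item in the list "sortedsqncs".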
import time
start = time.time()
with open("bulk_sequences.txt", "r") as f:
    fileread = f.read()
bulksqncs = fileread.split("\n")
year = 1933
newarray = []
years = []
thirties = ["/1933","/1934","/1935","/1936","/1937","/1938","/1939","/1940","/1941","/1942"]## years[0]
forties = ["/1943","/1944","/1945","/1946","/1947","/1948","/1949","/1950","/1951","/1952"]## years[1]
fifties = ["/1953","/1954","/1955","/1956","/1957","/1958","/1959","/1960","/1961","/1962"]## years[2]
sixties = ["/1963","/1964","/1965","/1966","/1967","/1968","/1969","/1970","/1971","/1972"]## years[3]
seventies = ["/1973","/1974","/1975","/1976","/1977","/1978","/1979","/1980","/1981","/1982"]## years[4]
eighties = ["/1983","/1984","/1985","/1986","/1987","/1988","/1989","/1990","/1991","/1992"]## years[5]
nineties = ["/1993","/1994","/1995","/1996","/1997","/1998","/1999","/2000","/2001","/2002"]## years[6]
twothsnds = ["/2003","/2004","/2005","/2006","/2007","/2008","/2009","/2010","/2011","/2012"]## years[7]
years = [thirties,forties,fifties,sixties,seventies,eighties,nineties,twothsnds]
count = 0
sortedsqncs = []
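# Walk the years in chronological order, decade by decade.
for x in range(len(years)):
    for i in range(len(years[x])):
        # Scan every line for the current "/YYYY" tag.
        for y in bulksqncs:
            if years[x][i] in y:
                # Rescan the whole file to find the position of the matching line.
                for n in range(len(bulksqncs)):
                    if y in bulksqncs[n]:
                        # Keep the header line and the two lines that follow it.
                        sortedsqncs.append(bulksqncs[n:n+3])
                        count += 1
print(len(sortedsqncs))
end = time.time()
print(round(end - start, 4))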
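
For comparison, here is a minimal sketch of a single-pass version of the same idea. It avoids rescanning the whole file for every matching line, which is probably where most of the time goes. It rests on assumptions that the description above does not confirm: that records are exactly 3 consecutive lines starting at the first line of the file, and that the year appears in each header line as a "/YYYY" token. The names `YEAR_RE` and `sort_records_by_year` are hypothetical. Unlike the loops above, it does not restrict the years to 1933-2012 and it adds each record only once.

```python
import re

# Assumption: the discovery year appears in the header line as "/YYYY".
YEAR_RE = re.compile(r"/(\d{4})")


def sort_records_by_year(lines, record_size=3):
    """Group lines into fixed-size records and sort them by header year."""
    # Assumption: records are consecutive and start at the first line.
    # A trailing partial record (or a trailing empty line) is ignored.
    records = [lines[i:i + record_size]
               for i in range(0, len(lines) - record_size + 1, record_size)]

    dated = []
    for record in records:
        match = YEAR_RE.search(record[0])
        if match:  # skip records whose header has no "/YYYY" token
            dated.append((int(match.group(1)), record))

    # list.sort is stable, so records from the same year keep file order.
    dated.sort(key=lambda pair: pair[0])
    return [record for _, record in dated]
```

With the variables from the code above, this could be called as `sortedsqncs = sort_records_by_year(bulksqncs)`. Each line is visited once and the records are sorted once, so the cost grows roughly in proportion to the file size (plus the sort) rather than with its square.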