python - Python はファイル内の 100 万個の文字列を検索し、各文字列の出現回数を数えます

Question

これは、それを行うための最速の方法を見つけることです。別の行に約100万個の文字列（長さ6〜40）を含むfile1があります。約 80,000 個の文字列を含む別の file2 でそれらのそれぞれを検索し、出現回数をカウントします (1 つの文字列で小さな文字列が複数回見つかった場合、この文字列の出現回数は 1 のままです)。パフォーマンスの比較に興味がある人は、file1 と file2 をダウンロードするためのリンクがあります。dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m

私が今行っているのは、ファイル 2 の辞書を作成し、文字列 ID をキーとして、文字列を値として使用することです。(file2 の文字列は値が重複しているため、文字列 ID のみが一意であるため) 私のコードは

for line in file1:
   substring=line[:-1].split("\t")
   for ID in dictionary.keys():
       bigstring=dictionary[ID]
       IDlist=[]
       if bigstring.find(substring)!=-1:
           IDlist.append(ID)
   output.write("%s\t%s\n" % (substring,str(len(IDlist))))

私のコードは完了するまでに数時間かかります。誰でもそれを行うためのより速い方法を提案できますか? file1 と file2 はどちらも約 50M です。私の PC には 8G のメモリがあり、高速化するために必要なだけメモリを使用できます。1時間で終わる方法ならなんでもOKです:)

ここで、以下のコメントからいくつかの提案を試した後、パフォーマンスの比較を参照してください。最初にコードが来て、次に実行時間です。

Mark Amery および他の人々によって提案されたいくつかの改善

import sys
from Bio import SeqIO

#first I load strings in file2 to a dictionary called var_seq, 
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)

print len(var_seq) #Here print out 76827, which is the right number. loading file2 to var_seq doesn't take long, about 1 second, you shall not focus here to improve performance

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]   #ensp is just peptides iD
    peptide=row[1] #peptides is the substrings i want to search in file2
    num=0
    for ID,bigstring in var_seq.iteritems(): 
        if peptide in bigstring:
            num+=1

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

完了するまでに 1m4 秒かかります。私の古いものと比較して改善された20代

#######エントロピーが提案する次の方法

from collections import defaultdict
var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1

print len(var_seq) # here print out 59502, duplicates are removed, but occurances of duplicates are stored as value 
handle.close()

output=open(outputfilename,'w')
icount=0

with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences

    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break

output.close()

これには 1 分 10 秒かかりますが、重複の検索を回避するため、期待したほど速くはありません。理由がわかりません。

Mark Amery によって提案された Haystack と Needle の方法が最速であることが判明しました。この方法の問題は、すべての部分文字列のカウント結果が 0 になることです。これはまだわかりません。

これが私が彼の方法を実装したコードです。

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]

print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)

print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): #should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break

input1.close()
handle.close()
output.close()

完了するのに 0 分 11 秒しかかからず、他の方法よりもはるかに高速です。しかし、カウント結果をすべて0にするのは私のミスなのか、マークの方法に欠陥があるのかわかりません。

score 3 · Accepted Answer

コードが機能していないようです (実際のコードを貼り付けるのではなく、メモリから引用しただけではありませんか?)

たとえば、次の行です。

substring=line[:-1].split("\t")

substringt はリストになります。しかし、後で次のようにします。

if bigstring.find(substring)!=-1:

を呼び出すと、エラーが発生しますstr.find(list)。

いずれにせよ、最も内側のループで無駄にリストを作成しています。これ：

IDlist=[]
if bigstring.find(substring)!=-1:
     IDlist.append(ID)

 #later
 len(IDlist)

それはリストを無用に割り当てて解放し、メモリのスラッシングを引き起こし、すべてを無用に行き詰まらせます。

これは機能するはずのコードであり、より効率的な手段を使用してカウントを行います。

from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring,num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count

PS：ある時点でタブ分割されているため、1行に複数の単語があると想定していますline.split("\t")。それが間違っている場合、コードを修正するのは簡単なはずです。

PPS：これが遅すぎて使用できない場合（試してみる必要がありますが、あなたが言った文字列の数を考えると、これは〜10分で実行されるはずです）。提案されたコメントの 1 つとしてサフィックスツリーを使用する必要があります。

file2編集:パフォーマンスに悪影響を与えることなく、同じ文字列の複数回の出現を処理するようにコードを修正しました

編集 2: 最大スペースを時間と交換します。

以下は、大量のメモリを消費し、ディクショナリの作成に時間がかかるコードです。ただし、これが完了すると、100 万個の文字列を検索するたびに、1 回のハッシュテーブルルックアップにかかる時間、つまりO(1).

プロセスの各ステップにかかる時間を記録するために、いくつかのステートメントを追加したことに注意してください。検索時にどの部分が費やされたかがわかるように、それらを保持する必要があります。1000 個の文字列のみでテストしているため、コストの 90% が検索時間ではなくビルド時間である場合、100 万個の文字列でテストする場合でも、1 回しか実行しないため、これは非常に重要です。案件

また、ファイル 1 とファイル 2 を解析するようにコードを修正したので、これをプラグインしてテストできるはずです。

from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1,len(s)+1):
        for offset in range(len(s)-length+1):
            result.add(s[offset:offset+length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1
build_end = datetime.now()
print "Dictionary built in: %gs" % (build-end-build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end-search_start).total_seconds()

score 0 · Accepted Answer

私はアルゴリズムの専門家ではありませんが、これによりパフォーマンスが大幅に向上すると思います。「haystacks」を調べたい大きな単語のリストに設定し、「needles」を探しているサブストリングのリストに設定する必要があります（どちらも重複を含めることができます）。あなたの側で実装します。提案された解決策のパフォーマンスを簡単に比較できるように、針のリストと干し草の山のリストを投稿できれば素晴らしいと思います。

haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)

コードを見れば何が起こっているのか理解できるかもしれませんが、言葉で言えば、干し草の山の単語の部分文字列の巨大なツリーを構築します。針があれば、文字ごとにツリーをナビゲートして終了できます。そのサブストリングを含む干し草スタックのすべての干し草スタックIDのセットをそれに接続しているノードでアップします。次に、針ごとにツリーを調べて、最後にセットのサイズを数えます。

python - Python はファイル内の 100 万個の文字列を検索し、各文字列の出現回数を数えます

2 に答える 2

Related

Reference