python - 大きなリスト全体で部分文字列を検索する

Question

2つの文字列の間にあるすべての形式の挿入を見つけようとしています。したがって、1400万の文字列のリストがあり、各文字列について、どのような挿入によって1つの文字列を別の文字列に変換できるかを確認する必要があります（基本的に挿入頻度をカウントします）。xが1つの文字列で、yが別の文字列であり、xがyの部分文字列であるとすると、どの挿入がxをyに変換するかを調べる必要があります。

次のコードセグメントを使用しています。それは機能しますが、多くの時間がかかっています。64個のプロセッサに負荷を分散させようとしましたが、完了するまでに20日かかります。

for i in Words:
#trying to distribute load across different processes, so can ignore this part
   h = hashlib.sha256(i)
   n = int(h.hexdigest(),base=16)
   if (n%64!=ix): #ix is a process based id
    continue


   for j in Words:#
    if len(i)>len(j):
        continue
    if( i!=j and i in j):  # i is a substring of j
        ind=j.find(i)
        s1=j[0:ind]
        s2=j[ind+len(i):len(j)]

                    if(len(s1)>0):
            if (not transform.has_key(s1)):
                transform[s1]=1
            else:
                transform[s1]+=1

        if(len(s2)>0):
            if (not transform.has_key(s2)):
                transform[s2]=1
            else:
                transform[s2]+=1

score 1 · Accepted Answer

各単語を相互に比較する代わりに（二次ランタイム）、各単語の適切なサブストリングをそれぞれ取得し（線形ランタイム、単語の長さが制限されていると仮定）、それが単語のセットに含まれているかどうかを確認します（aの要素の検索setは一定時間です）。

これは私のラップトップでは2秒未満で実行されました（46265ワード（長さ<10）で47015の一意の変換（合計799089））：

from collections import Counter

# for testing
from random import choice, randrange
from string import ascii_uppercase
big_word = "".join(choice(ascii_uppercase) for i in range(10000))
words = [big_word[randrange(len(big_word)):][:randrange(1, 10)] for i in range(100000)] # words of up to 9 letters; all are substrings of big_word  

# now the real code
def insertions(words):
    for word in words:
        for i in range(1, len(word) - 1):
            ins = word[:i]
            rest = word[i:]
            for j in range(1, len(rest)):
                if rest[:j] in words:
                    yield ins
        for i in range(1, len(word) - 1):
            rest = word[:i]
            ins = word[i:]
            for j in range(len(rest) - 1):
                if rest[j:] in words:
                    yield ins

transforms = Counter(insertions(set(words)))

python - 大きなリスト全体で部分文字列を検索する

1 に答える 1

Related

Reference