python - Pythonで大きなテキストファイル内の文字列を検索する高速な方法

Question

それが私の現在の状況です。

アルファベット順にソートされた約25万の文字列を含む2.5MBのテキストファイルがあります
各文字列は一意です
テキストファイルのエントリを変更する必要はありません。テキストファイルが読み込まれると、編集されることはありません。
テキストファイルは開始時にロードされ、それから文字列を検索する必要があります

最後のポイントは問題です。実際には、文字列の完全一致と部分一致を検索する必要があります。私が書いたアルゴリズムは、プロセスを高速化するためのいくつかの試みと組み合わせた正規表現の使用を含んでいました。たとえば、アルファベットの単数文字を識別する辞書のインデックスをスクリプトにハードコーディングしてから、大きなテキストファイルを分割しました。 26の小さな辞書に架空のもの。それはまったく役に立たなかった、スクリプトはまだ信じられないほど遅い。ここでいくつかの投稿をざっと読んで、mmapを試してみると確信しました。しかし、正規表現を考えると、すべての部分一致を見つけるのは役に立たないように見えました。結局、これが何であるかはほとんどわかりませんが、トライが私の問題を解決するかもしれないという結論に達しました。私は試して行くべきですか？もしそうなら、Pythonでトライの作成にどのように進むべきですか？marisa-trieモジュールは良いですか？みんなありがとう

編集：「部分一致」とは、文字列のプレフィックスがあることを意味します。私は最初の最後や途中で試合をする必要はありません。

score 5 · Accepted Answer

最も簡単で最速のソリューション：

#!/usr/bin/env python

d = {}

# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts","r")
for line in f:
    line = line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        d[line[:i]] = True
f.close()


while True:
    w = raw_input('> ')
    if not w:
        break

    if w in d:
        print "match found", w

これは少し複雑ですが、メモリ効率が良いものです。

#!/usr/bin/env python

d = []

def binary_search(a, x, lo=0, hi=None):
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        midval = a[mid]
        if midval < x:
            lo = mid+1
        elif midval > x:
            hi = mid
        else:
            return mid
    return -1


f = open("/etc/hosts","r")
for line in f:
    line=line.rstrip()
    l = len(line)+1
    for i in xrange(1,l):
        x = hash(line[:i])
        d.append(x)
f.close()

d.sort()

while True:
    w = raw_input('> ')
    if not w:
        break

    if binary_search(d, hash(w)) != -1:
        print "match found", w

score 2 · Accepted Answer

ファイルはすでに並べ替えられて読み込まれているため、凝ったデータ構造に頼ることなく、ファイルでバイナリ検索を使用できます。Pythonには、 bisect.bisect_left`というバイナリ検索関数が組み込まれています。

score 1 · Accepted Answer

トライを使用します。

#dictionary is a list of words
def parse_dictionary(dictionary):
    dictionary_trie = {}
    for word in dictionary:
        tmp_trie = dictionary_trie
        for letter in word:
            if letter not in tmp_trie:
                tmp_trie[letter] = {}
            if 'words' not in tmp_trie[letter]:
                tmp_trie[letter]['words'] = []

            tmp_trie[letter]['words'].append(word)
            tmp_trie = tmp_trie[letter]
    return dictionary_trie

def matches(substring, trie):
    d = trie
    for letter in substring:
        try:
            d = d[letter]
        except KeyError:
            return []
    return d['words']

使用例：

>>> import pprint
>>> dictionary = ['test', 'testing', 'hello', 'world', 'hai']
>>> trie = parse_dictionary(dictionary)
>>> pprint.pprint(trie)
{'h': {'a': {'i': {'words': ['hai']}, 'words': ['hai']},
       'e': {'l': {'l': {'o': {'words': ['hello']}, 'words': ['hello']},
                   'words': ['hello']},
             'words': ['hello']},
       'words': ['hello', 'hai']},
 't': {'e': {'s': {'t': {'i': {'n': {'g': {'words': ['testing']},
                                     'words': ['testing']},
                               'words': ['testing']},
                         'words': ['test', 'testing']},
                   'words': ['test', 'testing']},
             'words': ['test', 'testing']},
       'words': ['test', 'testing']},
 'w': {'o': {'r': {'l': {'d': {'words': ['world']}, 'words': ['world']},
                   'words': ['world']},
             'words': ['world']},
       'words': ['world']}}
>>> matches('h', trie)
['hello', 'hai']
>>> matches('he', trie)
['hello']
>>> matches('asd', trie)
[]
>>> matches('test', trie)
['test', 'testing']
>>>

score 0 · Accepted Answer

リストを作成し、各行をリストの1つの要素にして、バイナリ検索を実行できます。

score 0 · Accepted Answer

トライを使用するには、ファイル全体を反復するO（n）であるトライを作成する必要があります。並べ替えを利用するとO（log_2 n）になります。したがって、このより高速なソリューションでは、バイナリ検索を使用します（以下を参照）。

このソリューションでも、ファイル全体を読み込む必要があります。さらに高速なソリューションでは、ファイルを前処理し、すべての行をパディングして同じ長さにすることができます（または、ファイル内にある種のインデックス構造を構築して、リストの中央を検索できるようにします）- -次に、ファイルの中央に移動しようとすると、リストの中央に移動します。「さらに高速な」ソリューションは、おそらく本当に非常に大きなファイル（ギガバイトまたは数百メガバイト）にのみ必要です。あなたは彼らがこれを二分探索と組み合わせるでしょう。

おそらく、ファイルシステムがスパースファイルをサポートしている場合、上記のパディングスキームを実行しても、ディスクで実際に使用されているファイルのブロック数は増加しません。

次に、その時点で、インデックス作成を効率的にするために、おそらくb-treeまたはb+treeの実装に近づいています。したがって、 bツリーライブラリを使用できます。

このようなもの：

import bisect

entries = ["a", "b", "c", "cc", "cd", "ce", "d", "e", "f" ]

def find_matches(ls, m):

    x = len(ls) / 2
    match_index = -1

    index = bisect.bisect_left(ls, m)
    matches = []

    while ls[index].startswith(m):
        matches.append(ls[index])
        index += 1

    return matches

print find_matches(entries, "c")

出力：

>>> ['c', 'cc', 'cd', 'ce']

score 0 · Accepted Answer

したがって、arainchiの非常に良い答えを説明するために、ファイルのすべての行のエントリを含む辞書を作成します。次に、検索文字列をそれらのエントリの名前と照合できます。辞書は、この種の検索に非常に便利です。

python - Pythonで大きなテキストファイル内の文字列を検索する高速な方法

6 に答える 6

Related

Reference