algorithm - what is the better way to search in millions of file names with wildcard(GLOB) support

Question

i am working on a small search engine to display a matching file names with full path. and important thing is that i need to provide wildcard(GLOB) search like *.doc or *list*.xlx or *timesheet* or ???.doc or something like that.

i found some related solution

Search for strings matching the pattern "abc:*:xyz" in less than O(n)

but i am looking for efficient algorithms which can find matches out of million file names in a less than a second, so better than O(n) is required..

i am thinking of two phase algorithm with substring array (Suffix array + prefix array) search in first phase and normal RegEx search thru the results of first phase second phase.

any help would be greatly appreciated...

score 3 · Accepted Answer

自己索引付けを確認してください:この Stack Overflow questionと、この DrDobbs の記事です。

score 2 · Accepted Answer

私の知る限り、一般化されたグロブ検索で O(n) よりも優れた方法はありません。

ただし、プレフィックス検索とサフィックス検索の特殊なケースでは、ソートされたインデックスを作成してバイナリ検索を実行できます。その結果、プレフィックス検索とサフィックス検索で O(log n) が発生します。プレフィックスインデックスは、最初の文字、次に 2 番目の文字に基づいて並べ替えられます。サフィックスインデックスは、最後の文字に基づいて並べ替えられ、次に最後から 2 番目の文字に基づいて並べ替えられます (逆文字列)。

あなたの投稿で提案されているように、2 つのフェーズで検索を行い、プレフィックスまたはサフィックスインデックスを検索し、続いてグロブから作成された正規表現を使用して、最初のフェーズから提供された削減されたリストをブルートフォース検索します。

文字列の長さの比較は正規表現よりも高速であるため、一致する最小文字列の長さ、または ???.doc の例では固定長の一致する文字列を事前にフィルター処理します。

元の投稿の音から、最終結果が見つかった後に表示できるように、インデックスは各エントリのフルパスも参照する必要があります。

algorithm - what is the better way to search in millions of file names with wildcard(GLOB) support

2 に答える 2

Related

Reference