string - Word-wise super string search for given string

Question

For any input string, we need to find super string by word match in any order. i.e. all words in the input string has to occur in any order in output string. e.g. given data set: "string search" "java string search" "manual c string search equals" "java search code" "c java code search" ...

input: "java search" output: 1) "java string search" 2) "java search code" 3) "c java code search"

input: "search c" output: 1) "manual c string search equals" 2) "c java code search"

This can be done in a very trivial way with word by word matching. Here mainly I am looking for an efficient algo.

Input: A few billion records in given data set (mostly 1 to 10 words length string). I need to find super string for millions of strings. Note: words are extended dictionary ones.

score 1 · Accepted Answer

入力を前処理し (可能な場合)、データセットに表示される単語にインデックスを付けます。各単語から可能な出力文字列のセットへのマッピングを生成します。たとえば、データセットでは

0 string search
1 java string search
2 manual c string search equals
3 java search code
4 c java code search

我々が得る

c {2,4}
code {3,4}
equals {2}
java {1,3,4}
...

次に、特定の入力の一致を検索することは、入力単語に対応するセットを交差させるのと同じくらい簡単です。

input: "java c"
output: {1,3,4} intersect {2,4} = {4}

セットを並べ替えられたリストとして保存すると、リストを並行してスキャンすることにより、交差を線形時間 (入力セットの全長に線形) で行うことができます。

score 0 · Accepted Answer

基本的に、input_words と data_words という 2 つの単語セットの共通点を見つける必要があります。交差が input_words と等しい場合、一致しています。

集合交差の効率的なアルゴリズムは次のとおりです。効率的なリスト交差アルゴリズム

私の頭に浮かんだ、O(n*m)[n=サイズ入力、m=サイズデータ]で完結するアルゴリズムが。

パイソン:

match = True
for word in input.split():
  if word in data_words.split(): # linear search comparing word to each word
    continue
  else:
    match = False
    break

ソートされたリストでの検索はより速くなり、ハッシュ検索はさらに速くなります。それらについては、上記のリンクで詳しく説明されています。

string - Word-wise super string search for given string

2 に答える 2

Related

Reference