elasticsearch - Elasticsearch fuzzy matching: max_expansions & min_similarity

Question

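In my project I use fuzzy matching, mainly to find misspellings and alternative spellings of the same name. I need to understand exactly how Elasticsearch's fuzzy matching works and how the two parameters named in the title are used.

As I understand it, min_similarity is the fraction by which the queried string must match a string in the database. I couldn't find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search is performed. If it really were the Levenshtein distance, it would be the ideal solution for me. In any case, it isn't working. For example, take the word "Samvel":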

queryStr      max_expansions           matches?
samvel        0                        Error: should not be 0 (but a Levenshtein distance can be 0!)
samvel        1                        Yes
samvvel       1                        Yes
samvvell      1                        Yes (but it shouldn't have)
samvelll      1                        Yes (but it shouldn't have)
saamvelll     1                        No (but for some weird reason it matches Samvelian)
saamvelll     anything bigger than 1   No

The documentation says something I don't really understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

So please explain exactly how these parameters affect the search results.

score 25 · Accepted Answer

min_similarity is a value between 0 and 1. From the Lucene documentation:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The "edit distance" referred to is the Levenshtein distance.
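To make the quoted rule concrete, here is a minimal Python sketch. It assumes the similarity is computed as 1 - distance / min(length), which agrees with the quoted rule for terms of equal length but is an assumption, not something the quote states. The function names are hypothetical and are not part of Lucene or Elasticsearch.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    """ASSUMED formula: 1 - distance / length of the shorter string."""
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def is_similar(query: str, term: str, min_similarity: float = 0.5) -> bool:
    """A term counts as a match only if it scores strictly above the threshold."""
    return similarity(query, term) > min_similarity
```

Under this assumed formula, "samvvell" is at distance 2 from "samvel" and scores about 0.67, so it passes a threshold of 0.5, while "saamvelll" is at distance 3 and scores exactly 0.5, so it does not.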

The way this query works internally is:

It finds all terms in the index that could match the search term, taking min_similarity into account.
It then searches for all of those terms.

You can imagine how heavy this query can be!

To counter this, you can set the max_expansions parameter, which specifies the maximum number of matching terms that should be considered.
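The two steps and the effect of max_expansions can be simulated with a toy inverted index in Python. This is only a sketch: the index contents, the function names, the similarity formula (1 - distance / shorter length) and the choice to keep the most similar terms when the cap applies are all assumptions made for illustration, not Elasticsearch or Lucene code.

```python
from typing import Dict, Iterable, List, Optional, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def expand_terms(query: str, index_terms: Iterable[str],
                 min_similarity: float = 0.5,
                 max_expansions: Optional[int] = None) -> List[str]:
    """Step 1: collect index terms similar enough to the query term."""
    scored = []
    for term in index_terms:
        shortest = min(len(query), len(term))
        if shortest == 0:
            continue
        score = 1.0 - levenshtein(query, term) / shortest  # ASSUMED formula
        if score > min_similarity:
            scored.append((score, term))
    # ASSUMPTION: when capped, the most similar terms are the ones kept.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    terms = [term for _, term in scored]
    return terms if max_expansions is None else terms[:max_expansions]


def fuzzy_search(query: str, inverted_index: Dict[str, Set[int]],
                 min_similarity: float = 0.5,
                 max_expansions: Optional[int] = None) -> Set[int]:
    """Step 2: search for every expanded term and merge the document ids."""
    hits: Set[int] = set()
    for term in expand_terms(query, inverted_index.keys(),
                             min_similarity, max_expansions):
        hits |= inverted_index[term]
    return hits


# Hypothetical index: term -> ids of the documents containing it.
INDEX = {"samvel": {1}, "samuel": {2}, "samvelian": {3}, "samvell": {5}}
```

In this toy setup, fuzzy_search("samvel", INDEX) expands to three terms, while passing max_expansions=1 keeps only the single closest term. The cap limits how many terms are searched; it is not an edit distance.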
