python - python-Levenshtein.ratio の計算方法

Question

python-Levenshtein.ratioソースによると：

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L722

として計算され(lensum - ldist) / lensumます。これは

# pip install python-Levenshtein
import Levenshtein
Levenshtein.distance('ab', 'a') # returns 1
Levenshtein.ratio('ab', 'a')    # returns 0.666666

しかし、それは壊れているようです

Levenshtein.distance('ab', 'ac')  # returns 1
Levenshtein.ratio('ab', 'ac')     # returns 0.5

私は非常に単純なものが欠けているに違いないと感じています..しかし、なぜ0.75ですか？

score 23 · Accepted Answer

'ab'とのレーベンシュタイン距離は次の'ac'とおりです。

したがって、アライメントは次のとおりです。

  a c
  a b

アライメントの長さ = 2
不一致の数 = 1

Levenshtein Distanceこれは、転送(または反転) する1ために必要な置換が 1 つだけであるためです。acab

距離比 = (レーベンシュタイン距離)/(線形の長さ) = 0.5

編集

あなたは書いている

(lensum - ldist) / lensum= (1 - ldist/lensum)= 1 - 0.5 = 0.5。

しかし、これはマッチング(距離ではない)
REFFRENCEです。

Matching %

p = (1 - l/m) × 100

とは次の単語lです。 levenshtein distancemlength of the longest of the two

_{(注意: 一部の著者は 2 つの中で最も長いものを使用しています。私はアラインメントの長さを使用しました)}

(1 - 3/7) × 100 = 57.14...  

  (Word 1    Word 2    RATIO   Mis-Match   Match%
   AB         AB         0       0        (1 - 0/2 )*100  = 100%  
   CD         AB         1       2        (1 - 2/2 )*100  = 0%   
   AB         AC        .5       1        (1 - 1/2 )*100  = 50%

_{レーベンシュタインはギャップを考慮していないため、一部の著者はアラインメントの長さで分割し、他の著者は両方のいずれかの最大長で分割するのはなぜですか? 距離 = 編集数 (挿入 + 削除 + 置換)、標準的なグローバルアラインメントであるNeedleman–Wunsch アルゴリズムはギャップを考慮します。これは、Needleman–Wunsch と Levenshtein の間の (ギャップ) 差です。多くの紙では、2 つのシーケンス間の最大距離が使用されています (ただし、これは私自身の理解であり、100% 確信が持てません) 。}

ここにIEEE TRANSACTIONS ON PAITERN ANALYSISがあります:正規化された編集距離とアプリケーションの計算

2 つの文字列 X と Y が有限のアルファベット上にある場合、X と Y の間の正規化された編集距離 d( X , Y ) は、W( P ) / L ( P )w の最小値として定義されます。ここで、P は間の編集パスです。 X と Y 、W ( P ) は P の基本編集操作の重みの合計であり、L(P) はこれらの操作の数 (P の長さ) です。

score 21 · Accepted Answer

Cコードをより注意深く見ると、この明らかな矛盾はratio、「置換」編集操作を他の操作とは異なる方法で処理する（つまり、コストが2である）ためであることがわかりましdistanceた。 1のコスト。

levenshtein_commonこれは、関数内で行われた内部関数の呼び出しで確認できratio_pyます。

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L727

static PyObject*
ratio_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "ratio", 1, &lensum)) < 0) //Call
    return NULL;

  if (lensum == 0)
    return PyFloat_FromDouble(1.0);

  return PyFloat_FromDouble((double)(lensum - ldist)/(lensum));
}

およびdistance_py機能別：

https://github.com/miohtama/python-Levenshtein/blob/master/Levenshtein.c#L715

static PyObject*
distance_py(PyObject *self, PyObject *args)
{
  size_t lensum;
  long int ldist;

  if ((ldist = levenshtein_common(args, "distance", 0, &lensum)) < 0)
    return NULL;

  return PyInt_FromLong((long)ldist);
}

lev_edit_distanceこれにより、最終的に、次のドキュメントスニペットを持つ別の内部関数に異なるコスト引数が送信されます。

@xcost: If nonzero, the replace operation has weight 2, otherwise all
        edit operations have equal weights of 1.

lev_edit_distance（）のコード：

/**
 * lev_edit_distance:
 * @len1: The length of @string1.
 * @string1: A sequence of bytes of length @len1, may contain NUL characters.
 * @len2: The length of @string2.
 * @string2: A sequence of bytes of length @len2, may contain NUL characters.
 * @xcost: If nonzero, the replace operation has weight 2, otherwise all
 *         edit operations have equal weights of 1.
 *
 * Computes Levenshtein edit distance of two strings.
 *
 * Returns: The edit distance.
 **/
_LEV_STATIC_PY size_t
lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;

[答え]

だから私の例では、

ratio('ab', 'ac')文字列（4）の全長にわたる置換操作（コスト2）を意味します。したがって、2/4 = 0.5。

それが「方法」を説明しているので、残っているのは「理由」だけだと思いますが、今のところこの理解には満足しています。

score 9 · Accepted Answer

(lensum - ldist) / lensum

ldist は距離ではなく、コストの合計です

一致しない配列の各番号は、上から、左から、または斜めから来ます

数字が左から来れば挿入、上から来れば削除、斜めから来れば置換

挿入と削除のコストは 1 で、置換のコストは 2 です。削除と挿入であるため、置換のコストは 2 です。

ab ac コストは交換なので 2 です

>>> import Levenshtein as lev
>>> lev.distance("ab","ac")
1
>>> lev.ratio("ab","ac")
0.5
>>> (4.0-1.0)/4.0    #Erro, the distance is 1 but the cost is 2 to be a replacement
0.75
>>> lev.ratio("ab","a")
0.6666666666666666
>>> lev.distance("ab","a")
1
>>> (3.0-1.0)/3.0    #Coincidence, the distance equal to the cost of insertion that is 1
0.6666666666666666
>>> x="ab"
>>> y="ac"
>>> lev.editops(x,y)
[('replace', 1, 1)]
>>> ldist = sum([2 for item in lev.editops(x,y) if item[0] == 'replace'])+ sum([1 for item in lev.editops(x,y) if item[0] != 'replace'])
>>> ldist
2
>>> ln=len(x)+len(y)
>>> ln
4
>>> (4.0-2.0)/4.0
0.5

詳細については、python-Levenshtein 比率の計算を参照してください。

もう一つの例：

コストは 9 (4 置換 => 4*2=8 および 1 削除 1*1=1、8+1=9)

str1=len("google") #6
str2=len("look-at") #7
str1 + str2 #13

距離 = 5 (行列のベクトル (7, 6) = 5 による)

比率は (13-9)/13 = 0.3076923076923077 です。

>>> c="look-at"
>>> d="google"
>>> lev.editops(c,d)
[('replace', 0, 0), ('delete', 3, 3), ('replace', 4, 3), ('replace', 5, 4), ('replace', 6, 5)]
>>> lev.ratio(c,d)
0.3076923076923077
>>> lev.distance(c,d)
5

score 4 · Accepted Answer

絶対的な基準はありませんが、正規化されたレーベンシュタイン距離が最も一般的に定義されてldist / max(len(a), len(b))います。どちらの例でも .5 になります。

はmaxレーベンシュタイン距離の上限であるため、意味があります。 afrom bwhereを取得するにはlen(a) > len(b)、いつでもの最初のlen(b)要素をbの対応する要素に置き換えてからa、不足している部分を挿入してa[len(b):]、合計のlen(a)編集操作を行うことができます。

この議論は、明らかな方法でlen(a) <= len(b). 正規化された距離を類似度に変換するには、距離を 1 から引きます1 - ldist / max(len(a), len(b))。

python - python-Levenshtein.ratio の計算方法

4 に答える 4

Related

Reference