c# - 指定された2つの文字列の距離類似度を計算する方法は?

Question

2 つの文字列の類似性を計算する必要があります。では、正確にはどういう意味ですか？例を挙げて説明しましょう：

本当の言葉：hospital
間違った単語:haspita

ここでの目的は、間違った単語を修正して本物の単語を取得するために必要な文字数を決定することです。この例では、2 文字を変更する必要があります。では、何パーセントになるでしょうか？私はいつも本当の言葉の長さを取ります。したがって、2 / 8 = 25% となり、これら 2 つの指定された文字列 DSM は 75% になります。

パフォーマンスを重要な考慮事項として、これをどのように達成できますか?

score 75 · Accepted Answer

数週間前、まったく同じ問題に対処しました。誰かが今尋ねているので、コードを共有します。私の徹底的なテストでは、最大距離が指定されていない場合でも、コードはウィキペディアの C# の例よりも約 10 倍高速です。最大距離が指定されている場合、このパフォーマンスの向上は 30x - 100x + に増加します。パフォーマンスに関するいくつかの重要なポイントに注意してください。

同じ単語を何度も比較する必要がある場合は、最初に単語を整数の配列に変換します。Damerau-Levenshtein アルゴリズムには、多くの >、<、== 比較が含まれており、ints比較よりもはるかに高速ですchars。
距離が指定された最大値を超えた場合に終了する短絡メカニズムが含まれています
私が他の場所で見たすべての実装のように、大規模な行列ではなく、3 つの配列のローテーションセットを使用します。
配列が短いワード幅でスライスされていることを確認してください。

コード (パラメーター宣言で次のように置き換えint[]ても、まったく同じように機能します。String

/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {

    int length1 = source.Length;
    int length2 = target.Length;

    // Return trivial case - difference in string lengths exceeds threshhold
    if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }

    // Ensure arrays [i] / length1 use shorter length 
    if (length1 > length2) {
        Swap(ref target, ref source);
        Swap(ref length1, ref length2);
    }

    int maxi = length1;
    int maxj = length2;

    int[] dCurrent = new int[maxi + 1];
    int[] dMinus1 = new int[maxi + 1];
    int[] dMinus2 = new int[maxi + 1];
    int[] dSwap;

    for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }

    int jm1 = 0, im1 = 0, im2 = -1;

    for (int j = 1; j <= maxj; j++) {

        // Rotate
        dSwap = dMinus2;
        dMinus2 = dMinus1;
        dMinus1 = dCurrent;
        dCurrent = dSwap;

        // Initialize
        int minDistance = int.MaxValue;
        dCurrent[0] = j;
        im1 = 0;
        im2 = -1;

        for (int i = 1; i <= maxi; i++) {

            int cost = source[im1] == target[jm1] ? 0 : 1;

            int del = dCurrent[im1] + 1;
            int ins = dMinus1[i] + 1;
            int sub = dMinus1[im1] + cost;

            //Fastest execution for min value of 3 integers
            int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);

            if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
                min = Math.Min(min, dMinus2[im2] + cost);

            dCurrent[i] = min;
            if (min < minDistance) { minDistance = min; }
            im1++;
            im2++;
        }
        jm1++;
        if (minDistance > threshold) { return int.MaxValue; }
    }

    int result = dCurrent[maxi];
    return (result > threshold) ? int.MaxValue : result;
}

どこSwapにある：

static void Swap<T>(ref T arg1,ref T arg2) {
    T temp = arg1;
    arg1 = arg2;
    arg2 = temp;
}

score 52 · Accepted Answer

探しているものは編集距離またはレーベンシュタイン距離と呼ばれます。ウィキペディアの記事では、計算方法が説明されており、C# でこのアルゴリズムを非常に簡単にコーディングできるように、下部にすばらしい擬似コードがあります。

以下にリンクされている最初のサイトからの実装を次に示します。

private static int  CalcLevenshteinDistance(string a, string b)
    {
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int  lengthA   = a.Length;
    int  lengthB   = b.Length;
    var  distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0;  i <= lengthA;  distances[i, 0] = i++);
    for (int j = 0;  j <= lengthB;  distances[0, j] = j++);

    for (int i = 1;  i <= lengthA;  i++)
        for (int j = 1;  j <= lengthB;  j++)
            {
            int  cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
                (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
                );
            }
    return distances[lengthA, lengthB];
    }

score 37 · Accepted Answer

使用できる文字列類似距離アルゴリズムは多数あります。ここにリストされている一部 (完全にリストされていません):

レベンスタイン
ニードルマン・ウンチ
スミス・ウォーターマン
スミス・ウォーターマン・ゴトー
ジャロ、ジャロ・ウィンクラー
ジャカードの類似性
ユークリッド距離
サイコロの類似度
コサイン類似度
モンジ・エルカン

これらすべての実装を含むライブラリは、Java と C# の両方の実装を持つSimMetricsと呼ばれます。

score 24 · Accepted Answer

LevenshteinとJaro Winklerは、次のような文字列間の小さな違いに最適であることがわかりました。

スペルミス; また
人名のoの代わりにö。

ただし、記事のタイトルのようなものを比較すると、テキストの大部分は同じですが、端に「ノイズ」があります。Smith-Waterman-Gotohは素晴らしいものでした。

これらの 2 つのタイトルを比較してください (同じものですが、ソースが異なると言葉遣いが異なります)。

紫外線照射された DNA に 1 本のポリヌクレオチド鎖の切断を導入する大腸菌由来のエンドヌクレアーゼ

エンドヌクレアーゼ III:紫外線照射した DNA に単一のポリヌクレオチド鎖切断を導入する大腸菌由来のエンドヌクレアーゼ

文字列のアルゴリズム比較を提供するこのサイトは、次のことを示しています。

レーベンシュタイン：81
スミスウォーターマンゴトー 94
ジャロ・ウィンクラー 78

Jaro Winkler と Levenshtein は、Smith Waterman Gotoh ほど類似性を検出する能力がありません。同じ記事ではないが、テキストが一致する 2 つのタイトルを比較すると、次のようになります。

高等植物における脂肪代謝。アシル補酵素Aおよびアシルアシル担体タンパク質の代謝におけるアシルチオエステラーゼの機能

高等植物における脂肪代謝。複雑な脂質混合物中のアシル-アシルキャリアタンパク質とアシル補酵素Aの測定

Jaro Winkler は偽陽性を示しますが、Smith Waterman Gotoh はそうではありません。

レーベンシュタイン：54
スミス・ウォーターマン・ゴトー 49
ジャロ・ウィンクラー 89

Anastasiosyalが指摘したように、SimMetricsにはこれらのアルゴリズムの Java コードがあります。SimMetrics のSmithWatermanGotoh Java コードを使用して成功しました。

score 7 · Accepted Answer

これは、類似度係数だけでなく、修正された単語のエラー位置も返します (この機能はテキストエディターで使用できます)。また、私の実装では、さまざまな重みのエラー (置換、削除、挿入、転置) がサポートされています。

public static List<Mistake> OptimalStringAlignmentDistance(
  string word, string correctedWord,
  bool transposition = true,
  int substitutionCost = 1,
  int insertionCost = 1,
  int deletionCost = 1,
  int transpositionCost = 1)
{
    int w_length = word.Length;
    int cw_length = correctedWord.Length;
    var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
    var result = new List<Mistake>(Math.Max(w_length, cw_length));

    if (w_length == 0)
    {
        for (int i = 0; i < cw_length; i++)
            result.Add(new Mistake(i, CharMistakeType.Insertion));
        return result;
    }

    for (int i = 0; i <= w_length; i++)
        d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);

    for (int j = 0; j <= cw_length; j++)
        d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);

    for (int i = 1; i <= w_length; i++)
    {
        for (int j = 1; j <= cw_length; j++)
        {
            bool equal = correctedWord[j - 1] == word[i - 1];
            int delCost = d[i - 1, j].Key + deletionCost;
            int insCost = d[i, j - 1].Key + insertionCost;
            int subCost = d[i - 1, j - 1].Key;
            if (!equal)
                subCost += substitutionCost;
            int transCost = int.MaxValue;
            if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
            {
                transCost = d[i - 2, j - 2].Key;
                if (!equal)
                    transCost += transpositionCost;
            }

            int min = delCost;
            CharMistakeType mistakeType = CharMistakeType.Deletion;
            if (insCost < min)
            {
                min = insCost;
                mistakeType = CharMistakeType.Insertion;
            }
            if (subCost < min)
            {
                min = subCost;
                mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
            }
            if (transCost < min)
            {
                min = transCost;
                mistakeType = CharMistakeType.Transposition;
            }

            d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
        }
    }

    int w_ind = w_length;
    int cw_ind = cw_length;
    while (w_ind >= 0 && cw_ind >= 0)
    {
        switch (d[w_ind, cw_ind].Value)
        {
            case CharMistakeType.None:
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Substitution:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
                w_ind--;
                cw_ind--;
                break;
            case CharMistakeType.Deletion:
                result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
                w_ind--;
                break;
            case CharMistakeType.Insertion:
                result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
                cw_ind--;
                break;
            case CharMistakeType.Transposition:
                result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
                w_ind -= 2;
                cw_ind -= 2;
                break;
        }
    }
    if (d[w_length, cw_length].Key > result.Count)
    {
        int delMistakesCount = d[w_length, cw_length].Key - result.Count;
        for (int i = 0; i < delMistakesCount; i++)
            result.Add(new Mistake(0, CharMistakeType.Deletion));
    }

    result.Reverse();

    return result;
}

public struct Mistake
{
    public int Position;
    public CharMistakeType Type;

    public Mistake(int position, CharMistakeType type)
    {
        Position = position;
        Type = type;
    }

    public override string ToString()
    {
        return Position + ", " + Type;
    }
}

public enum CharMistakeType
{
    None,
    Substitution,
    Insertion,
    Deletion,
    Transposition
}

このコードは、私のプロジェクトYandex-Linguistics.NETの一部です。

私はいくつかのテストを書きましたが、その方法は機能しているようです。

しかし、コメントやコメントは大歓迎です。

score 4 · Accepted Answer

別のアプローチを次に示します。

類似性を見つけるための典型的な方法はレーベンシュタイン距離であり、利用可能なコードを持つライブラリは間違いありません。

残念ながら、これにはすべての文字列を比較する必要があります。距離があるしきい値よりも大きい場合は、特別なバージョンのコードを記述して計算を短縮できる場合がありますが、それでもすべての比較を行う必要があります。

もう 1 つのアイデアは、トライグラムまたは n グラムの変形を使用することです。これらは n 文字 (または n 単語または n ゲノム配列または n 何でも) のシーケンスです。トリグラムから文字列へのマッピングを維持し、重複が最も大きいものを選択します。n の一般的な選択は「3」であるため、この名前が付けられています。

たとえば、英語には次のトライグラムがあります。

英語
ngl
グリ
リス
Hは

そして、イングランドは次のようになります。

英語
ngl
グラ
ラン
と

まあ、7 のうち 2 (または 10 のうち 4) 一致します。これが機能する場合は、トライグラム/文字列テーブルにインデックスを付けて検索を高速化できます。

これをレーベンシュタインと組み合わせて、共通の n グラムの最小数を持つ比較セットを減らすこともできます。

score 0 · Accepted Answer

VB.net の実装は次のとおりです。

Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
    Dim cost(v1.Length, v2.Length) As Integer
    If v1.Length = 0 Then
        Return v2.Length                'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
    ElseIf v2.Length = 0 Then
        Return v1.Length                'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
    Else
        'setup the base costs for inserting the correct characters
        For v1Count As Integer = 0 To v1.Length
            cost(v1Count, 0) = v1Count
        Next v1Count
        For v2Count As Integer = 0 To v2.Length
            cost(0, v2Count) = v2Count
        Next v2Count
        'now work out the cheapest route to having the correct characters
        For v1Count As Integer = 1 To v1.Length
            For v2Count As Integer = 1 To v2.Length
                'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
                'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1), 
                'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and 
                cost(v1Count, v2Count) = Math.Min(
                    cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
                    Math.Min(
                        cost(v1Count - 1, v2Count) + 1,
                        cost(v1Count, v2Count - 1) + 1
                    )
                )
            Next v2Count
        Next v1Count

        'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
        'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
        Return cost(v1.Length, v2.Length)
    End If
End Function

c# - 指定された2つの文字列の距離類似度を計算する方法は?

7 に答える 7

Related

Reference