c# - 2つの文字列間の類似性を測定するにはどうすればよいですか？

Question

与えられた2つの文字列text1とtext2：

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

例：

最初の文字列：StackOverflow

2番目の文字列：StaqOverflow

リターン：類似性は91％です

戻り値は％またはそのようなものにすることができます。
最初の文字列：単純なテキストテスト

2番目の文字列：複雑なテキストテスト

戻り値：値は等しいと見なすことができます

何か案は？これを行うための最良の方法は何ですか？

score 42 · Accepted Answer

これを行うにはさまざまな方法があります。アルゴリズムを使用した他のページへのリンクについては、ウィキペディアの「文字列類似度」ページを参照してください。

ただし、これらのアルゴリズムのいずれも音を考慮していないと思います。したがって、「staqオーバーフロー」は、最初の発音が似ているにもかかわらず、「スタックオーバーフロー」と「stawオーバーフロー」に似ています。

かなり多くのオプションを提供する別のページを見つけました...特に、Soundexアルゴリズム（Wikipedia）はあなたが求めているものに近いかもしれません。

score 27 · Accepted Answer

27

レーベンシュタイン距離はおそらくあなたが探しているものです。

于 2009-06-23T19:29:05.507 に答える

score 14 · Accepted Answer

これが私が取り組んでいるプロジェクトのために書いたコードです。文字列の類似度と文字列の単語に基づく類似度を知る必要があります。この最後の1つは、最小の文字列の単語類似度（すべての単語が存在し、大きい文字列で一致する場合、結果は100％になります）と大きい文字列の単語類似度（私はRealWordsRatioと呼びます）の両方を知りたいです。）。レーベンシュタインアルゴリズムを使用して距離を見つけます。これまでのところ、コードは最適化されていませんが、期待どおりに機能します。お役に立てば幸いです。

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }

score 5 · Accepted Answer

しばらく前に、C＃でDoubleMetaphoneの実装を作成しました。Soundexなどよりもはるかに優れていることがわかります。

レーベンシュタイン距離も提案されており、これは多くの用途に最適なアルゴリズムですが、音声マッチングは実際には機能しません。音声的に類似した単語も通常同じように綴られるため、そのように見えることがあります。さまざまなあいまいマッチングアルゴリズムの分析を行いましたが、これも役立つ場合があります。

score 4 · Accepted Answer

'sound alikes'に対処するには、DoubleMetaphoneやsoundexなどの音声アルゴリズムを使用したエンコーディングを調べることをお勧めします。音声でエンコードされた文字列でレーベンシュタイン距離を計算することが有益かどうかはわかりませんが、可能性はあります。または、次のようなヒューリスティックを使用することもできます。文字列内の各単語をエンコードされた形式に変換し、両方の文字列に含まれる単語を削除して、レーベンシュタイン距離を計算する前に単一の表現に置き換えます。

score 3 · Accepted Answer

3

文字列「距離」、たとえばレーベンシュタイン距離を探すことができます。

于 2009-06-23T19:29:07.410 に答える

score 3 · Accepted Answer

3

PerlモジュールText::Phoneticには、さまざまなアルゴリズムの実装があります。

于 2009-06-23T19:40:22.223 に答える

score 2 · Accepted Answer

SQLデータベースの値を比較する場合は、SOUNDEX関数を使用できます。GoogleにSOUNDEXとC＃を問い合わせると、それとVBに対して同様の関数を作成した人もいます。

score 2 · Accepted Answer

Jeff Atwoodは、検索を絞り込むのに役立つ可能性のあるwiki投稿の作成者を決定するための同様のソリューションを探すことについて書いています。

score 1 · Accepted Answer

私もSoundexをお勧めする必要があります。過去に、スペルミスのある都市名を処理するためにSoundexを使用しました。使用法の良いリンクは次のとおりです：http ：//whitepapers.zdnet.com/abstract.aspx？docid = 352953

score 1 · Accepted Answer

音声で比較したい場合は、SoundexとMetaphoneのアルゴリズムを確認してください：http ：//www.blackbeltcoder.com/Articles/algorithms/phonetic-string-comparison-with-soundex

score 1 · Accepted Answer

Metaphone 3は、Metaphoneアルゴリズムの第3世代です。北米で最も一般的な英語の単語、名前、および英語以外の単語のデータベースに対してテストした場合、音声エンコーディングの精度がDouble Metaphoneの89％から98％に向上します。これにより、アメリカの発音に対して非常に信頼性の高い音声エンコーディングが生成されます。

Metaphone 3は、オリジナルのMetaphoneおよびDoubleMetaphoneアルゴリズムを設計および開発したLawrencePhilipsによって設計および開発されました。

c# - 2つの文字列間の類似性を測定するにはどうすればよいですか？

12 に答える 12

Related

Reference