string - 文字列類似度 -> レーベンシュタイン距離

Question

レーベンシュタインアルゴリズムを使用して、2 つの文字列の類似性を検出しています。これは私が作成しているプログラムの非常に重要な部分なので、効果的である必要があります。問題は、アルゴリズムが次の例を類似として検出しないことです。

エアコン
_

アルゴリズムは 6 の距離を与えます。したがって、この 6 文字の単語 (文字数が最も多い単語を見てください) の場合、差は 100% => 類似度は 0% です。

2 つの文字列の類似点を見つける方法を見つける必要がありますが、前に提示したようなケースも考慮に入れる必要があります。

私が使用できるより良いアルゴリズムはありますか? または、皆さんは私に何をお勧めしますか？

編集:転置を追加する「Damerau–Levenshtein」アルゴリズムも調べました。問題は、この転置は隣接する文字のみを対象とする (多数の文字を対象とするものではない) ことです。

score 11 · Accepted Answer

用語をユニグラム、バイグラム、トライグラムに分割し、コサイン類似度を計算します。

score 6 · Accepted Answer

これは、文字列の1つ（「conair」など）とそれ自体に一度追加されたもう1つの文字列（「aircon」->「airconaircon」など）にLongest Common Substring /Subsequenceアルゴリズムを使用することで簡単に解決できると思います。

Cのサンプルコード：

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns the length of the longest common substring (LCS)
// between two given strings.
//
// This recursive implementation can be replaced by a
// more performant dynamic programming implementation.
size_t llcs(const char* s1, const char* s2)
{
  size_t len[3];

  if (*s1 == '\0' || *s2 == '\0') return 0;

  len[0] = (*s1 == *s2) + llcs(s1 + 1, s2 + 1);
  len[1] = llcs(s1 + 1, s2);
  len[2] = llcs(s1, s2 + 1);

  if (len[0] < len[1]) len[0] = len[1];
  if (len[0] < len[2]) len[0] = len[2];

  return len[0];
}

// Returns similarity of two given strings in the range
// from 0.0 to 1.0 (1.0 for equal strings).
double similarity(const char* s1, const char* s2)
{
  size_t s1len = strlen(s1);
  size_t s2len = strlen(s2);
  double sim;

  if (s1len == 0 && s2len == 0)
  {
    // Two empty strings are equal
    sim = 1;
  }
  else
  {
    size_t len;
    // Append s1 to itself in s1s1 (e.g. "aircon" -> "airconaircon")
    char* s1s1 = malloc(s1len * 2 + 1);
    strcpy(s1s1, s1);
    strcpy(s1s1 + s1len, s1);

    // Find the length of the LCS between s1s1 and s2
    // (e.g. between "airconaircon" and "conair")
    len = llcs(s1s1, s2);
    // We need it not longer than s1 (e.g. "aircon")
    // since we're actually comparing s1 and s2
    if (len > s1len) len = s1len;

    len *= 2;

    // Prevent 100% similarity between a string and its
    // cyclically shifted version (e.g. "aircon" and "conair")
    if (len == s1len + s2len && strcmp(s1, s2) != 0) len--;

    // Get the final measure of the similarity
    sim = (double)len / (s1len + s2len);

    free(s1s1);
  }

  return sim;
}

int main(int argc, char** argv)
{
  if (argc == 3)
    printf("Similarity of \"%s\" and \"%s\" is %.2f%%\n",
           argv[1], argv[2], 100 * similarity(argv[1], argv[2]));
  else
    printf("Usage:\n  %s string1 string2\n",
           argv[0]);
  return 0;
}

サンプル出力：

Similarity of "123" and "123" is 100.00%
Similarity of "123" and "1234" is 85.71%
Similarity of "0123" and "123" is 85.71%
Similarity of "a" and "aa" is 66.67%
Similarity of "aa" and "a" is 66.67%
Similarity of "aaaaaaa" and "aaaaaa" is 92.31%
Similarity of "aaaaaa" and "aaaaaaa" is 92.31%
Similarity of "aircon" and "conair" is 91.67%
Similarity of "spit" and "pits" is 87.50%
Similarity of "pits" and "spit" is 87.50%
Similarity of "spits" and "pits" is 88.89%
Similarity of "pits" and "spits" is 88.89%

score 2 · Accepted Answer

文字の代わりに音節や音素を使ってレーベンシュタイン距離を試してみたくなるかもしれません。

score 2 · Accepted Answer

理論的には、使用しているアプローチは、解決しようとしている問題に対して正しいものです。しかし、レベンスタインは個々のキャラクターだけを2つのセットと見なします。

文字列の類似性は、最長共通部分列法を使用して見つけることもできます。その後、一致しない残りの部分でレーベンシュタインを見ることができます。

クラスター化されたアプローチを実行したい場合、次の回答にはいくつかの詳細があるようですが、明らかに実装するのはより困難です。

score 2 · Accepted Answer

sorenson、jaccard、jaro_winkler などの他の類似性尺度を使用してみてください

個人的には、ジャロ・ウィンクラーの大ファンです。

from Levenshtein import jaro_winkler
In [2]: jaro_winkler("conair","aircon")
Out[2]: 0.8333333333333334

score 2 · Accepted Answer

単語を並べ替えてレーベンシュタインを見つけると、例と 100% 一致しますが、たとえば、次の場合も 100% 一致します。

CONAIR
RCIAON

これはあなたが望むものではないかもしれません。

類似性を定義するもう 1 つの方法は、2 つの文字列に共通する部分文字列を見つけることです。サフィックスツリーを作成し、すべての一般的な部分文字列を見つけて、それらがどの程度類似しているかを判断することができます。したがって、たとえば、接尾辞ツリーは、単語全体をカバーする CON & AIR として共通の部分文字列を提供し (2 つの文字列)、類似していると結論付けます。

string - 文字列類似度 -> レーベンシュタイン距離

7 に答える 7

Related

Reference