java - Modifying the Levenshtein distance algorithm to not calculate all distances

Question

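I'm working on a fuzzy search implementation, and as part of it we're using Apache's StringUtils.getLevenshteinDistance. At the moment we're aiming for a specific maximum average response time for the fuzzy search. After various enhancements and some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time for search strings of three letters or more.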

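Now, I know there are limits to what can be done here, but I've read in previous SO questions and in the Wikipedia article on LD that if one is willing to limit the threshold to a set maximum distance, that can cut the time spent in the algorithm. I'm not sure how to do this exactly. The Wikipedia article says: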

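"If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string. [3]"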
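To make the size of that stripe concrete, here is a minimal sketch that only counts which cells a banded computation would touch. The class and method names (`BandSketch`, `firstColumn`, `lastColumn`, `cellsInBand`) are hypothetical and are not part of StringUtils; rows and columns are 1-based, as in the loops of the code below.

```java
// Hypothetical helper names; a sketch of which cells the 2k+1 stripe touches.
final class BandSketch {
    /** First column (1-based) inside the stripe for the given row. */
    static int firstColumn(int row, int k) {
        return Math.max(1, row - k);
    }

    /** Last column (1-based, inclusive) inside the stripe for the given row. */
    static int lastColumn(int row, int k, int columns) {
        return Math.min(columns, row + k);
    }

    /** Counts the cells a banded computation would fill in a rows x columns matrix. */
    static long cellsInBand(int rows, int columns, int k) {
        long cells = 0;
        for (int row = 1; row <= rows; row++) {
            int width = lastColumn(row, k, columns) - firstColumn(row, k) + 1;
            if (width > 0) { // the stripe can fall entirely outside a short row
                cells += width;
            }
        }
        return cells;
    }
}
```

For two 10-character strings and k = 2, the stripe covers 44 of the 100 cells; the saving grows as the strings get longer relative to k.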

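Below is the original Levenshtein distance code from StringUtils, followed by my modification. I'm basically trying to calculate only the distances within a set length of the i,j diagonal (so, in my example, the two diagonals above and the two below the i,j diagonal). However, this can't be correct the way I've done it. For example, on the highest diagonal it will always choose the cell directly above, which will be 0. If anyone could show me how to make this work as described, or offer general advice on how to go about it, I would greatly appreciate it.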

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

My modification (for loops only):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

score 5 · Accepted Answer

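I've written before about Levenshtein automata, which are one way to do this sort of check in O(n) time. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more detail.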
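The linked article is not reproduced here, so the following is not its code. It is a minimal Java sketch of the same idea under one simplifying assumption: instead of precompiling a DFA, it uses the Levenshtein DP row, capped at k+1, as the automaton state. Each step therefore costs O(pattern length) rather than O(1). The class name `LevenshteinAutomatonSketch` and its methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical class: simulates a Levenshtein automaton for `pattern` and
// maximum distance `k`, using the capped DP row as the automaton state.
final class LevenshteinAutomatonSketch {
    private final String pattern;
    private final int k;

    LevenshteinAutomatonSketch(String pattern, int k) {
        this.pattern = pattern;
        this.k = k;
    }

    /** Start state: distance from the empty input to each pattern prefix. */
    int[] start() {
        int[] state = new int[pattern.length() + 1];
        for (int i = 0; i < state.length; i++) {
            state[i] = Math.min(i, k + 1);
        }
        return state;
    }

    /** Transition: consume one input character and return the next state. */
    int[] step(int[] state, char c) {
        int[] next = new int[state.length];
        next[0] = Math.min(state[0] + 1, k + 1);
        for (int i = 1; i < state.length; i++) {
            int cost = pattern.charAt(i - 1) == c ? 0 : 1;
            int best = Math.min(Math.min(next[i - 1] + 1, state[i] + 1), state[i - 1] + cost);
            next[i] = Math.min(best, k + 1); // every value above k means "too far"
        }
        return next;
    }

    /** Accepting state: the whole pattern is within k edits of the input so far. */
    boolean isMatch(int[] state) {
        return state[state.length - 1] <= k;
    }

    /** Live state: some continuation of the input could still be accepted. */
    boolean canMatch(int[] state) {
        return Arrays.stream(state).min().getAsInt() <= k;
    }

    /** Runs the automaton over a candidate, stopping as soon as the state is dead. */
    boolean accepts(String candidate) {
        int[] state = start();
        for (int i = 0; i < candidate.length(); i++) {
            state = step(state, candidate.charAt(i));
            if (!canMatch(state)) {
                return false;
            }
        }
        return isMatch(state);
    }
}
```

Because values are capped at k+1, the number of distinct states is bounded, which is what allows a real implementation to precompute the transitions once per query.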

score 5 · Accepted Answer

The issue with implementing the window is dealing with the value to the left of the first entry and the value above the last entry in each row.

One way is to start the values you initially fill in at 1 instead of 0, then ignore any 0s you encounter. You'll have to subtract 1 from your final answer.

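Another way is to fill the entries to the left of the first and above the last with high values, so that the minimum check never picks them. That's the way I chose when I had to implement it the other day: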

import java.util.Arrays;

public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
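    }

    // the distance is at least the length difference, so bail out early; this also
    // keeps the d[min-1] write below inside the array when s is much longer than t
    if(slen - tlen > threshold)
        return -1;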

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the tailing member of the threshold 
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

    // no path stayed inside the threshold window
    if(p[tlen] == Integer.MAX_VALUE)
        return -1;
    return p[tlen];
}

score 3 · Accepted Answer

According to "Gusfield, Dan (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology" (page 264), you should ignore zeros.

score 1 · Accepted Answer

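Someone answers a very similar question here:

Quote:
I've done it a number of times. The way I do it is with a recursive depth-first tree walk of the game tree of possible changes. There is a budget k of changes, which I use to prune the tree. With that routine in hand, I first run it with k=0, then k=1, then k=2, until I either get a hit or don't want to go any higher.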

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}

This is just the main part; the original code has more.
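Since the question is about Java, here is a hedged port of the quoted walk together with the k = 0, 1, 2, ... driver that the quote describes but the snippet omits. The class name `BudgetedWalk` and the method `distanceWithin` are hypothetical, and the match and replacement branches are folded into one call. The search is exponential in k, so this is only a sketch for small budgets.

```java
// Hypothetical Java port of the quoted walk, plus the increasing-budget driver.
final class BudgetedWalk {
    private final String a;
    private final String b;

    BudgetedWalk(String a, String b) {
        this.a = a;
        this.b = b;
    }

    /** True if the suffixes a[ia..] and b[ib..] can be aligned with at most k edits. */
    boolean walk(int ia, int ib, int k) {
        if (k < 0) return false; // budget exhausted: prune this branch
        if (ia == a.length() && ib == b.length()) return true; // both strings consumed
        if (ia < a.length() && ib < b.length()) {
            // a match costs nothing, a replacement costs one edit
            int cost = a.charAt(ia) == b.charAt(ib) ? 0 : 1;
            if (walk(ia + 1, ib + 1, k - cost)) return true;
        }
        if (ia < a.length() && walk(ia + 1, ib, k - 1)) return true; // extra char in a
        if (ib < b.length() && walk(ia, ib + 1, k - 1)) return true; // extra char in b
        return false;
    }

    /** Smallest k <= maxK for which the walk succeeds, or -1 if none does. */
    int distanceWithin(int maxK) {
        for (int k = 0; k <= maxK; k++) {
            if (walk(0, 0, k)) return k;
        }
        return -1;
    }
}
```

For example, `new BudgetedWalk("kitten", "sitting").distanceWithin(5)` returns 3, while a budget of 2 returns -1.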

score 1 · Accepted Answer

Using your original code, put this just before the end of the j for loop:

    if (p[n] > s.length() + 5)
        break;

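The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number you pick), it doesn't really matter what is returned, because we consider the match simply too different. It does cut things down a bit. Still, I'm pretty sure this isn't the idea the Wiki statement was talking about, if anyone understands that better.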
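A related early exit, which is a different check from the one above, relies on the fact that the smallest value in a row of the matrix never decreases from one row to the next. Once every cell in the current row exceeds the threshold, the final distance must exceed it too. The sketch below assumes a caller-supplied `threshold`; the class name `EarlyExitLevenshtein` and the -1 return convention are assumptions for illustration.

```java
// Hypothetical helper: full-row DP with an early exit once no cell can recover.
final class EarlyExitLevenshtein {
    /** Returns the distance if it is <= threshold, otherwise -1. */
    static int distanceWithin(String s, String t, int threshold) {
        int n = s.length();
        int m = t.length();
        if (Math.abs(n - m) > threshold) {
            return -1; // the length difference alone already exceeds the threshold
        }
        int[] p = new int[n + 1]; // previous row
        int[] d = new int[n + 1]; // current row
        for (int i = 0; i <= n; i++) {
            p[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            d[0] = j;
            int rowMin = d[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                rowMin = Math.min(rowMin, d[i]);
            }
            if (rowMin > threshold) {
                return -1; // every later row is at least rowMin, so stop here
            }
            int[] tmp = p; // swap the rows; p now holds the row just computed
            p = d;
            d = tmp;
        }
        return p[n] <= threshold ? p[n] : -1;
    }
}
```

This still fills whole rows, so it does not reach the O(kl) bound of the diagonal stripe; it only stops sooner on candidates that are clearly too far away.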
