algorithm - レーベンシュタイン距離を計算する最も効率的な方法

Question

辞書内の文字列に最も近い一致を見つけるために、最適一致ファイル検索アルゴリズムを実装しました。コードのプロファイリングを行った後、クエリと可能な結果との間の距離の計算に圧倒的に多くの時間が費やされていることがわかりました。現在、2 次元配列を使用してレーベンシュタイン距離を計算するアルゴリズムを実装しています。これにより、実装は O(n^2) 演算になります。誰かが同じことをするより速い方法を提案できることを望んでいました。

これが私の実装です：

public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}

score 26 · Accepted Answer

レーベンシュタイン距離に関するウィキペディアのエントリには、計算を最適化するための有用な提案があります。あなたの場合に最も適切なものは、k関心のある最大距離に制限を設けることができる場合(それを超えるものは無限大である可能性があります!)、削減できることです。のO(n times k)代わりに計算を行いますO(n squared)(基本的には、最小可能距離がになるとすぐにあきらめます> k)。

最も近い一致を探しているkので、これまでに見つかった最良の一致の距離まで徐々に減らすことができます。これは、最悪の場合の動作には影響しません (一致は距離の降順である可能性があるため、つまり、すぐに救済されることはありません）が、平均的なケースは改善されるはずです。

大幅に優れたパフォーマンスを得る必要がある場合は、よりおおよその距離を計算する強力な妥協を受け入れる必要があるかもしれません (したがって、必ずしも最適なものではなく、「適度に良い一致」が得られます)。

score 7 · Accepted Answer

このブログへのコメント、Speeding Up Levenshteinによると、VP-Tree を使用して O(nlogn) を実現できます。同じブログの別のコメントは、 VP-Trees と Levenshtein の Python 実装を指しています。これが機能するかどうかお知らせください。

score 3 · Accepted Answer

この投稿にあるレーベンシュタイン距離 VBA 関数を変更して、1 次元配列を使用しました。はるかに高速に実行されます。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

この関数を、長さ 11,304 の文字列 's1' と長さ 5,665 の 's2' でテストしました (> 6,400 万文字の比較)。上記の関数の 1 次元バージョンでは、私のマシンでの実行時間は ~24 秒です。上記のリンクで参照した元の 2 次元関数では、同じ文字列に対して約 37 秒かかります。以下に示すように、1 次元関数をさらに最適化しました。同じ文字列に約 10 秒かかります。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function

score 3 · Accepted Answer

The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.

There are however some improvements if you can restrict your problem (e.g. if you are only interested in the distance if it's smaller than d, complexity is O(dn) - this might make sense as a match whose distance is close to the string length is probably not very interesting ). See if you can exploit the specifics of your problem...

score 2 · Accepted Answer

これが非常に遅いことは承知していますが、目前の議論に関連しています。

他の人が述べたように、2 つの文字列間の編集距離がしきい値 k 内にあるかどうかを確認するだけであれば、時間の複雑さをO(kn)に減らすことができます。より正確な式はO((2k+1)n)です。対角セルの両側にある k 個のセルにまたがるストリップ (ストリップの長さ 2k+1) を取り、このストリップ上にあるセルの値を計算します。

興味深いことに、Li らによる改善がありました。アル。これはさらに O((k+1)n) に縮小されました。

score 2 · Accepted Answer

Commons-lang の実装はかなり高速です。http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htmを参照してください。

それをScalaに翻訳すると、次のようになります。

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}

algorithm - レーベンシュタイン距離を計算する最も効率的な方法

6 に答える 6

Related

Reference