python - レーベンシュタイン距離の計算を高速化する方法

Question

シミュレーションを実行して、ランダムなバイナリ文字列間の平均レーベンシュタイン距離をテストしようとしています。

私のプログラムは python ですが、このC 拡張機能を使用しています。関連性があり、ほとんどの時間を費やす関数は、2 つの文字列間のレーベンシュタイン距離を計算するもので、これです。

lev_edit_distance(size_t len1, const lev_byte *string1,
                  size_t len2, const lev_byte *string2,
                  int xcost)
{
  size_t i;
  size_t *row;  /* we only need to keep one row of costs */
  size_t *end;
  size_t half;

  /* strip common prefix */
  while (len1 > 0 && len2 > 0 && *string1 == *string2) {
    len1--;
    len2--;
    string1++;
    string2++;
  }

  /* strip common suffix */
  while (len1 > 0 && len2 > 0 && string1[len1-1] == string2[len2-1]) {
    len1--;
    len2--;
  }

  /* catch trivial cases */
  if (len1 == 0)
    return len2;
  if (len2 == 0)
    return len1;

  /* make the inner cycle (i.e. string2) the longer one */
  if (len1 > len2) {
    size_t nx = len1;
    const lev_byte *sx = string1;
    len1 = len2;
    len2 = nx;
    string1 = string2;
    string2 = sx;
  }
  /* check len1 == 1 separately */
  if (len1 == 1) {
    if (xcost)
      return len2 + 1 - 2*(memchr(string2, *string1, len2) != NULL);
    else
      return len2 - (memchr(string2, *string1, len2) != NULL);
  }
  len1++;
  len2++;
  half = len1 >> 1;
  /* initalize first row */
  row = (size_t*)malloc(len2*sizeof(size_t));
  if (!row)
    return (size_t)(-1);
  end = row + len2 - 1;
  for (i = 0; i < len2 - (xcost ? 0 : half); i++)
    row[i] = i;

  /* go through the matrix and compute the costs.  yes, this is an extremely
   * obfuscated version, but also extremely memory-conservative and relatively
   * fast.  */
  if (xcost) {
    for (i = 1; i < len1; i++) {
      size_t *p = row + 1;
      const lev_byte char1 = string1[i - 1];
      const lev_byte *char2p = string2;
      size_t D = i;
      size_t x = i;
      while (p <= end) {
        if (char1 == *(char2p++))
          x = --D;
        else
          x++;
        D = *p;
        D++;
        if (x > D)
          x = D;
        *(p++) = x;
      }
    }
  }
  else {
    /* in this case we don't have to scan two corner triangles (of size len1/2)
     * in the matrix because no best path can go throught them. note this
     * breaks when len1 == len2 == 2 so the memchr() special case above is
     * necessary */
    row[0] = len1 - half - 1;
    for (i = 1; i < len1; i++) {
      size_t *p;
      const lev_byte char1 = string1[i - 1];
      const lev_byte *char2p;
      size_t D, x;
      /* skip the upper triangle */
      if (i >= len1 - half) {
        size_t offset = i - (len1 - half);
        size_t c3;

        char2p = string2 + offset;
        p = row + offset;
        c3 = *(p++) + (char1 != *(char2p++));
        x = *p;
        x++;
        D = x;
        if (x > c3)
          x = c3;
        *(p++) = x;
      }
      else {
        p = row + 1;
        char2p = string2;
        D = x = i;
      }
      /* skip the lower triangle */
      if (i <= half + 1)
        end = row + len2 + i - half - 2;
      /* main */
      while (p <= end) {
        size_t c3 = --D + (char1 != *(char2p++));
        x++;
        if (x > c3)
          x = c3;
        D = *p;
        D++;
        if (x > D)
          x = D;
        *(p++) = x;
      }
      /* lower triangle sentinel */
      if (i <= half) {
        size_t c3 = --D + (char1 != *char2p);
        x++;
        if (x > c3)
          x = c3;
        *p = x;
      }
    }
  }

  i = *end;
  free(row);
  return i;
}

これは高速化できますか？

AMD FX(tm)-8350 8 コアプロセッサ上の 32 ビット ubuntu でコードを実行します。

これを呼び出す python コードを次に示します。

from Levenshtein import distance
import random
for i in xrange(16):
    sum = 0
    for j in xrange(1000):
        str1 = bin(random.getrandbits(2**i))[2:].zfill(2**i)
        str2 = bin(random.getrandbits(2**i))[2:].zfill(2**i)
        sum += distance(str1,str2)
    print i,sum/(1000*2**i)

score 1 · Accepted Answer

What I'd do:

1) Very small optimization: allocate once and for all row to avoid memory management overhead. Or you may try realloc(), or you could keep track of row's size in a static variable (and have row static as well). This saves very little, however, even if it costs little to put in place.

2) You are trying to calculate an average. Do the average calculation in C as well. This ought to save something in calls. Again, small change, but it comes cheap.

3) Since you're not interested in the actual calculations but only in the results, then, say you have three PC's and each of them is a quad-core machine. Then run on each of them four instances of the program, with the loop being twelve times shorter. You will get twelve results in one twelfth of the time: average those, and Bob's your uncle.

Option #3 requires no modifications at all except for the cycle, and you may want to make it a command line parameter, so that you can deploy the program on a variable number of computers. Actually, you may want to output both the result and its "weight", to minimize chances of errors when you sum the results together.

for j in xrange(N):
    str1 = bin(random.getrandbits(2**i))[2:].zfill(2**i)
    str2 = bin(random.getrandbits(2**i))[2:].zfill(2**i)
    sum += distance(str1,str2)
print N,i,sum/(N*2**i)

But if you're interested in a generic Levenshtein statistic, I'm not so sure that doing the calculation with only 0 and 1 symbols is suitable to your purpose. From the string 01010101, you get 10101010 either by flipping eight characters or by dropping the first and adding a zero at the end, with two different costs. If you have all the letters of the alphabet, the second possibility becomes much less likely, and this ought to change something in the average cost scenario. Or am I missing something?

score 0 · Accepted Answer

他の誰かが 1 年か 2 年前に多くの調査を行い、実行時テストも行いました。

彼はこれを思いつき、基本的にソリューションツリーを使用して作業を高速化しました。

python - レーベンシュタイン距離の計算を高速化する方法

4 に答える 4

Related

Reference