python - How do I fix the bug in this Damerau-Levenshtein implementation?

Question

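I'm back with another long question. After experimenting with a number of Python-based Damerau-Levenshtein edit distance implementations, I finally found the one listed below as editdistance_reference(). It appears to deliver correct results and to be implemented efficiently.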

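So I set out to convert the code to Cython. On my test data, the reference method delivers results for 11,000 comparisons (for pairs of words about 12 letters long), while the Cythonized method does over 200,000 comparisons per second. Sadly, the results are incorrect. When I print out the variable thisrow for debugging, I see that in my version it gets filled with ones, whatever data I throw at it, while the reference output shows a different picture. For example, testing 'helo' against 'world' yields the following output (ED marks my function, EDR marks the reference, which works correctly).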

From editdistance():

#ED  A [0, 0, 0, 0, 0, 1]
#ED  B [1, 0, 0, 0, 0, 1]
#ED  B [1, 1, 0, 0, 0, 1]
#ED  B [1, 1, 1, 0, 0, 1]
#ED  B [1, 1, 1, 1, 0, 1]
#ED  B [1, 1, 1, 1, 1, 1]

#ED  A [0, 0, 0, 0, 0, 2]
#ED  B [1, 0, 0, 0, 0, 2]
#ED  B [1, 1, 0, 0, 0, 2]
#ED  B [1, 1, 1, 0, 0, 2]
#ED  B [1, 1, 1, 1, 0, 2]
#ED  B [1, 1, 1, 1, 1, 2]

#ED  A [0, 0, 0, 0, 0, 3]
#ED  B [1, 0, 0, 0, 0, 3]
#ED  B [1, 1, 0, 0, 0, 3]
#ED  B [1, 1, 1, 0, 0, 3]
#ED  B [1, 1, 1, 1, 0, 3]
#ED  B [1, 1, 1, 1, 1, 3]

#ED  A [0, 0, 0, 0, 0, 4]
#ED  B [1, 0, 0, 0, 0, 4]
#ED  B [1, 1, 0, 0, 0, 4]
#ED  B [1, 1, 1, 0, 0, 4]
#ED  B [1, 1, 1, 1, 0, 4]
#ED  B [1, 1, 1, 1, 1, 4]

From editdistance_reference():

#EDR A [0, 0, 0, 0, 0, 1]
#EDR B [1, 0, 0, 0, 0, 1]
#EDR B [1, 2, 0, 0, 0, 1]
#EDR B [1, 2, 3, 0, 0, 1]
#EDR B [1, 2, 3, 4, 0, 1]
#EDR B [1, 2, 3, 4, 5, 1]

#EDR A [0, 0, 0, 0, 0, 2]
#EDR B [2, 0, 0, 0, 0, 2]
#EDR B [2, 2, 0, 0, 0, 2]
#EDR B [2, 2, 3, 0, 0, 2]
#EDR B [2, 2, 3, 4, 0, 2]
#EDR B [2, 2, 3, 4, 5, 2]

#EDR A [0, 0, 0, 0, 0, 3]
#EDR B [3, 0, 0, 0, 0, 3]
#EDR B [3, 3, 0, 0, 0, 3]
#EDR B [3, 3, 3, 0, 0, 3]
#EDR B [3, 3, 3, 3, 0, 3]
#EDR B [3, 3, 3, 3, 4, 3]

#EDR A [0, 0, 0, 0, 0, 4]
#EDR B [4, 0, 0, 0, 0, 4]
#EDR B [4, 4, 0, 0, 0, 4]
#EDR B [4, 4, 4, 0, 0, 4]
#EDR B [4, 4, 4, 4, 0, 4]
#EDR B [4, 4, 4, 4, 4, 4]

I must be very stupid, because the error is probably one of those very obvious ones. But I can't seem to find it.

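There is a second problem: I malloc space for the three arrays twoago, oneago, and thisrow, which are then swapped around cyclically. When I try to do free( twoago ) and so on, I get a line where glibc complains about "double free or corruption". I googled that. Could it be that the pointer-swapping business makes glibc a little dizzy, so that it can no longer free the memory correctly?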

Below I list first the setup.py needed to do the compilation (run /path/to/python3.1 ./setup.py build_ext --inplace), then the edit distance code proper, so that interested people will find it easy to reproduce.

One more thing: this runs on Python 3.1. Funnily enough, inside the *.pyx file there are bare Unicode strings, yet print is still a statement, not a function.

Yes, I know this is a lot of code to paste here, but if I cut it down too far the code can no longer be run. I believe all methods except editdistance() work correctly, but feel free to point out any problems you see.

setup.py:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
  name            = 'cython_dameraulevenshtein',
  ext_modules     = [
    Extension( 'cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ] ), ],
  cmdclass        = {
    'build_ext': build_ext }, )

cython_dameraulevenshtein.pyx (scroll to the very end to see the interesting parts):

############################################################################################################
cdef extern from "stdlib.h":
  ctypedef  unsigned int size_t
  void      *malloc(size_t size)
  void      *realloc( void *ptr, size_t size )
  void      free(void *ptr)

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
  if a < b: return a
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
  if a < b:
    if c < a:
      return c
    return a
  if c < b:
    return c
  return b

#-----------------------------------------------------------------------------------------------------------
cdef inline int _warp( unsigned int limit, int value ):
  return value if value >= 0 else limit + value

############################################################################################################
# ARRAYS THAT SAY SIZE ;-)
#-----------------------------------------------------------------------------------------------------------
cdef class Array_of_unsigned_int:
  cdef unsigned int *data
  cdef unsigned int length

  #---------------------------------------------------------------------------------------------------------
  def __cinit__( self, unsigned int length, fill_value = None ):
    self.length = length
    self.data   = <unsigned int *>malloc( length * sizeof( unsigned int ) )  ###OBS### must check malloc doesn't return NULL pointer
    if fill_value is not None:
      self.fill( fill_value )

  #---------------------------------------------------------------------------------------------------------
  cdef fill( self, unsigned int value ):
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      d[ idx ] = value

  #---------------------------------------------------------------------------------------------------------
  cdef resize( self, unsigned int length ):
    self.data   = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) )  ###OBS### must check realloc doesn't return NULL pointer
    self.length = length

  #---------------------------------------------------------------------------------------------------------
  def free( self ):
    """Always remember the milk: Free up memory."""
    free( self.data )  ###OBS### should free memory here

  #---------------------------------------------------------------------------------------------------------
  def as_list( self ):
    """Return the array as a Python list."""
    R                       = []
    cdef unsigned int idx
    cdef unsigned int *d    = self.data
    for idx from 0 <= idx < self.length:
      R.append( d[ idx ] )
    return R


############################################################################################################
# CONVERTING UNICODE TO CHARACTER IDs (CIDs)
#---------------------------------------------------------------------------------------------------------
cdef unsigned int _UMX_surrogate_lower_bound    = 0x10000
cdef unsigned int _UMX_surrogate_upper_bound    = 0x10ffff
cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
cdef unsigned int _UMX_surrogate_foobar_factor  = 0x400

#---------------------------------------------------------------------------------------------------------
cdef Array_of_unsigned_int _cids_from_text( text ):
  """Given a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
  ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
  each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
  the array returned may be less than ``len( text )``."""
  #.........................................................................................................
  # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
  is_bytes = isinstance( text, ( bytes, bytearray, ) )
  assert is_bytes or isinstance( text, str ), '#121'
  #.........................................................................................................
  # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
  # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
  # text, there may be fewer CIDs if there happen to be surrogate characters):
  cdef unsigned int           length  = <unsigned int>len( text )
  cdef Array_of_unsigned_int  R       = Array_of_unsigned_int( length )
  #.........................................................................................................
  # If ``text`` is empty, we can return an empty array right away:
  if length == 0: return R
  #.........................................................................................................
  # Otherwise, prepare to copy data:
  cdef unsigned int idx               = 0
  #.........................................................................................................
  # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
  # values and are done:
  if is_bytes:
    for idx from 0 <= idx < length:
      R.data[ idx ] = <unsigned int>text[ idx ]
    return R
  #.........................................................................................................
  cdef unsigned int cid               = 0
  cdef bool         is_surrogate      = False
  cdef unsigned int hi                = 0
  cdef unsigned int lo                = 0
  cdef unsigned int chr_count         = 0
  #.........................................................................................................
  # Iterate over all indexes in text:
  for idx from 0 <= idx < length:
    #.......................................................................................................
    # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
    # corresponding low CID is on the current position. Having both, we can compute the intended CID
    # and reset the flag:
    if is_surrogate:
      lo = <unsigned int>ord( text[ idx ] )
      # IIRC, this formula was documented in Unicode 3:
      cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
            + ( lo - _UMX_surrogate_lo_lower_bound ) + _UMX_surrogate_lower_bound )
      is_surrogate = False
    #.......................................................................................................
    else:
      # Otherwise, we retrieve the CID from the current position:
      cid = <unsigned int>ord( text[ idx ] )
      #.....................................................................................................
      if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
        # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
        # in the next cycle:
        hi                = cid
        is_surrogate      = True
        continue
    #.......................................................................................................
    R.data[ chr_count ] = cid
    chr_count     += 1
  #.........................................................................................................
  # Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may
  # have fewer elements than the naive string length indicated; in this case, we want to free some memory
  # and correct array length data:
  if chr_count != length:
    R.resize( chr_count )
  #.........................................................................................................
  return R

#---------------------------------------------------------------------------------------------------------
def cids_from_text( text ):
  cdef Array_of_unsigned_int c_R  =_cids_from_text( text )
  R                               = c_R.as_list()
  c_R.free() ###OBS### should free memory here
  return R


############################################################################################################
# SECOND-ORDER SIMILARITY
#-----------------------------------------------------------------------------------------------------------
cpdef float similarity( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between
  0.0 and 1.0. Similarity is computed as ``1 - relative_editdistance( a, b )``, so a result of ``1.0``
  indicates identity, while ``0.0`` indicates complete dissimilarity."""
  return 1.0 - relative_editdistance( a, b )

#-----------------------------------------------------------------------------------------------------------
cpdef float relative_editdistance( char *a, char *b ):
  """Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return
  value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the
  length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete
  dissimilarity."""
  cdef int length = max( len( a ), len( b ) )
  if length == 0: return 0.0
  return editdistance( a, b ) / <float>length

############################################################################################################
# EDIT DISTANCE
#-----------------------------------------------------------------------------------------------------------
cpdef unsigned int editdistance( text_a, text_b ):
  """Given texts as Unicode strings or ``bytes`` / ``bytearray`` objects, return their absolute
  Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one
  difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is
  ``1``."""
  #.........................................................................................................
  # This should be fast in Python, as it can be (and probably is) implemented by doing an identity check in
  # the case of ``bytes`` and ``str`` objects:
  if text_a == text_b: return 0
  #.........................................................................................................
  # Convert Unicode text to C array of unsigned integers:
  cdef Array_of_unsigned_int a  = _cids_from_text( text_a )
  cdef Array_of_unsigned_int b  = _cids_from_text( text_b )
  R                             = c_editdistance( a, b )
  #.........................................................................................................
  # Always remember the milk:
  a.free()
  b.free()
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
cdef unsigned int c_editdistance( Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b ):
  # Conceptually, this is based on a (len(a) + 1) * (len(b) + 1) matrix.
  # However, only the current and two previous rows are needed at once,
  # so we only store those.
  #.........................................................................................................
  # This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function
  # that deals with the Python objects, q.v.
  # if cids_a.equals( cids_b ): return 0
  #.........................................................................................................
  cdef unsigned int a_length            = cids_a.length
  cdef unsigned int b_length            = cids_b.length
  #.........................................................................................................
  # Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the
  # other text. This also works for two empty texts, but those have already been taken care of by the
  # previous shortcut:
  #.........................................................................................................
  if a_length == 0: return b_length
  if b_length == 0: return a_length
  #.........................................................................................................
  cdef unsigned int row_length          = b_length   + 1
  cdef unsigned int row_length_1        = row_length - 1
  cdef unsigned int row_bytecount       = sizeof( unsigned int ) * row_length
  cdef unsigned int *oneago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *twoago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int *thisrow            = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  cdef unsigned int idx                 = 0
  cdef unsigned int idx_a               = 0
  cdef unsigned int idx_b               = 0
  cdef          int idx_a_1_text        = 0
  cdef          int idx_b_1_row         = 0
  cdef          int idx_b_2_row         = 0
  cdef          int idx_b_1_text        = 0
  cdef unsigned int deletion_cost       = 0
  cdef unsigned int addition_cost       = 0
  cdef unsigned int substitution_cost   = 0
  #.........................................................................................................
  # Equivalent of ``thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]``:
  #print( '#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1 )
  for idx from 1 <= idx < row_length:
    thisrow[ idx - 1 ] = idx
  thisrow[ row_length - 1 ] = 0
  #.........................................................................................................
  for idx_a from 0 <= idx_a < a_length:
    idx_a_1_text      = _warp(   a_length, idx_a - 1 )
    twoago, oneago = oneago, thisrow
    #.......................................................................................................
    # Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``:
    for idx from 0 <= idx < row_length_1:
      thisrow[ idx ] = 0
    thisrow[ row_length - 1 ] = idx_a + 1
    #.......................................................................................................
    # some diagnostic output:
    x = []
    for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
    print
    print '#ED  A', x
    #.......................................................................................................
    for idx_b from 0 <= idx_b < b_length:
      #.....................................................................................................
      idx_b_1_row       = _warp( row_length, idx_b - 1 )
      idx_b_1_text      = _warp(   b_length, idx_b - 1 )
      #.....................................................................................................
      assert 0 <= idx_b_1_row  < row_length, ( '#323', idx_b_1_row, )
      assert 0 <= idx_a_1_text <   a_length, ( '#324', idx_a_1_text, )
      assert 0 <= idx_b_1_text <   b_length, ( '#325', idx_b_1_text, )
      #.....................................................................................................
      deletion_cost     = oneago[  idx_b       ] + 1
      addition_cost     = thisrow[ idx_b_1_row ] + 1
      substitution_cost = oneago[  idx_b_1_row ] + ( 1 if    cids_a.data[ idx_a ]
                                                          != cids_b.data[ idx_b ] else 0 )
      thisrow[ idx_b ]  = _minimum_of_three_uints( deletion_cost, addition_cost, substitution_cost )
      #.....................................................................................................
      # Transpositions:
      if (  idx_a > 0
        and idx_b > 0
        and cids_a.data[ idx_a        ] == cids_b.data[ idx_b_1_text ]
        and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b        ]
        and cids_a.data[ idx_a        ] != cids_b.data[ idx_b        ] ):
        #...................................................................................................
        idx_b_2_row       = _warp( row_length, idx_b - 2 )
        assert 0 <= idx_b_2_row  < row_length, ( '#340', idx_b_2_row, )
        thisrow[ idx_b ]  = _minimum_of_two_uints( thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      x = []
      for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
      print '#ED  B', x
  #.........................................................................................................
  # Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the
  # shortcut above:
  cdef unsigned int R = thisrow[ b_length - 1 ]
  #.........................................................................................................
  # Always remember the milk:
  # BUG: Activating below lines leads to glibc failing with ``double free or corruption``
  #free( twoago )
  #free( oneago )
  #free( thisrow )
  #.........................................................................................................
  return R

#-----------------------------------------------------------------------------------------------------------
def editdistance_reference( text_a, text_b ):
  """This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions,
  insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned
  from the above method. Code adapted from
  http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance"""
  # Conceptually, the implementation is based on a ``( len( seq1 ) + 1 ) * ( len( seq2 ) + 1 )`` matrix.
  # However, only the current and two previous rows are needed at once, so we only store those. Python
  # lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This
  # matches with the zero-indexed strings and saves extra calculation.
  b_length  = len( text_b )
  oneago    = None
  thisrow   = list( range( 1, b_length + 1 ) ) + [ 0 ]
  for idx_a in range( len( text_a ) ):
    twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]
    #.......................................................................................................
    # some diagnostic output:
    print
    print '#EDR A', thisrow
    #.......................................................................................................
    for idx_b in range( b_length ):
      deletion_cost     = oneago[  idx_b     ] + 1
      addition_cost     = thisrow[ idx_b - 1 ] + 1
      substitution_cost = oneago[  idx_b - 1 ] + ( text_a[ idx_a ] != text_b[ idx_b ] )
      thisrow[ idx_b ]  = min( deletion_cost, addition_cost, substitution_cost )
      if (  idx_a > 0
        and idx_b > 0
        and text_a[ idx_a     ] == text_b[ idx_b - 1 ]
        and text_a[ idx_a - 1 ] == text_b[ idx_b     ]
        and text_a[ idx_a     ] != text_b[ idx_b     ] ):
        thisrow[ idx_b ] = min( thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1 )
      #.....................................................................................................
      # some diagnostic output:
      print '#EDR B', thisrow
      #.....................................................................................................
  return thisrow[ len( text_b ) - 1 ]

Edit: I have also posted this text to pastebin and to the Cython list.

score 2 · Accepted Answer

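Do some basic debugging. You can see that the trouble starts on the second output line marked #ED B. The wrong values seem to indicate that one edit is detected early and no more are detected after that. This is presumably because one of the min() arguments (deletion_cost, substitution_cost, or addition_cost) is stuck at 1 for some reason. Which one is wrong, and why? Print the input text values. Temporarily disable the transposition section to see whether that makes the problem go away. Check and recheck that _warp caper (a tricky hobbitses gimmick if ever I saw one) and its usage. What happens if you compare 'aaaaa' with 'aaaaa'? 'qwerty' with 'qwerty'? 'xxxxx' with 'yyyyy'? Do you get the problem with all of bytes, bytearray, and str inputs?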
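The probing suggested above can be sketched as a small harness. This is only an illustration under stated assumptions: osa_distance_full_matrix, PROBE_PAIRS, and find_mismatches are hypothetical names that do not appear in the question, and the full-matrix function is a deliberately plain oracle (restricted Damerau-Levenshtein, i.e. adjacent transpositions only), not the author's code. The candidate argument stands for whatever function is under test, for example the compiled editdistance().

```python
def osa_distance_full_matrix(a, b):
    """Plain full-matrix oracle: insertions, deletions, substitutions,
    and transpositions of adjacent items each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete everything from a
    for j in range(cols):
        d[0][j] = j                      # insert everything from b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


# The pairs named in the answer, plus the pair from the question.
PROBE_PAIRS = [("aaaaa", "aaaaa"), ("qwerty", "qwerty"),
               ("xxxxx", "yyyyy"), ("helo", "world")]


def find_mismatches(candidate, pairs=PROBE_PAIRS):
    """Run candidate on str, bytes and bytearray inputs; report disagreements."""
    converters = (
        ("str", lambda s: s),
        ("bytes", lambda s: s.encode("ascii")),
        ("bytearray", lambda s: bytearray(s, "ascii")),
    )
    mismatches = []
    for a, b in pairs:
        expected = osa_distance_full_matrix(a, b)
        for kind, convert in converters:
            got = candidate(convert(a), convert(b))
            if got != expected:
                mismatches.append((kind, a, b, expected, got))
    return mismatches
```

An empty list from find_mismatches means the candidate agrees with the oracle on every probe; each tuple in a non-empty list names the input type, the two words, the expected distance, and the value actually returned.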

The free problem: suspect corruption, not dizziness. Print the 3 arrays. Are the contents as expected? Try enabling the free() calls one array at a time. Do they all break? Only one? Which one?

A few side notes on memory management: read up on the Python-specific allocation routines and consider using them instead of malloc/free. Shrinking the array when there were surrogates seems like overkill.

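Update: I followed my own suggestions and stuffed in a print of the deletion cost. oneago was the same as thisrow. The problem, which causes both the wrong answers and the double (-! not corruption !-) free: your circular shuffle of the pointers wasn't circular.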

# twoago, oneago = oneago, thisrow ### BUG ###
twoago, oneago, thisrow = oneago, thisrow, twoago ### FIXED ###
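To see why the two-way assignment breaks things, here is a minimal sketch that uses Python lists as stand-ins for the three malloc'ed rows. The helper names (rotate_buggy, rotate_fixed, distinct_buffers_after) are made up for this illustration. With the buggy line, thisrow is never rebound, so oneago and thisrow name the same buffer after one pass, and all three names share one buffer after two passes. Zeroing thisrow then also zeroes oneago, which would explain the rows of ones, and calling free() on all three pointers would release the same block more than once.

```python
def rotate_buggy(twoago, oneago, thisrow):
    # The question's two-way assignment: thisrow is never rebound.
    twoago, oneago = oneago, thisrow
    return twoago, oneago, thisrow


def rotate_fixed(twoago, oneago, thisrow):
    # The answer's three-way rotation: every name moves to another buffer.
    twoago, oneago, thisrow = oneago, thisrow, twoago
    return twoago, oneago, thisrow


def distinct_buffers_after(rotate, passes):
    # Three separate lists stand in for the three C buffers.
    rows = ([0] * 6, [0] * 6, [0] * 6)
    for _ in range(passes):
        rows = rotate(*rows)
    # Count how many different objects the three names still refer to.
    return len({id(row) for row in rows})


print(distinct_buffers_after(rotate_buggy, 1))  # 2: oneago is thisrow
print(distinct_buffers_after(rotate_buggy, 2))  # 1: all three names alias
print(distinct_buffers_after(rotate_fixed, 4))  # 3: buffers stay separate
```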

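Update 2: [comment space too small] No mojo involved, just plain old debugging work, as I suggested. "Concentrating on this for my fix" is not the same as "very readable". The reference code creates a new list for thisrow on each pass; nothing is carried over from the previous pass. You don't need to do this. In fact, the initialization of everything other than the first and last elements could consist of random numbers; it is there only to populate the list so that elements can be indexed instead of appended, as a non-tricky implementation would do. So you can either slavishly emulate the "reference implementation", at the cost of doing extra (useless) malloc/free calls, or you can ignore the Python-specific implementation details and use the reference implementation only as a source of presumably correct answers. In the latter case, accept my fix, and later on save time by cutting out most of the initialization of the thisrow array.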

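Update 3: Here is an alternative reference implementation. It allocates the 3 rows up front, to avoid the overhead of creating lists inside the outer loop. It also avoids all unnecessary initialization, setting only the last element of thisrow. This makes the conversion to C/Cython easier.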

def damlevref2(seq1, seq2):
    # For Python 2.x as was the original.
    # Appears to work on Python 1.5.2 as well :-)
    seq2len = len(seq2)
    twoago = [-777] * (seq2len + 1) # pseudo-malloc; any old rubbish will do
    oneago = [-666] * (seq2len + 1) # ditto
    thisrow = range(1, seq2len + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, twoago # circular "pointer" shuffle
        thisrow[-1] = x + 1
        for y in xrange(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                and seq1[x-1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]
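Since the question targets Python 3.1 while the function above is written for Python 2.x, a Python 3 rendering may be handy for cross-checking. The following is a sketch of such a port, not part of the original answer: the name damlevref2_py3 is hypothetical, range replaces xrange, the initial row is wrapped in list(), and None replaces the arbitrary filler values.

```python
def damlevref2_py3(seq1, seq2):
    """Python 3 port (an assumption, not the answer's code) of damlevref2."""
    seq2len = len(seq2)
    # Allocate all three rows once; the filler values are never read.
    twoago = [None] * (seq2len + 1)
    oneago = [None] * (seq2len + 1)
    # The leftmost matrix column lives at index -1 of each row.
    thisrow = list(range(1, seq2len + 1)) + [0]
    for x in range(len(seq1)):
        # Circular "pointer" shuffle: all three names are rebound.
        twoago, oneago, thisrow = oneago, thisrow, twoago
        thisrow[-1] = x + 1
        for y in range(seq2len):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
            # Transposition of two adjacent items:
            if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
                    and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
                thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
    return thisrow[seq2len - 1]


# The examples from the editdistance() docstring: each is one edit from 'abc'.
for other in ("ab", "abcx", "abx", "acb"):
    print(other, damlevref2_py3("abc", other))
```

Each of the four docstring examples should print a distance of 1, and the pair from the question, 'helo' and 'world', should give 4.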
