python - Levenshtein distance with weight/penalty for adjacency

Question

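I am using string edit distance (Levenshtein distance) to compare scan paths from an eye-tracking experiment. (At the moment I am using the stringdist package in R.)

Basically, the letters of the strings refer to (gaze) positions in a 6x4 matrix. The matrix is laid out as follows: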

     [,1] [,2] [,3] [,4]
[1,]  'a'  'g'  'm'  's' 
[2,]  'b'  'h'  'n'  't'
[3,]  'c'  'i'  'o'  'u'
[4,]  'd'  'j'  'p'  'v'
[5,]  'e'  'k'  'q'  'w'
[6,]  'f'  'l'  'r'  'x'

If I use the basic Levenshtein distance to compare strings, the comparison of a and g in a string gives the same estimate as the comparison of a and x.

For example:

'abc' compared to 'agc' -> 1
'abc' compared to 'axc' -> 1

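This means that the strings are equally (dis)similar.

I would like to put weights on the string comparison in a way that incorporates adjacency in the matrix. For example, the distance between a and x should be weighted as larger than the distance between a and g.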

One way could be to calculate the "walk" (horizontal and vertical steps) from one letter to the other in the matrix and divide it by the maximum "walk" distance (i.e. from a to x). For example, the "walk" distance from a to g would be 1 and from a to x it would be 8, resulting in weights of 1/8 and 1 respectively.
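To make the "walk" idea concrete, here is a minimal sketch. It assumes the letters a to x fill the 6x4 matrix column by column, as in the layout above; the function names (position, walk_distance, walk_weight) are made up for illustration and do not come from any package.

```python
ROWS, COLS = 6, 4                    # matrix dimensions from the question
MAX_WALK = (ROWS - 1) + (COLS - 1)   # longest walk, corner to corner (a to x) = 8


def position(letter):
    """Return (row, col), zero-based, for a letter in the column-wise a..x grid."""
    index = ord(letter) - ord("a")
    if not 0 <= index < ROWS * COLS:
        raise ValueError(f"letter {letter!r} is not in the a..x grid")
    # Letters run down each column first, so the row cycles every ROWS letters.
    return index % ROWS, index // ROWS


def walk_distance(a, b):
    """Number of horizontal plus vertical steps between two letters."""
    (r1, c1), (r2, c2) = position(a), position(b)
    return abs(r1 - r2) + abs(c1 - c2)


def walk_weight(a, b):
    """Walk distance scaled to the range 0..1 by the maximum walk."""
    return walk_distance(a, b) / MAX_WALK


print(walk_distance("a", "g"), walk_weight("a", "g"))  # 1 0.125
print(walk_distance("a", "x"), walk_weight("a", "x"))  # 8 1.0
```

With this scaling, identical letters get weight 0 and the two opposite corners get weight 1, which matches the 1/8 and 1 figures above.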

Is there a way to implement this (in either R or python)?

score 2 · Accepted Answer

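Check out this library: https://github.com/infoscout/weighted-levenshtein (disclaimer: I am the author). It supports weighted Levenshtein, weighted optimal string alignment, and weighted Damerau-Levenshtein distance. It is written in Cython for maximum performance and can be easily installed via pip install weighted-levenshtein. Feedback and pull requests are welcome.

Sample usage: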

import numpy as np
from weighted_levenshtein import lev


insert_costs = np.ones(128, dtype=np.float64)  # make an array of all 1's of size 128, the number of ASCII characters
insert_costs[ord('D')] = 1.5  # make inserting the character 'D' have cost 1.5 (instead of 1)

# you can just specify the insertion costs
# delete_costs and substitute_costs default to 1 for all characters if unspecified
print(lev('BANANAS', 'BANDANAS', insert_costs=insert_costs))  # prints '1.5'
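
For the adjacency weighting asked about in the question, the relevant hook would be the substitution costs rather than the insertion costs. The sketch below does not use the library at all. It is a plain-Python Levenshtein distance in which a substitution costs the "walk" distance between the two letters divided by the maximum walk (8), while insertions and deletions keep a cost of 1. The function names and the column-wise a to x layout are assumptions for illustration only.

```python
ROWS, COLS = 6, 4                    # matrix dimensions from the question
MAX_WALK = (ROWS - 1) + (COLS - 1)   # a to x, i.e. 8 steps


def substitution_cost(a, b):
    """Walk distance between two grid letters, scaled to 0..1 (assumed scheme)."""
    ia, ib = ord(a) - ord("a"), ord(b) - ord("a")
    for index in (ia, ib):
        if not 0 <= index < ROWS * COLS:
            raise ValueError("letters must be in the range a..x")
    # Column-wise layout: row = index % ROWS, column = index // ROWS.
    rows_apart = abs(ia % ROWS - ib % ROWS)
    cols_apart = abs(ia // ROWS - ib // ROWS)
    return (rows_apart + cols_apart) / MAX_WALK


def weighted_levenshtein(s, t, indel_cost=1.0):
    """Levenshtein distance with walk-weighted substitutions."""
    # previous[j] holds the cost of turning the processed prefix of s into t[:j].
    previous = [j * indel_cost for j in range(len(t) + 1)]
    for i, char_s in enumerate(s, start=1):
        current = [i * indel_cost]
        for j, char_t in enumerate(t, start=1):
            current.append(min(
                previous[j] + indel_cost,                           # delete char_s
                current[j - 1] + indel_cost,                        # insert char_t
                previous[j - 1] + substitution_cost(char_s, char_t),  # substitute
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("abc", "agc"))  # 0.25
print(weighted_levenshtein("abc", "axc"))  # 0.875
```

Note that in the question's example the substituted letter is b, not a: b to g is a walk of 2 (weight 0.25) and b to x is a walk of 7 (weight 0.875), so the two comparisons no longer come out equal. The same cost table could presumably be supplied to the library through its substitute_costs argument, but check its README for the expected array shape.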
