java - Implementing a simple Trie for efficient Levenshtein distance calculation - Java

Question

Update 3

Done. Below is the code that finally passes all of my tests. Again, it is modeled after Murilo Vasconcelos's modified version of Steve Hanov's algorithm. Thanks to everyone who helped!

/**
 * Computes the minimum Levenshtein Distance between the given word (represented as an array of Characters) and the
 * words stored in theTrie. This algorithm is modeled after Steve Hanov's blog article "Fast and Easy Levenshtein
 * distance using a Trie" and Murilo Vasconcelo's revised version in C++.
 * 
 * http://stevehanov.ca/blog/index.php?id=114
 * http://murilo.wordpress.com/2011/02/01/fast-and-easy-levenshtein-distance-using-a-trie-in-c/
 * 
 * @param ArrayList<Character> word - the characters of an input word as an array representation
 * @return int - the minimum Levenshtein Distance
 */
private int computeMinimumLevenshteinDistance(ArrayList<Character> word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int iWordLength = word.size();
    int[] currentRow = new int[iWordLength + 1];

    for (int i = 0; i <= iWordLength; i++) {
        currentRow[i] = i;
    }

    for (int i = 0; i < iWordLength; i++) {
        traverseTrie(theTrie.root, word.get(i), word, currentRow);
    }
    return theTrie.minLevDist;
}

/**
 * Recursive helper function. Traverses theTrie in search of the minimum Levenshtein Distance.
 * 
 * @param TrieNode node - the current TrieNode
 * @param char letter - the current character of the current word we're working with
 * @param ArrayList<Character> word - an array representation of the current word
 * @param int[] previousRow - a row in the Levenshtein Distance matrix
 */
private void traverseTrie(TrieNode node, char letter, ArrayList<Character> word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int minimumElement = currentRow[0];
    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);

        if (currentRow[i] < minimumElement) {
            minimumElement = currentRow[i];
        }
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minimumElement < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            traverseTrie(node.children.get(c), c, word, currentRow);
        }
    }
}

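Update 2

I finally managed to get this working for most of my test cases. My implementation is practically a direct translation of Murilo's C++ version of Steve Hanov's algorithm. So how should I refactor and/or optimize this algorithm? Below is the code...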

public int search(String word) {

    theTrie.minLevDist = Integer.MAX_VALUE;

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
    return theTrie.minLevDist;
}
private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int insertCost, deleteCost, replaceCost;

    for (int i = 1; i < size; i++) {

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;

        if (word.charAt(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }
        currentRow[i] = minimum(insertCost, deleteCost, replaceCost);
    }

    if (currentRow[size - 1] < theTrie.minLevDist && node.isWord) {
        theTrie.minLevDist = currentRow[size - 1];
    }

    if (minElement(currentRow) < theTrie.minLevDist) {

        for (Character c : node.children.keySet()) {
            searchRec(node.children.get(c), c, word, currentRow);

        }
    }
}

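Thanks to everyone who contributed to this question. I tried getting Levenshtein Automata to work, but I couldn't make it happen.

So I'm looking for suggestions on refactoring and/or optimizing the code above. Please let me know if anything is confusing. As always, I can provide the rest of the source code if needed.

Update 1

So I've implemented a simple Trie data structure and have been trying to follow Steve Hanov's Python tutorial to compute the Levenshtein distance. Actually, I'm interested in computing the minimum Levenshtein distance between a given word and the words in the Trie, so I've been following Murilo Vasconcelos's version of Steve Hanov's algorithm. It isn't working very well, but here is my Trie class: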

public class Trie {

    public TrieNode root;
    public int minLevDist;

    public Trie() {
        this.root = new TrieNode(' ');
    }

    public void insert(String word) {

        int length = word.length();
        TrieNode current = this.root;

        if (length == 0) {
            current.isWord = true;
        }
        for (int index = 0; index < length; index++) {

            char letter = word.charAt(index);
            TrieNode child = current.getChild(letter);

            if (child != null) {
                current = child;
            } else {
                current.children.put(letter, new TrieNode(letter));
                current = current.getChild(letter);
            }
            if (index == length - 1) {
                current.isWord = true;
            }
        }
    }
}

...and the TrieNode class:

public class TrieNode {

    public final int ALPHABET = 26;

    public char letter;
    public boolean isWord;
    public Map<Character, TrieNode> children;

    public TrieNode(char letter) {
        this.isWord = false;
        this.letter = letter;
        children = new HashMap<Character, TrieNode>(ALPHABET);
    }

    public TrieNode getChild(char letter) {

        if (children != null) {
            if (children.containsKey(letter)) {
                return children.get(letter); 
            }
        }
        return null;
    }
}

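I've tried to implement the search the way Murilo Vasconcelos has it, but something is off and I need some help debugging this. Please give suggestions on how to refactor this and/or point out where the bugs are. The very first thing I'd like to refactor is the "minCost" global variable, but that's the smallest of my concerns. Anyway, here's the code...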

public void search(String word) {

    int size = word.length();
    int[] currentRow = new int[size + 1];

    for (int i = 0; i <= size; i++) {
        currentRow[i] = i;
    }
    for (int i = 0; i < size; i++) {
        char c = word.charAt(i);
        if (theTrie.root.children.containsKey(c)) {
            searchRec(theTrie.root.children.get(c), c, word, currentRow);
        }
    }
}

private void searchRec(TrieNode node, char letter, String word, int[] previousRow) {

    int size = previousRow.length;
    int[] currentRow = new int[size];
    currentRow[0] = previousRow[0] + 1;

    int replace, insertCost, deleteCost;

    for (int i = 1; i < size; i++) {

        char c = word.charAt(i - 1);

        insertCost = currentRow[i - 1] + 1;
        deleteCost = previousRow[i] + 1;
        replace = (c == letter) ? previousRow[i - 1] : (previousRow[i - 1] + 1);

        currentRow[i] = minimum(insertCost, deleteCost, replace);
    }

    if (currentRow[size - 1] < minCost && !node.isWord) {
        minCost = currentRow[size - 1];
    }
    Integer minElement = minElement(currentRow);
    if (minElement < minCost) {

        for (Map.Entry<Character, TrieNode> entry : node.children.entrySet()) {
            searchRec(node, entry.getKey(), word, currentRow);
        }
    }
}

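I apologize for the lack of comments. So what am I doing wrong?

Initial post

I've been reading the article "Fast and Easy Levenshtein distance using a Trie" in hopes of figuring out an efficient way to compute the Levenshtein distance between two strings. My main goal with this is, given a large set of words, to be able to find the minimum Levenshtein distance between an input word (or words) and this set of words.

In my trivial implementation, for each input word I compute the Levenshtein distance between the input word and every word in the set, and return the minimum. It works, but it is not efficient...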
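For reference, the trivial approach described above might look roughly like the sketch below. The class and method names (NaiveMinDistance, levenshtein, minDistance) are made up for illustration and are not the poster's actual code; the distance function is the standard two-row dynamic-programming version.

```java
import java.util.Arrays;
import java.util.List;

class NaiveMinDistance {

    // Standard two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Brute force: one full distance computation per dictionary word.
    static int minDistance(String input, List<String> words) {
        int best = Integer.MAX_VALUE;
        for (String w : words) {
            best = Math.min(best, levenshtein(input, w));
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(minDistance("arc", Arrays.asList("arb", "area")));
    }
}
```

Every lookup costs one full matrix per dictionary word, which is the work a trie lets you share across words with common prefixes.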

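I've been looking for implementations of a Trie in Java, and I've come across two seemingly good sources:

Koders.com version
code.google.com version (Edit: this appears to have moved to github.com/rkapsi)

However, these implementations seem too complicated for what I'm trying to do. As I read through them to understand how they work and how Trie data structures work in general, I only became more confused.

So how would I implement a simple Trie data structure in Java? My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

Once that is implemented, the next task is to compute the Levenshtein distance. I read through the Python code example in the article above, but I don't speak Python, and my Java implementation runs out of heap memory once I hit the recursive search. So how would I compute the Levenshtein distance using the Trie data structure? I have a trivial implementation, modeled after this source code, but it doesn't use a Trie... it is inefficient.

It would be really nice to see some code in addition to your comments and suggestions. After all, this is a learning process for me... I've never implemented a Trie... so I have plenty to learn from this experience.

Thanks.

p.s. I can provide any source code if need be. Also, I've already read through and tried using a BK-Tree as suggested in Nick Johnson's blog, but it's not as efficient as I think it can be... or maybe my implementation is wrong.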

score 11 · Accepted Answer

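From what I can tell, you don't need to improve the efficiency of the Levenshtein distance itself; you need to store your strings in a structure that keeps you from running the distance computation so many times, i.e. by pruning the search space.

Since Levenshtein distance is a metric, you can use any metric space index that takes advantage of the triangle inequality. You mentioned BK-Trees, but there are others, e.g. Vantage Point Trees, Fixed-Queries Trees, Bisector Trees, and Spatial Approximation Trees. Here are their descriptions:

Burkhard-Keller Tree

Nodes are inserted into the tree as follows: for the root node, pick an arbitrary element from the space; add unique edge-labeled children such that the value of each edge is the distance from the pivot to that element; apply recursively, selecting the child as the pivot when an edge already exists.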
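To make the insertion rule concrete, here is a minimal BK-tree sketch in Java. The class name BKTree and the nearest query are my own illustrative additions (the description above only covers insertion). The query relies on the triangle inequality: every word stored under the edge labeled k is at distance exactly k from the pivot, so it cannot be closer to the query than |d(query, pivot) - k|.

```java
import java.util.HashMap;
import java.util.Map;

class BKTree {
    private final String pivot;
    // Edge label (distance to the pivot) -> subtree whose words all have that distance.
    private final Map<Integer, BKTree> children = new HashMap<>();

    BKTree(String pivot) {
        this.pivot = pivot;
    }

    void insert(String word) {
        int d = distance(pivot, word);
        if (d == 0) {
            return; // the word is already stored at this node
        }
        BKTree child = children.get(d);
        if (child == null) {
            children.put(d, new BKTree(word));
        } else {
            child.insert(word); // edge exists: the child becomes the pivot
        }
    }

    // Minimum distance from the query to any stored word.
    int nearest(String query) {
        return nearest(query, Integer.MAX_VALUE);
    }

    private int nearest(String query, int best) {
        int d = distance(pivot, query);
        best = Math.min(best, d);
        for (Map.Entry<Integer, BKTree> e : children.entrySet()) {
            // Lower bound for the whole subtree; skip it if it cannot beat best.
            if (Math.abs(d - e.getKey()) < best) {
                best = e.getValue().nearest(query, best);
            }
        }
        return best;
    }

    // Standard two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

How much this prunes depends heavily on the data; with short words and small distances many edge labels fall inside the search band, which may be why a BK-tree can feel slower than expected.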

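Fixed-Queries Tree

As with BKTs, except: elements are stored at the leaves; each leaf has multiple elements; and the same pivot is used at each level of the tree.

Bisector Tree

Each node contains two pivot elements with their covering radius (the maximum distance between the center element and any of its subtree elements). Filter the elements into two sets, those closest to the first pivot and those closest to the second, and recursively build two subtrees from these sets.

Spatial Approximation Tree

Initially all elements are in a bag. Choose an arbitrary element to be the pivot. Build a collection of nearest neighbors within range of the pivot. Put each remaining element into the bag of the element nearest to it from the collection just built. Recursively form a subtree from each element of this collection.

Vantage Point Tree

Choose a pivot from the set arbitrarily. Calculate the median distance between this pivot and each element of the remaining set. Filter the elements of the set into left and right recursive subtrees such that those with distances less than or equal to the median form the left and those greater form the right.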
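A rough Java sketch of the vantage point tree construction just described, plus a minimum-distance query, might look like this. Everything here (VPTree, build, nearest, taking the first element as the "arbitrary" pivot) is an illustrative assumption rather than a reference implementation. The query again prunes with the triangle inequality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VPTree {
    final String pivot;
    final int median;   // median distance from the pivot to the other elements
    final VPTree left;  // elements with distance <= median
    final VPTree right; // elements with distance > median

    private VPTree(String pivot, int median, VPTree left, VPTree right) {
        this.pivot = pivot;
        this.median = median;
        this.left = left;
        this.right = right;
    }

    static VPTree build(List<String> items) {
        if (items.isEmpty()) {
            return null;
        }
        String pivot = items.get(0); // "arbitrary" choice: simply the first element
        List<String> rest = items.subList(1, items.size());
        if (rest.isEmpty()) {
            return new VPTree(pivot, 0, null, null);
        }
        int[] dists = new int[rest.size()];
        for (int i = 0; i < dists.length; i++) {
            dists[i] = distance(pivot, rest.get(i));
        }
        int[] sorted = dists.clone();
        Arrays.sort(sorted);
        int median = sorted[(sorted.length - 1) / 2];
        List<String> near = new ArrayList<>();
        List<String> far = new ArrayList<>();
        for (int i = 0; i < dists.length; i++) {
            (dists[i] <= median ? near : far).add(rest.get(i));
        }
        return new VPTree(pivot, median, build(near), build(far));
    }

    // Minimum distance from query to any element; call with best = Integer.MAX_VALUE.
    static int nearest(VPTree node, String query, int best) {
        if (node == null) {
            return best;
        }
        int d = distance(query, node.pivot);
        best = Math.min(best, d);
        // Left elements x satisfy d(pivot, x) <= median, so d(query, x) >= d - median.
        if (d - node.median < best) {
            best = nearest(node.left, query, best);
        }
        // Right elements x satisfy d(pivot, x) > median, so d(query, x) > median - d.
        if (node.median - d < best) {
            best = nearest(node.right, query, best);
        }
        return best;
    }

    // Standard two-row Levenshtein distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Usage would be something like VPTree.nearest(VPTree.build(words), "arc", Integer.MAX_VALUE).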

score 9 · Accepted Answer

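I've implemented the algorithm described in the "Fast and Easy Levenshtein distance using a Trie" article in C++, and it is really fast. If you want (you understand C++ better than Python), I can paste the code somewhere.

Edit: I posted it on my blog.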

score 3 · Accepted Answer

Here is an example of Levenshtein Automata in Java (Edit: moved to github). These will probably also be helpful:

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/

Edit: the links above appear to have moved to github:

https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene/util/automaton
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/test/org/apache/lucene/util/automaton

It looks like the experimental Lucene code is based on the dk.brics.automaton package.

Usage appears to be something like the following:

LevenshteinAutomata builder = new LevenshteinAutomata(s);
Automaton automata = builder.toAutomaton(n);
boolean result1 = BasicOperations.run(automata, "foo");
boolean result2 = BasicOperations.run(automata, "bar");

score 2 · Accepted Answer

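In many ways, Steve Hanov's algorithm (presented in the first article linked in the question, Fast and Easy Levenshtein distance using a Trie), the ports of the algorithm made by Murilo and you (OP), and quite possibly every pertinent algorithm involving a trie or similar structure, function much like a Levenshtein Automaton (which has been mentioned several times here) does: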

Given:
       dict is a dictionary represented as a DFA (ex. trie or dawg)
       dictState is a state in dict
       dictStartState is the start state in dict
       dictAcceptState is a dictState arrived at after following the transitions defined by a word in dict
       editDistance is an edit distance
       laWord is a word
       la is a Levenshtein Automaton defined for laWord and editDistance
       laState is a state in la
       laStartState is the start state in la
       laAcceptState is a laState arrived at after following the transitions defined by a word that is within editDistance of laWord
       charSequence is a sequence of chars
       traversalDataStack is a stack of (dictState, laState, charSequence) tuples

Define dictState as dictStartState
Define laState as laStartState
Push (dictState, laState, "") on to traversalDataStack
While traversalDataStack is not empty
    Define currentTraversalDataTuple as the product of a pop of traversalDataStack
    Define currentDictState as the dictState in currentTraversalDataTuple
    Define currentLAState as the laState in currentTraversalDataTuple
    Define currentCharSequence as the charSequence in currentTraversalDataTuple
    For each char in alphabet
        Check if currentDictState has outgoing transition labeled by char
        Check if currentLAState has outgoing transition labeled by char
        If both currentDictState and currentLAState have outgoing transitions labeled by char
            Define newDictState as the state arrived at after following the outgoing transition of currentDictState labeled by char
            Define newLAState as the state arrived at after following the outgoing transition of currentLAState labeled by char
            Define newCharSequence as concatenation of currentCharSequence and char
            Push (newDictState, newLAState, newCharSequence) on to traversalDataStack
            If newDictState is a dictAcceptState, and if newLAState is a laAcceptState
                Add newCharSequence to resultSet
            endIf
        endIf
    endFor
endWhile
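The traversal above can be sketched in Java as follows. To keep it short, this sketch does not build a formal Levenshtein Automaton: as a stand-in (my assumption), the "laState" is a row of the distance matrix, a "transition" computes the next row, and a state with no entry within editDistance is treated as having no outgoing transition. The names DictTraversal, Node, Frame, step, and alive are hypothetical. The dictionary is a plain trie.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DictTraversal {
    static class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accept; // true if a dictionary word ends here
    }

    // One stack entry: (dictState, laState, charSequence).
    static class Frame {
        final Node dictState;
        final int[] laState;
        final String chars;

        Frame(Node dictState, int[] laState, String chars) {
            this.dictState = dictState;
            this.laState = laState;
            this.chars = chars;
        }
    }

    static void insert(Node root, String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.accept = true;
    }

    // Stand-in for an automaton transition: the next distance-matrix row.
    static int[] step(int[] row, String laWord, char c) {
        int[] out = new int[row.length];
        out[0] = row[0] + 1;
        for (int i = 1; i < row.length; i++) {
            int replace = row[i - 1] + (laWord.charAt(i - 1) == c ? 0 : 1);
            out[i] = Math.min(replace, Math.min(out[i - 1] + 1, row[i] + 1));
        }
        return out;
    }

    // A state is dead once no entry can still end within the allowed distance.
    static boolean alive(int[] row, int editDistance) {
        for (int v : row) {
            if (v <= editDistance) {
                return true;
            }
        }
        return false;
    }

    static List<String> search(Node root, String laWord, int editDistance) {
        List<String> resultSet = new ArrayList<>();
        int[] start = new int[laWord.length() + 1];
        for (int i = 0; i < start.length; i++) {
            start[i] = i;
        }
        if (root.accept && laWord.length() <= editDistance) {
            resultSet.add(""); // the empty word is in the dictionary and close enough
        }
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, start, ""));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            for (Map.Entry<Character, Node> e : f.dictState.next.entrySet()) {
                int[] newLa = step(f.laState, laWord, e.getKey());
                if (!alive(newLa, editDistance)) {
                    continue; // no matching transition on the automaton side
                }
                String newChars = f.chars + e.getKey();
                stack.push(new Frame(e.getValue(), newLa, newChars));
                if (e.getValue().accept && newLa[newLa.length - 1] <= editDistance) {
                    resultSet.add(newChars);
                }
            }
        }
        return resultSet;
    }
}
```

Iterating over the dictionary node's outgoing edges rather than the whole alphabet is equivalent to the "for each char in alphabet" loop, since only characters present on both sides matter.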

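Steve Hanov's algorithm and its aforementioned derivatives evidently use a Levenshtein distance computation matrix in place of a formal Levenshtein Automaton. That is pretty fast, but a formal Levenshtein Automaton can have its parametric states (abstract states which describe the concrete states of the automaton) generated and used for traversal, bypassing any edit-distance-related runtime computation whatsoever. So it should run even faster than the aforementioned algorithms.

If you (or anybody else) are interested in a formal Levenshtein Automaton solution, have a look at LevenshteinAutomaton. It implements the aforementioned parametric-state-based algorithm, as well as a pure concrete-state-traversal-based algorithm (outlined above) and dynamic-programming-based algorithms (for both edit distance and neighbor determination). It's maintained by yours truly :).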

score 1 · Accepted Answer

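My intuition tells me that each TrieNode should store the String it represents and also references to letters of the alphabet, not necessarily all letters. Is my intuition correct?

No, a trie doesn't represent a String; it represents a set of strings (and all their prefixes). A trie node maps an input character to another trie node. So it should hold something like an array of characters and a corresponding array of TrieNode references. (Maybe not that exact representation, depending on efficiency in your particular use of it.)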
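As a minimal sketch of that idea in Java: each node only maps a character to a child node and carries an end-of-word flag; no node stores a string. This version uses a HashMap instead of parallel arrays, which is just one possible representation, and the names SimpleTrie, contains, and hasPrefix are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    private static class Node {
        // Maps an input character to the next node; only present letters are stored.
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord; // true if the path from the root to here spells a stored word
    }

    private final Node root = new Node();

    void insert(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        current.isWord = true;
    }

    boolean contains(String word) {
        Node n = find(word);
        return n != null && n.isWord;
    }

    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Follows the characters of s from the root; returns null if the path breaks.
    private Node find(String s) {
        Node current = root;
        for (int i = 0; i < s.length(); i++) {
            current = current.children.get(s.charAt(i));
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        trie.insert("foo");
        trie.insert("fun");
        System.out.println(trie.contains("foo") + " " + trie.contains("fo") + " " + trie.hasPrefix("fo"));
    }
}
```

The string a node "represents" is implicit in the path from the root, which is why the set of all prefixes comes for free.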

score 1 · Accepted Answer

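As I see it, you want to loop over all the branches of the trie. That's not that difficult using a recursive function. I'm using a trie as well in my k-nearest neighbor algorithm, using the same kind of function. I don't know Java, but here's some pseudocode: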

function walk (testitem trie)
   make an empty array results
   function compare (testitem children distance)
     if testitem = None
        place the distance and children into results
     else compare(testitem from second position, 
                  the sub-children of the first child in children,
                  if the first item of testitem is equal to that 
                  of the node of the first child of children 
                  add one to the distance (! non-destructive)
                  else just the distance)
        when there are any children left
             compare (testitem, the children without the first item,
                      distance)
    compare(testitem, children of root-node in trie, distance set to 0)
    return the results

Hope it helps.

score 1 · Accepted Answer

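The function walk takes a testitem (for example an indexable string, or an array of characters) and a trie. A trie can be an object with two slots: one specifying the node of the trie, the other the children of that node. The children are tries as well. In Python it would be something like: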

class Trie(object):
    def __init__(self, node=None, children=None):
        self.node = node
        # Avoid a shared mutable default: each Trie gets its own list.
        self.children = children if children is not None else []

Or in Lisp...

(defstruct trie (node nil) (children nil))

Now a trie looks something like this:

(trie #node None
      #children ((trie #node f
                       #children ((trie #node o
                                        #children ((trie #node o
                                                         #children None)))
                                  (trie #node u
                                        #children ((trie #node n
                                                         #children None)))))))

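Now the internal function (which you can also write separately) takes the testitem, the children of the root node of the tree (whose node value is None or whatever), and an initial distance set to 0.

Then we just recursively traverse both branches of the tree, starting left and then right.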

score 1 · Accepted Answer

Just leaving this here in case anyone is looking for yet another treatment of this problem:

http://code.google.com/p/oracleofwoodyallen/wiki/AppearanceStringMatching

score 1 · Accepted Answer

I was looking at your latest Update 3, and the algorithm doesn't seem to work well for me.

Let's say you have the test case below:

    Trie dict = new Trie();
    dict.insert("arb");
    dict.insert("area");

    ArrayList<Character> word = new ArrayList<Character>();
    word.add('a');
    word.add('r');
    word.add('c');

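In this case, the minimum edit distance between "arc" and the dict should be 1, which is the edit distance between "arc" and "arb", but your algorithm returns 2 instead.

I went through the code piece below: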

        if (word.get(i - 1) == letter) {
            replaceCost = previousRow[i - 1];
        } else {
            replaceCost = previousRow[i - 1] + 1;
        }

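At least for the first loop, the letter is one of the characters of the word, but you should be comparing against the nodes in the trie instead, so there will be one row duplicated for the first character of the word, is that right? Each DP matrix has its first row as a duplicate. I executed exactly the same code you put in your solution.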

score 0 · Accepted Answer

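Well, here's how I did it a long time ago. I stored the dictionary as a trie, which is simply a finite-state machine restricted to the form of a tree. You can enhance it by not making that restriction. For example, common suffixes can simply be a shared subtree. You could even have loops, to capture things like "nation", "national", "nationalize", "nationalization", and so on.

Keep the trie as absolutely simple as possible. Don't go stuffing strings in it.

Remember, you don't do this to find the distance between two given strings. You use it to find the strings in the dictionary that are closest to one given string. The time it takes depends on how much Levenshtein distance you can tolerate. For distance zero, it is simply O(n), where n is the word length. For arbitrary distance, it is O(N), where N is the number of words in the dictionary.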

score 0 · Accepted Answer

Correct me if I'm wrong, but I believe your update3 has an extra loop which is unnecessary and makes the program much slower:

for (int i = 0; i < iWordLength; i++) {
    traverseTrie(theTrie.root, word.get(i), word, currentRow);
}

You ought to call traverseTrie only once, because within traverseTrie you are already looping over the whole word. The code should be only as follows:

traverseTrie(theTrie.root, ' ', word, currentRow);
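One caveat with passing the root and a placeholder letter: as far as I can tell, traverseTrie still computes a matrix row for that placeholder, which behaves as if every dictionary word started with an extra character. A sketch of an alternative, under my own assumptions, is to start the recursion once per child of the root, so each row corresponds to a real trie edge. The class below (MinDistanceTrie, walk, best) is a self-contained illustration, not the poster's code.

```java
import java.util.HashMap;
import java.util.Map;

class MinDistanceTrie {
    private static class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private int best; // smallest distance found during the current search

    void insert(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.computeIfAbsent(word.charAt(i), k -> new Node());
        }
        current.isWord = true;
    }

    // Minimum Levenshtein distance between word and any stored word
    // (Integer.MAX_VALUE if the trie is empty).
    int minDistance(String word) {
        int[] firstRow = new int[word.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        // The root spells the empty string; it only counts if "" was inserted.
        best = root.isWord ? word.length() : Integer.MAX_VALUE;
        // Start once per root child: the root itself contributes no letter.
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), word, firstRow);
        }
        return best;
    }

    private void walk(Node node, char letter, String word, int[] previousRow) {
        int size = previousRow.length;
        int[] currentRow = new int[size];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];
        for (int i = 1; i < size; i++) {
            int insertCost = currentRow[i - 1] + 1;
            int deleteCost = previousRow[i] + 1;
            int replaceCost = previousRow[i - 1] + (word.charAt(i - 1) == letter ? 0 : 1);
            currentRow[i] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[i]);
        }
        if (node.isWord && currentRow[size - 1] < best) {
            best = currentRow[size - 1];
        }
        // Row minima never decrease further down, so stop if we cannot improve.
        if (rowMin < best) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), word, currentRow);
            }
        }
    }

    public static void main(String[] args) {
        MinDistanceTrie dict = new MinDistanceTrie();
        dict.insert("arb");
        dict.insert("area");
        System.out.println(dict.minDistance("arc"));
    }
}
```

With the "arb"/"area" dictionary from the earlier answer, this should report a distance of 1 for "arc".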
