java - Optimizing the Jaro-Winkler algorithm

Question

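I have this code for the Jaro-Winkler algorithm, taken from this website. I need to run it 150,000 times to get the distances between the differences. It takes a long time because I am running it on an Android mobile device.

Can it be optimized further?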

public class Jaro {
    /**
     * gets the similarity of the two strings using Jaro distance.
     *
     * @param string1 the first input string
     * @param string2 the second input string
     * @return a value between 0-1 of the similarity
     */
    public float getSimilarity(final String string1, final String string2) {

        //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
        final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);

        //get common characters
        final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
        final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);

        //check for zero in common
        if (common1.length() == 0 || common2.length() == 0) {
            return 0.0f;
        }

        //if the two common strings differ in length, return 0.0f
        if (common1.length() != common2.length()) {
            return 0.0f;
        }

        //get the number of transpositions
        int transpositions = 0;
        int n=common1.length();
        for (int i = 0; i < n; i++) {
            if (common1.charAt(i) != common2.charAt(i))
                transpositions++;
        }
        transpositions /= 2.0f;

        //calculate jaro metric
        return (common1.length() / ((float) string1.length()) +
                common2.length() / ((float) string2.length()) +
                (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
    }

    /**
     * returns a string buffer of characters from string1 within string2 if they are of a given
     * distance separation from the position in string1.
     *
     * @param string1 the string whose characters are searched for
     * @param string2 the string that is searched
     * @param distanceSep the maximum separation allowed between matching positions
     * @return a string buffer of characters from string1 within string2 if they are of a given
     *         distance separation from the position in string1
     */
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
        //create a return buffer of characters
        final StringBuffer returnCommons = new StringBuffer();
        //create a copy of string2 for processing
        final StringBuffer copy = new StringBuffer(string2);
        //iterate over string1
        int n=string1.length();
        int m=string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            //set boolean for quick loop exit if found
            boolean foundIt = false;
            //compare char with range of characters to either side

            for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                //check if found
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    //append character found
                    returnCommons.append(ch);
                    //alter copied string2 for processing
                    copy.setCharAt(j, (char)0);
                }
            }
        }
        return returnCommons;
    }
}

I only create one instance of the class for the whole process, so it is constructed just once.

jaro= new Jaro();

If you want to test that you have not broken the code and need examples, they are here, in another thread about optimizing it in Python.

score 7 · Accepted Answer

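Yes, but you are not going to enjoy it. Replace all those newed StringBuffers with char arrays that are allocated in the constructor and never again, and use integer indices to keep track of what is in them.

This pending Commons-Lang patch will give you some of the flavor.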
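To make the idea concrete, here is a minimal sketch of the question's method rewritten around preallocated char arrays. The class name `BufferedJaro` and the `maxLength` constructor parameter are made up for illustration and are not taken from the Commons-Lang patch. The inner loop uses `m` rather than `m - 1` as its bound, as another answer in this thread suggests.

```java
/** Hypothetical sketch: Jaro similarity with scratch buffers allocated once. */
final class BufferedJaro {
    private final char[] common1; // characters of string1 found in string2
    private final char[] common2; // characters of string2 found in string1
    private final char[] copy;    // working copy of the string being searched

    /** maxLength is an assumed upper bound on the input length. */
    BufferedJaro(int maxLength) {
        common1 = new char[maxLength];
        common2 = new char[maxLength];
        copy = new char[maxLength];
    }

    float getSimilarity(String string1, String string2) {
        int n = string1.length();
        int m = string2.length();
        if (n > copy.length || m > copy.length) {
            throw new IllegalArgumentException("input longer than the preallocated buffers");
        }
        int shorter = Math.min(n, m);
        int halflen = shorter / 2 + shorter % 2;

        // The returned ints are the "indices" that track how full each buffer is.
        int count1 = fillCommon(string1, string2, halflen, common1);
        int count2 = fillCommon(string2, string1, halflen, common2);
        if (count1 == 0 || count1 != count2) {
            return 0.0f;
        }

        int transpositions = 0;
        for (int i = 0; i < count1; i++) {
            if (common1[i] != common2[i]) {
                transpositions++;
            }
        }
        transpositions /= 2;

        return (count1 / (float) n
                + count2 / (float) m
                + (count1 - transpositions) / (float) count1) / 3.0f;
    }

    /** Writes the common characters into out and returns how many were written. */
    private int fillCommon(String a, String b, int distanceSep, char[] out) {
        int n = a.length();
        int m = b.length();
        b.getChars(0, m, copy, 0); // reuse the scratch array instead of new StringBuffer(b)
        int count = 0;
        for (int i = 0; i < n; i++) {
            char ch = a.charAt(i);
            int end = Math.min(i + distanceSep, m);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy[j] == ch) {
                    out[count++] = ch;
                    copy[j] = (char) 0; // mark as consumed, as the original code does
                    break;
                }
            }
        }
        return count;
    }
}
```

One instance can then be reused for all 150,000 comparisons, e.g. `new BufferedJaro(64).getSimilarity("MARTHA", "MARHTA")`. Because the buffers are shared, an instance like this is not safe to call from several threads at once.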

score 4 · Accepted Answer

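I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When you compare a string against itself, the answer turns out to be 1/|string| off. When you compare slightly different values, the result also comes out lower than it should.

The solution is to change "m-1" to "m" in the inner for statement of the getCommonCharacters method. The code then works like a charm :)

See also http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance for some examples.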
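A small check makes the effect of the bound visible. This sketch isolates the matching loop from the question and only counts the common characters; the class name `BoundCheck` and the `fixed` flag are made up for the demonstration.

```java
/** Hypothetical demo of the loop bound in getCommonCharacters. */
final class BoundCheck {
    /** Counts characters of a found in b; fixed selects "m" instead of "m - 1". */
    static int countCommon(String a, String b, int distanceSep, boolean fixed) {
        char[] copy = b.toCharArray();
        int m = b.length();
        int limit = fixed ? m : m - 1;
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            int end = Math.min(i + distanceSep, limit);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy[j] == a.charAt(i)) {
                    copy[j] = (char) 0; // consume the matched position
                    count++;
                    break;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "hello";
        int halflen = s.length() / 2 + s.length() % 2;
        System.out.println(countCommon(s, s, halflen, false)); // prints 4
        System.out.println(countCommon(s, s, halflen, true));  // prints 5
    }
}
```

With the original bound the last character of the second string is never examined, so a string compared with itself loses one match.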

score 0 · Accepted Answer

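Try to avoid the two nested loops in the getCommonCharacters loop.
A suggestion on how: store all the characters of the smaller string in a map of some kind (Java has several), where the key is the character and the value is its position. That way you can still calculate the distance and tell whether the characters are in common. I do not understand the algorithm that well, but I think this is doable.
Apart from that and bmargulies's answer, I do not really see any further optimizations beyond things like bit tricks. If this is really critical, consider rewriting this part in C.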
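One way the map suggestion could look is sketched below. It is an interpretation, not something the answer spells out: the class name `PositionIndex` is made up, each character maps to a list of positions (a character can occur more than once), and the window matches the question's code with the bound corrected to `m`. For short strings the boxing and hashing may well cost more than the nested loop saves, so it is worth measuring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: look up candidate positions by character. */
final class PositionIndex {
    /** Maps each character of s to the ascending list of positions where it occurs. */
    static Map<Character, List<Integer>> index(String s) {
        Map<Character, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            positions.computeIfAbsent(s.charAt(i), k -> new ArrayList<>()).add(i);
        }
        return positions;
    }

    /** Counts characters of a with an unused match in b inside [i - sep, i + sep). */
    static int countCommon(String a, String b, int sep) {
        Map<Character, List<Integer>> idx = index(b);
        boolean[] used = new boolean[b.length()];
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            List<Integer> candidates = idx.get(a.charAt(i));
            if (candidates == null) {
                continue; // character does not occur in b at all
            }
            for (int j : candidates) {
                if (j >= i + sep) {
                    break; // positions are ascending, so the rest are out of range too
                }
                if (j >= i - sep && !used[j]) {
                    used[j] = true;
                    count++;
                    break;
                }
            }
        }
        return count;
    }
}
```

If the same string is compared against many others, the index for it can be built once and reused.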

score 0 · Accepted Answer

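I do not know much about Android or how it works with databases. WP7 has SQL CE (:)). The next step is usually to work on your data. Add the string length and use it to limit your comparisons. Add indexes on both columns, sorted by length and then by value. The index on the length should be sorted as well. I had this running on an old server with 150 000 medical terms, giving suggestions and spell checking in under 0.5 seconds. Users barely notice it, especially if it runs on a separate thread.

I had meant to blog about this for a long time (like 2 years :)) because there is a need for it. I finally managed to write a few words about it and offer some tips. Check it out here: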

ISolvable.blogspot.com

It is for the Microsoft platform, but the general principles are still the same.
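The answer does not say exactly how the length is used to limit comparisons. One possible reading, sketched here in Java with made-up names (`LengthFilter`, `threshold`), relies on the fact that the number of matches can never exceed the shorter length, so the Jaro score is at most (2 + shorter/longer) / 3. Pairs whose bound falls below the score you care about can be skipped without running the algorithm.

```java
/** Hypothetical sketch: skip pairs whose lengths already rule out a high score. */
final class LengthFilter {
    /** Upper bound on the Jaro similarity of two strings with these lengths. */
    static float upperBound(int len1, int len2) {
        if (len1 == 0 || len2 == 0) {
            return 0.0f;
        }
        float ratio = Math.min(len1, len2) / (float) Math.max(len1, len2);
        // matches <= shorter length and transpositions >= 0 give this bound
        return (2.0f + ratio) / 3.0f;
    }

    /** threshold is an assumed minimum score below which results are not interesting. */
    static boolean worthComparing(String a, String b, float threshold) {
        return upperBound(a.length(), b.length()) >= threshold;
    }
}
```

Sorting or indexing the candidates by length, as described above, lets such a check discard whole ranges of candidates at once.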

score 0 · Accepted Answer

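Yes, this can be made much faster. For one thing, you do not need the StringBuffers at all. For another, you do not need a separate loop to count transpositions.

You can find my implementation here, and it should be much faster. It is under the Apache 2.0 license.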

score 0 · Accepted Answer

Instead of returning the common characters from the GetCommonCharacters method, use a couple of arrays to keep track of the matches, similar to the C version here: https://github.com/miguelvps/c/blob/master/jarowinkler.c

/*Calculate matching characters*/
for (i = 0; i < al; i++) {
    for (j = max(i - range, 0), l = min(i + range + 1, sl); j < l; j++) {
        if (a[i] == s[j] && !sflags[j]) {
            sflags[j] = 1;
            aflags[i] = 1;
            m++;
            break;
        }
    }
}
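A Java rendering of the same idea might look like the sketch below. It is not a port of the linked file: the class name `FlagJaro` is made up, and since the snippet above does not show how `range` is computed, the usual Jaro window of max(length) / 2 - 1 is assumed.

```java
/** Hypothetical sketch: Jaro similarity using match flags instead of buffers. */
final class FlagJaro {
    static double jaro(String a, String s) {
        int al = a.length();
        int sl = s.length();
        if (al == 0 || sl == 0) {
            return 0.0;
        }
        // Assumed window size; the C snippet does not show this computation.
        int range = Math.max(0, Math.max(al, sl) / 2 - 1);
        boolean[] aflags = new boolean[al];
        boolean[] sflags = new boolean[sl];

        // Calculate matching characters, flagging both positions of each match.
        int m = 0;
        for (int i = 0; i < al; i++) {
            int l = Math.min(i + range + 1, sl);
            for (int j = Math.max(i - range, 0); j < l; j++) {
                if (!sflags[j] && a.charAt(i) == s.charAt(j)) {
                    sflags[j] = true;
                    aflags[i] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) {
            return 0.0;
        }

        // Walk the flagged positions of both strings in order to count transpositions.
        int t = 0;
        for (int i = 0, j = 0; i < al; i++) {
            if (!aflags[i]) {
                continue;
            }
            while (!sflags[j]) {
                j++;
            }
            if (a.charAt(i) != s.charAt(j)) {
                t++;
            }
            j++;
        }
        t /= 2;

        return ((double) m / al + (double) m / sl + (double) (m - t) / m) / 3.0;
    }
}
```

The two flag arrays could also be allocated once and cleared between calls if allocation turns out to matter on the device.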

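Another optimization is to pre-compute a bitmask for each string. Use it to check whether the current character of the first string is present in the second string at all. This can be done with efficient bitwise operations.

This skips calculating the max/min and looping for characters that are missing.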
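How the mask is built is not specified above. One assumed scheme, sketched here with the made-up class name `CharMask`, folds every character onto one of the 64 bits of a long. Different characters can share a bit, so the mask can only prove that a character is absent; a set bit still requires the normal window search.

```java
/** Hypothetical sketch: 64-bit presence mask for the characters of a string. */
final class CharMask {
    /** Builds the mask; each char is folded onto bit (c & 63). */
    static long mask(String s) {
        long bits = 0L;
        for (int i = 0; i < s.length(); i++) {
            bits |= 1L << (s.charAt(i) & 63);
        }
        return bits;
    }

    /** False means c is definitely absent; true means it may be present. */
    static boolean mayContain(long mask, char c) {
        return (mask & (1L << (c & 63))) != 0L;
    }
}
```

In the matching loop, a check such as `if (!CharMask.mayContain(maskOfSecond, ch)) continue;` placed before the inner loop would skip the window computation for characters that cannot match. The mask of each string only needs to be computed once if the strings are compared repeatedly.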
