algorithm - 「絶対」文字列メトリック

Question

自然言語文字列の膨大な (ただし有限の) セットがあります。

各文字列を数値に変換する方法が必要です。どの文字列でも、値は毎回同じでなければなりません。

指定された 2 つの文字列が「異なる」ほど、対応する 2 つの値も異なるはずです。それらが「似ている」ほど、値の違いは少なくなります。

必要な文字列間の違いの正確な定義はまだわかりません。とにかく自然言語の解析はありません。おそらく、Levenstein のようなものである必要があります (ただし、Levenstein は相対的であり、絶対メトリックが必要です)。簡単なことから始めましょう。

寸法の更新

単一の数値ではなく、多次元 (3 次元が最適) のベクトルで解決できれば幸いです。

期待される結果の正確さに関する更新

こことここで正しく指摘されているように、ある文字列から別の文字列までの距離はMAX(firstStringLength, secondStringLength)次元を持つベクトルです。一般に、情報の損失なしに次元数を減らすことはできません。

ただし、絶対的な解決策は必要ありません。N 次元の文字列空間から 3D 空間への「十分な」変換で解決します。

また、有限の長さの有限数の文字列があることにも注意してください。(ただし、文字列の数はかなり多く、約 8000 万 (10 GB) であるため、シングルパスのステートレスアルゴリズムを選択することをお勧めします。)

参考文献をスキャンしたところ、ヒルベルトの空間充填曲線がここで役立つ可能性があるという印象を受けました。Hilbert Space-Filling Curve 記事のクラスタリングプロパティの分析のように見えますが、私の問題に近いものについて説明しています...

ヒルベルト曲線アプローチの更新

各文字列を N 次元空間のポイントにマップします。ここで、N はセット内の文字列の最大長です。ところで、文字列の i 番目の文字コードを i 番目の座標値として使用できますか?
その N 次元空間を通るヒルベルト曲線をプロットします。
文字列ごとに、文字列の座標に最も近い曲線上のポイントを取得します。その点のヒルベルト値 (曲線の始点からの長さ) が、私が求める 1 次元の値です。
3D 値が必要な場合は、ヒルベルト曲線を 3D でプロットし、上で計算したヒルベルト値に一致する点を選択します。

これは正しく見えますか？ここでの計算費用はいくらになるでしょうか。

score 5 · Accepted Answer

I don't think this is possible to do. Start with a simple string, and assign it zero (it doesn't really matter what the number is)

"Hello World" = 0

The following strings are at distance 2 from it:

"XXllo World" = a
"HeXXo World" = b
"Hello XXrld" = c
"Hello WorXX" = d

Yet, each of these strings is 4 from each other. There is no way to sort the numbers to make it work, for the following instance:

a = 1, b = -1, c = 2, d = -2

Consider that c to 0 is 2, yet c to a is 1, yet 0 is closer than a.

And this is just a simple case.

score 3 · Accepted Answer

ここで、根本的な問題と解決策を示したいと思います。

問題:完全な解決策を得ることは不可能であるため、「十分な」解決策を検索することは正しいです (これは情報理論で示すことができますが、より読みやすいので幾何学を使用します)。N 次元空間があるため、情報を失うことなく距離メトリックを投影することはできません。

distance projected onto X: (x,y,z).(1,0,0) = x

ただし、複数の次元を考慮したベクトルを使用することはできますが、同じ距離を持つ要素が遠く離れてしまいます。

(30,0,0).(1/3,1/3,1/3) = (0,30,0).(1/3,1/3,1/3) = (0,0,30).(1/3,1/3,1/3) = 10

それでは、解決策について説明します。期待できる最善の方法は、主成分分析を使用してクラスタリングし、文字列が最も異なる 3 つの次元を見つけることです。これも、使用する距離メトリックのコンポーネントに依存し、自明ではありません (つまり、この投稿をこれ以上長くしたくありません)。

簡単な解決策として、以下に説明する 3 つの文字列からのレーベンスタイン距離を使用して、頭の中で PCA をすばやく試行することをお勧めします。

"acegikmoqsuwy" //use half your permitted symbols then repeat until you have a string of size equal to your longest string.
"bdfhjlnprtv" //use the other half then repeat as above.
"" //The empty string, this will just give you the length of the string, so a cheap one.

また、深く掘り下げたい場合は、メトリック/距離に関するこれが役立つ場合があります: http://www.springer.com/mathematics/geometry/book/978-3-642-00233-5

およびレーベンスタイン距離のデモ: http://www.merriampark.com/ld.htm

score 3 · Accepted Answer

I think you are going to have to specify your problem more clearly, what exactly are you trying to achieve with this metric?

I say this, because Levenstein works since it maps pairs of strings to a metric, which can preserve the dimensionality of the string space. What happens if you try and map strings to numbers is that there is a large loss of dimensional information. For example, say I have the string "cat", I'd want "bat", "hat", "rat", "can", "cot" etc. to all be reasonably close to this. With a large number of words, the result is that you end up with dissimilar words being close relatively often e.g. "bat" and "cot" may be close, because they both happen to be similar distances from "cat" on the positive side. This is similar to the problem of what happens when you try and map the plane to a line, it is difficult to meet the restriction that points far away in the plane stay far away on the line. So, the upshot of this is that the 'The more "different" two given strings are, the more different two corresponding values should be' requirement is difficult.

So, my first suggestion is, do you really need something that does this, will a simple hash-code suffice to give you unique values, or perhaps you can use Levenstein after all and ignore the values for individual strings? If none of those suffice, perhaps you can use a multidimensional function value, that is map strings into pairs, triples or another small tuple of numbers. The extra dimensionality granted that way will give you far better results.

An example might be encoding the string as a triple: length, sum of values of letters in string, alternating sum of values of letters e.g. f("cat") = (3, 3 + 1 + 20, 3 - 1 + 20) = (3, 24, 22). This would have some of the properties you desire, but is probably not optimal. Try looking for orthogonal features of the string to do this encoding, or even better, if you have a large test set of strings there are existing libraries for mapping this sort of data into low dimensions while preserving metrics (e.g. the Levenstein metric) and you can train your function on that. I remember the S language had support for this sort of thing.

score 2 · Accepted Answer

FryGuy の回答を拡大して、固定数の次元で機能しない理由を説明したいと思います。aaaaaaaaaaとbaaaaaaaaa、abaaaaaaaa、 ... 、を取りましょうaaaaaaaaab。この例では、文字列の長さは 10 ですが、任意の長さにすることができます。10b個aaaaaaaaaaの文字列のそれぞれの距離は 1 であり、それらの文字列間の距離は 2 です。一般に、2 文字のアルファベットで長さ N の固定文字列を取る場合、それらの距離グラフは N 次元の超立方体になります。

文字列の長さが制限されていない限り、それを固定数の次元にマップする方法はありません。

score 1 · Accepted Answer

また、最大文字列長を制限する必要があるという問題を抱えて、潜在意味解析とベクトル空間モデルを調べることもできます。

寸法は、アルファベットの要素と文字列内の位置の積です。アルファベット（ "a"、 "b"、 "c"、 "t"）と最大長が3の場合、寸法は（a：1、b：1、c：1、t：1、..。、a：3、b：3、c：3、t：3）

例として、"cat"は（0、0、1、0、1、0、0、0、0、0、0、1）になります。

これはもちろん巨大なデータセットですが、次元削減手法（SVDなど）を使用して次元数を削減できます。言葉には繰り返しパターンがたくさんあるので、これはうまくいくはずです。ニーズに合わせて出力ディメンションの数を微調整できます。

2つの単語間の類似性は、単語ベクトル間のコサイン類似性によって計算できます。また、SVDの変換ベクトルを保存して、以前は表示されていなかった単語の縮小ベクトルを取得することもできます。

score 1 · Accepted Answer

空の文字列からの編集距離を測定しますが、各編集を値「1」として扱う代わりに、使用頻度でソートされたアルファベットで追加/削除される文字のインデックスを与えます (etaoinshrdlu...)。アルゴリズムで置換を挿入と削除のペアとしてではなく、置換として検出できる場合は、文字インデックスの違い。

score 0 · Accepted Answer

「相対距離」の問題を克服するには、測定する固定点を取得するだけです。

レーベンスタイン距離を使用することもできますが、固定の "Origin" 文字列から取得します。たとえば、元の文字列としてすべてのスペースの任意の長さの文字列を使用できます。

とにかく、最初に既知の文字列の小さなサブセットでテストして、値が期待どおりに反映されているかどうかを確認します.

score -1 · Accepted Answer

これは、質問に対する「頭から離れた」答えです。

基本的に、これは距離文 2 が文 1 と異なる距離を計算します。距離は、2 つの文の単語間の最小レーベンシュタイン差の合計です。2 つの等しい文が 0 の距離を与えるという性質があります。

このアプローチが他の場所で公開されている場合、私はそれを知りません。

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string str1 = "The cat sat on the mat";
            string str2 = "The quick brown fox jumped over the lazy cow";
            ReportDifference(str1, str1);
            ReportDifference(str2, str2);
            ReportDifference(str1, str2);
            ReportDifference(str2, str1);
        }
        /// <summary>
        /// Quick test andisplay routine
        /// </summary>
        /// <param name="str1">First sentence to test with</param>
        /// <param name="str2">Second sentence to test with</param>
        static void ReportDifference(string str1, string str2)
        {
            Debug.WriteLine(
                String.Format("difference between \"{0}\" and \"{1}\" is {2}", 
                str1, str2, Difference(str1, str2))); 
        }
        /// <summary>
        /// This does the hard work.
        /// Basically, what it does is:
        /// 1) Split the stings into tokens/words
        /// 2) Form a cartesian product of the 2 lists of words. 
        /// 3) Calculate the Levenshtein Distance between each word.
        /// 4) Group on the words from the first sentance
        /// 5) Get the min distance between the word in first sentence and all of the words from the second
        /// 6) Square the distances for each word. 
        ///     (based on the distance betwen 2 points is the sqrt of the sum of the x,y,... axises distances
        ///     what this assumes is the first word is the origin)
        /// 7) take the square root of sum
        /// </summary>
        /// <param name="str1">sentence 1 compare</param>
        /// <param name="str2">sentence 2 compare</param>
        /// <returns>distance calculated</returns>
        static double Difference(string str1, string str2)
        {
            string[] splitters = { " " };

            var a = Math.Sqrt(
                (from x in str1.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     from y in str2.Split(splitters, StringSplitOptions.RemoveEmptyEntries)
                     select new {x, y, ld = Distance.LD(x,y)} )
                    .GroupBy(x => x.x)
                    .Select(q => new { q.Key, min_match = q.Min(p => p.ld) })
                    .Sum(s =>  (double)(s.min_match * s.min_match )));
            return a;
        }
    }

    /// <summary>
    /// Lifted from http://www.merriampark.com/ldcsharp.htm
    /// </summary>
    public class Distance
    {

        /// <summary>
        /// Compute Levenshtein distance
        /// </summary>
        /// <param name="s">String 1</param>
        /// <param name="t">String 2</param>
        /// <returns>Distance between the two strings.
        /// The larger the number, the bigger the difference.
        /// </returns>
        public static int LD(string s, string t)
        {
            int n = s.Length; //length of s
            int m = t.Length; //length of t
            int[,] d = new int[n + 1, m + 1]; // matrix
            int cost; // cost
            // Step 1
            if (n == 0) return m;
            if (m == 0) return n;
            // Step 2
            for (int i = 0; i <= n; d[i, 0] = i++) ;
            for (int j = 0; j <= m; d[0, j] = j++) ;
            // Step 3
            for (int i = 1; i <= n; i++)
            {
                //Step 4
                for (int j = 1; j <= m; j++)
                {
                    // Step 5
                    cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
                    // Step 6
                    d[i, j] = System.Math.Min(System.Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                              d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    }
}

algorithm - 「絶対」文字列メトリック

寸法の更新

期待される結果の正確さに関する更新

ヒルベルト曲線アプローチの更新

8 に答える 8

Related

Reference