java - Visualizing KMeans clustering

Question

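I am running KMeans clustering on a 12-dimensional matrix, and I get K sets of clusters as the result. I would like to plot the result on a 2D graph, but I don't know how to convert the 12-dimensional data to two dimensions.

Any suggestions on how to do the conversion, or on an alternative way to visualize the results? I tried Multidimensional Scaling for Java (MDSJ), but it did not work.

The KMeans algorithm I am using is from the Java Machine Learning Library: Clustering basics.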

score 1 · Accepted Answer

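I would do principal component analysis (PCA), probably the simplest of the multidimensional scaling algorithms. (By the way, PCA has nothing to do with KMeans; it is a general method for dimensionality reduction.)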

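I assume the variables are in columns and the observations in rows.
1. Standardize the data: convert the variables to z-scores. That is, subtract the column mean from each cell and divide the result by the column's standard deviation. This gives you zero mean and unit variance. The former is mandatory; the latter is, I would say, good to have. If the variables have unit variance, you can compute the eigenvectors from the covariance matrix; otherwise you have to use the correlation matrix, which standardizes the data implicitly. (See here for an explanation.)
2. Compute the eigenvectors and eigenvalues of the covariance matrix. Sort the eigenvectors by eigenvalue, largest first. (Many libraries already return the eigenvectors sorted by eigenvalue, though possibly in ascending order.)
3. Take the first two columns of the eigenvector matrix, multiply the original matrix (converted to z-scores) by them, and visualize the resulting two-column data.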
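
The three steps above can be written compactly as follows. The symbols are notation introduced here for illustration and are not from the answer: $X$ is the $n \times p$ data matrix (here $p = 12$), $\bar{x}_j$ and $s_j$ are the mean and standard deviation of column $j$, and the $1/(n-1)$ normalization is an assumption (another constant factor scales the eigenvalues but leaves the eigenvectors unchanged).

```latex
\[
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \bar{x}_j}{s_j}
  && \text{(step 1: standardize column } j\text{)} \\
C &= \frac{1}{n-1}\, Z^{\top} Z
  && \text{(covariance matrix of the z-scores)} \\
C\, v_k &= \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p
  && \text{(step 2: sorted eigenpairs)} \\
Y &= Z \,\bigl[\, v_1 \;\; v_2 \,\bigr]
  && \text{(step 3: } n \times 2 \text{ coordinates to plot)}
\end{aligned}
\]
```

The share of the total variance captured by the $k$-th component is $\lambda_k / \sum_{m=1}^{p} \lambda_m$, so the sum of the first two shares indicates how faithful the 2D picture is likely to be.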

With the colt library you can do it as follows; it would be similar with other matrix libraries.

    import cern.colt.matrix.DoubleMatrix1D;
    import cern.colt.matrix.DoubleMatrix2D;
    import cern.colt.matrix.doublealgo.Statistic;
    import cern.colt.matrix.impl.DenseDoubleMatrix2D;
    import cern.colt.matrix.impl.SparseDoubleMatrix2D;
    import cern.colt.matrix.linalg.Algebra;
    import cern.colt.matrix.linalg.EigenvalueDecomposition;
    import hep.aida.bin.DynamicBin1D;

    public class Pca {
        // to show matrix creation, it does not make much sense to calculate PCA on random data
        public static void main(String[] x) {
            double[][] data = {
                {2.0,4.0,1.0,4.0,4.0,1.0,5.0,5.0,5.0,2.0,1.0,4.0}, 
                {2.0,6.0,3.0,1.0,1.0,2.0,6.0,4.0,4.0,4.0,1.0,5.0},
                {3.0,4.0,4.0,4.0,2.0,3.0,5.0,6.0,3.0,1.0,1.0,1.0},
                {3.0,6.0,3.0,3.0,1.0,2.0,4.0,6.0,1.0,2.0,4.0,4.0}, 
                {1.0,6.0,4.0,2.0,2.0,2.0,3.0,4.0,6.0,3.0,4.0,1.0}, 
                {2.0,5.0,5.0,3.0,1.0,1.0,6.0,6.0,3.0,2.0,6.0,1.0}
            };

            DoubleMatrix2D matrix = new DenseDoubleMatrix2D(data);

            DoubleMatrix2D pm = pcaTransform(matrix);

            // print the first two dimensions of the transformed matrix - they capture most of the variance of the original data
            System.out.println(pm.viewPart(0, 0, pm.rows(), 2).toString());
        }

        /** Returns a matrix in the space of principal components, take the first n columns  */
        public static DoubleMatrix2D pcaTransform(DoubleMatrix2D matrix) {
            DoubleMatrix2D zScoresMatrix = toZScores(matrix);
            final DoubleMatrix2D covarianceMatrix = Statistic.covariance(zScoresMatrix);

            // compute eigenvalues and eigenvectors of the covariance matrix (flip needed since it is sorted by ascending).
            final EigenvalueDecomposition decomp = new EigenvalueDecomposition(covarianceMatrix);

            // Columns of Vs are eigenvectors = principal components = base of the new space; ordered by decreasing variance
            final DoubleMatrix2D Vs = decomp.getV().viewColumnFlip(); 

            // eigenvalues: ev(i) / sum(ev) is the percentage of variance captured by i-th column of Vs
            // final DoubleMatrix1D ev = decomp.getRealEigenvalues().viewFlip();

            // project the original matrix to the pca space
            return Algebra.DEFAULT.mult(zScoresMatrix, Vs);
        }


        /**
         * Converts matrix to a matrix of z-scores (by columns)
         */
        public static DoubleMatrix2D toZScores(final DoubleMatrix2D matrix) {
            final DoubleMatrix2D zMatrix = new SparseDoubleMatrix2D(matrix.rows(), matrix.columns());
            for (int c = 0; c < matrix.columns(); c++) {
                final DoubleMatrix1D column = matrix.viewColumn(c);
                final DynamicBin1D bin = Statistic.bin(column);

                if (bin.standardDeviation() == 0) {   // constant column; in real code compare against a small epsilon instead
                    for (int r = 0; r < matrix.rows(); r++) {
                        zMatrix.set(r, c, 0.0);
                    }
                } else {
                    for (int r = 0; r < matrix.rows(); r++) {
                        double zScore = (column.get(r) - bin.mean()) / bin.standardDeviation();
                        zMatrix.set(r, c, zScore);
                    }
                }
            }

            return zMatrix;
        }
    }
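
To get from the two printed columns to an actual plot, each projected point has to be paired with the cluster it was assigned to. The following is only a minimal sketch under stated assumptions: `ClusterCsv` and `toCsv` are hypothetical names, not part of colt or the Java Machine Learning Library, and the cluster index of each row is assumed to be available as an `int[]` in the same row order as the data matrix. It produces CSV text that a spreadsheet or plotting tool can turn into a scatter plot colored by cluster.

```java
import java.util.Locale;

/** Hypothetical helper: pairs 2-D PCA coordinates with cluster labels as CSV text. */
final class ClusterCsv {

    /**
     * @param points   one row per observation; only the first two columns are used
     * @param clusters cluster index of each observation, in the same row order (assumed input)
     * @return CSV text with a header line and one line per observation
     */
    static String toCsv(double[][] points, int[] clusters) {
        if (points.length != clusters.length) {
            throw new IllegalArgumentException("need exactly one cluster label per point");
        }
        StringBuilder sb = new StringBuilder("pc1,pc2,cluster\n");
        for (int i = 0; i < points.length; i++) {
            // Locale.ROOT keeps '.' as the decimal separator regardless of the system locale
            sb.append(String.format(Locale.ROOT, "%.4f,%.4f,%d\n",
                    points[i][0], points[i][1], clusters[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates and labels, only to show the call
        double[][] projected = {{1.5, -2.0}, {0.25, 3.0}, {-1.0, 0.5}};
        int[] clusters = {0, 1, 0};
        System.out.print(toCsv(projected, clusters));
    }
}
```

With the `Pca` class above, the `points` argument could come from `pm.viewPart(0, 0, pm.rows(), 2).toArray()`; how you obtain the label of each row depends on how your clustering code reports its result.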

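You can also use weka. I would first load the data into weka and run PCA from the GUI (under attribute selection). You can then see which classes are called with which parameters and do the same from your code. The catch is that you need to convert or wrap your matrix into the data format that weka works with.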

score 0 · Answer

In addition to what the other answers suggest, you should probably also take a look at multidimensional scaling.
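
Multidimensional scaling generally starts from a matrix of pairwise dissimilarities between observations rather than from the raw 12-dimensional rows. As a hedged sketch of that preparation step only, the hypothetical helper below (`PairwiseDistances` is an illustrative name, not part of any library) builds the Euclidean distance matrix; the choice of Euclidean distance is an assumption, and the result would then be handed to whichever MDS implementation you use.

```java
import java.util.Arrays;

/** Hypothetical helper: builds the pairwise Euclidean distance matrix of row vectors. */
final class PairwiseDistances {

    /**
     * @param rows one observation per row, all rows of equal length
     * @return symmetric n x n matrix with zeros on the diagonal
     */
    static double[][] euclidean(double[][] rows) {
        int n = rows.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < rows[i].length; k++) {
                    double diff = rows[i][k] - rows[j][k];
                    sum += diff * diff;
                }
                // Distance is symmetric, so fill both halves at once
                d[i][j] = Math.sqrt(sum);
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Tiny made-up input, only to show the call
        double[][] d = euclidean(new double[][] {{0, 0}, {3, 4}, {6, 8}});
        System.out.println(Arrays.deepToString(d));
    }
}
```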

score 0 · Answer

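A similar question is discussed on CrossValidated. The basic idea is to find a suitable projection that separates these clusters (for example with discproj in R) and then plot the projection of the clusters in the new space.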
