apache-spark - Pyspark と PCA: この PCA の固有ベクトルを抽出するにはどうすればよいですか? 彼らが説明している分散の量をどのように計算できますか?

Question

次のように、pyspark (ライブラリを使用) を使用してSpark DataFramewithモデルの次元を削減しています。PCAspark ml

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
model = pca.fit(data)

ここdataで、はラベルが付けられたSpark DataFrame1 つの列で、3 次元のです。featuresDenseVector

data.take(1)
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

フィッティング後、データを変換します。

transformed = model.transform(data)
transformed.first()
Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

この PCA の固有ベクトルを抽出するにはどうすればよいですか? 彼らが説明している分散の量をどのように計算できますか?

score 31 · Accepted Answer

[更新: Spark 2.2 以降、PCA と SVD の両方が PySpark で利用可能です - JIRA チケットSPARK-6227とSpark ML 2.2 のPCA & PCAModelを参照してください。以下の元の回答は、古い Spark バージョンにも適用できます。]

信じられないことのように思えますが、実際には、PCA 分解からそのような情報を抽出する方法はありません (少なくとも Spark 1.5 の時点では)。しかし、繰り返しますが、多くの同様の「苦情」がありました。たとえば、CrossValidatorModel.

幸いなことに、数か月前に、 AMPLab (バークレー) と Databricks (つまり、Spark の作成者) による「スケーラブルな機械学習」 MOOC に参加し、宿題の一部として完全な PCA パイプラインを「手作業で」実装しました。私は当時から自分の関数を変更しました（安心してください、私は完全な信用を得ました:-)、データフレームを入力として（RDDの代わりに）、あなたのものと同じ形式（つまりDenseVectors、数値機能を含む行）。

estimatedCovarianceまず、次のように中間関数を定義する必要があります。

import numpy as np

def estimateCovariance(df):
    """Compute the covariance matrix for a given dataframe.

    Note:
        The multi-dimensional covariance array should be calculated using outer products.  Don't
        forget to normalize the data by first subtracting the mean.

    Args:
        df:  A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

    Returns:
        np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
            length of the arrays in the input dataframe.
    """
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x:   x[0]).map(lambda x: x-m)  # subtract the mean

    return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()

次に、次のようにメインpca関数を記述できます。

from numpy.linalg import eigh

def pca(df, k=2):
    """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

    Note:
        All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
        each eigenvectors as a column.  This function should also return eigenvectors as columns.

    Args:
        df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
        k (int): The number of principal components to return.

    Returns:
        tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A tuple of (eigenvectors, `RDD` of
        scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
        rows equals the length of the arrays in the input `RDD` and the number of columns equals
        `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
        of length `k`.  Eigenvalues is an array of length d (the number of features).
     """
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvals
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
    # Return the `k` principal components, `k` scores, and all eigenvalues

    return components.T, score, eigVals

テスト

最初に、Spark ML PCAドキュメントのサンプルデータを使用して、既存のメソッドの結果を見てみましょう( all になるように変更しますDenseVectors)。

 from pyspark.ml.feature import *
 from pyspark.mllib.linalg import Vectors
 data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
         (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
         (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
 df = sqlContext.createDataFrame(data,["features"])
 pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
 model = pca_extracted.fit(df)
 model.transform(df).collect()

 [Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
  Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
  Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

次に、私たちの方法で：

 comp, score, eigVals = pca(df)
 score.collect()

 [array([ 1.64857282,  4.0132827 ]),
  array([-4.64510433,  1.11679727]),
  array([-6.42888054,  5.33795143])]

定義した関数でメソッドを使用していないことを強調させてください-は、そうあるべきです。collect()scoreRDD

2 番目の列の符号はすべて、既存の方法で導出されたものとは反対であることに注意してください。しかし、これは問題ではありません: (無料でダウンロードできる) An Introduction to Statistical Learning によると、Hastie & Tibshirani 共著、p. 382

各主成分負荷ベクトルは、符号反転まで一意です。これは、2 つの異なるソフトウェアパッケージが同じ主成分負荷ベクトルを生成することを意味しますが、これらの負荷ベクトルの符号は異なる場合があります。各主成分負荷ベクトルは p 次元空間で方向を指定するため、符号が異なる場合があります。方向が変わらないため、符号を反転しても効果はありません。[...] 同様に、Z の分散は −Z の分散と同じであるため、スコアベクトルは符号反転まで一意です。

最後に、固有値が利用可能になったので、説明された分散のパーセンテージの関数を書くのは簡単です。

 def varianceExplained(df, k=1):
     """Calculate the fraction of variance explained by the top `k` eigenvectors.

     Args:
         df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
         k: The number of principal components to consider.

     Returns:
         float: A number between 0 and 1 representing the percentage of variance explained
             by the top `k` eigenvectors.
     """
     components, scores, eigenvalues = pca(df, k)  
     return sum(eigenvalues[0:k])/sum(eigenvalues)

 
 varianceExplained(df,1)
 # 0.79439325322305299

テストとして、例のデータで説明されている分散が k=5 の場合に 1.0 であるかどうかも確認します (元のデータは 5 次元であるため)。

 varianceExplained(df,5)
 # 1.0

[Spark 1.5.0 & 1.5.1 で開発およびテスト済み]

score 14 · Accepted Answer

編集：

PCAこの解決済みの JIRA チケットSPARK-6227に従って、最後にSVD両方ともpyspark開始spark 2.2.0で利用可能です。

元の答え:

@desertnaut によって与えられた答えは、実際には理論的な観点からは優れていますが、SVD を計算して固有ベクトルを抽出する方法について別のアプローチを提示したかったのです。

from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix

class SVD(JavaModelWrapper):
    """Wrapper around the SVD scala case class"""
    @property
    def U(self):
        """ Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
        u = self.call("U")
        if u is not None:
        return RowMatrix(u)

    @property
    def s(self):
        """Returns a DenseVector with singular values in descending order."""
        return self.call("s")

    @property
    def V(self):
        """ Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
        return self.call("V")

これにより、SVD オブジェクトが定義されます。Java ラッパーを使用して computeSVD メソッドを定義できるようになりました。

def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    """
    Computes the singular value decomposition of the RowMatrix.
    The given row matrix A of dimension (m X n) is decomposed into U * s * V'T where
    * s: DenseVector consisting of square root of the eigenvalues (singular values) in descending order.
    * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
    * v: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
    :param k: number of singular values to keep. We might return less than k if there are numerically zero singular values.
    :param computeU: Whether of not to compute U. If set to be True, then U is computed by A * V * sigma^-1
    :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
    :returns: SVD object
    """
    java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
    return SVD(java_model)

それでは、それを例に適用しましょう:

from pyspark.ml.feature import *
from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = sqlContext.createDataFrame(data,["features"])

pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")

model = pca_extracted.fit(df)
features = model.transform(df) # this create a DataFrame with the regular features and pca_features

# We can now extract the pca_features to prepare our RowMatrix.
pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)

# Once the RowMatrix is ready we can compute our Singular Value Decomposition
svd = computeSVD(mat,2,True)
svd.s
# DenseVector([9.491, 4.6253])
svd.U.rows.collect()
# [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
svd.V
# DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)

score 3 · Accepted Answer

あなたの質問に対する最も簡単な答えは、単位行列をモデルに入力することです。

identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),(Vectors.dense([.0, 1.0, .0, .0, .0]),), \
              (Vectors.dense([.0, 0.0, 1.0, .0, .0]),),(Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
              (Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
df_identity = sqlContext.createDataFrame(identity_input,["features"])
identity_features = model.transform(df_identity)

これにより、主成分が得られます。

Desertnaut は Spark のアクションの代わりに numpy の関数を使用して問題を解決しているため、Spark フレームワークに関しては、elisah の回答の方が優れていると思います。ただし、エリアサの答えにはデータの正規化がありません。したがって、エリアサの回答に次の行を追加します。

from pyspark.ml.feature import StandardScaler
standardizer = StandardScaler(withMean=True, withStd=False,
                          inputCol='features',
                          outputCol='std_features')
model = standardizer.fit(df)
output = model.transform(df)
pca_features = output.select("std_features").rdd.map(lambda row : row[0])
mat = RowMatrix(pca_features)
svd = computeSVD(mat,5,True)

当然、svd.V と identity_features.select("pca_features").collect() は同じ値を持つ必要があります。

PCA と、Spark および sklearn での PCA の使用について、このブログ投稿にまとめました。

apache-spark - Pyspark と PCA: この PCA の固有ベクトルを抽出するにはどうすればよいですか? 彼らが説明している分散の量をどのように計算できますか?

4 に答える 4

Related

Reference