r - 回帰とPCAの視覚的比較

Question

SOで別の角度から議論されているブログCerebralMasticationに触発されて、回帰とPCAを比較する方法を完成させようとしています。忘れる前に、これの核心の多くをJDLongとJoshUlrichに感謝します。これを次の学期のコースで使用します。すみません、これは長いです！

更新：ほとんど機能する別のアプローチを見つけました（可能であれば修正してください！）。一番下に投稿しました。私が思いついたよりもはるかに賢くて短いアプローチ！

私は基本的に、以前のスキームをある程度まで実行しました。ランダムデータを生成し、最適な線を見つけ、残差を描画します。これは、以下の2番目のコードチャンクに示されています。しかし、私はまた、ランダムな点（この場合はデータ点）を通る線に垂直な線を描くためのいくつかの関数を掘り下げて書きました。これらは正常に機能すると思います。これらは、機能する証拠とともにFirstCodeChunkに表示されます。

ここで、2番目のコードチャンクは@JDLongと同じフローを使用してすべての動作を示し、結果のプロットの画像を追加しています。黒、赤のデータは残差がピンクの回帰であり、青は1番目のPCであり、水色は法線である必要がありますが、明らかにそうではありません。これらの法線を描画するFirstCodeChunkの関数は問題ないように見えますが、デモンストレーションでは何かが正しくありません。何かを誤解しているか、間違った値を渡しているに違いないと思います。私の法線は水平になっています。これは便利な手がかりのようです（しかし、これまでのところ、私にはわかりません）。誰かがここで何が悪いのかわかりますか？

おかげで、これはしばらくの間私を悩ませてきました... 問題を示すプロット

最初のコードチャンク（法線を描画し、それらが機能することを証明する関数）：

##### The functions below are based very loosely on the citation at the end

pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.

    Ax <- Px-10*diff(range(Px))
    Bx <- Px+10*diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
    }

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {

    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan

    PB <- data.frame(x = Px - Bx, y = Py - By)
    AB <- data.frame(x = Ax - Bx, y = Ay - By)
    PB <- as.matrix(PB)
    AB <- as.matrix(AB)
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
        if (k_raw[n] < 0)  { k[n] <- 0
            } else { if (k_raw[n] > 1) k[n] <- 1
                else k[n] <- k_raw[n] }
        }
    x = (k * Ax + (1 - k)* Bx)
    y = (k * Ay + (1 - k)* By)
    ans <- data.frame(x, y)
    ans
    }

# The following proves that pointOnLineNearPoint
# and pointOnLine work properly and accept vectors

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted
# and right angles don't appear as right angles

m <- runif(1, -5, 5)
b <- runif(1, -20, 20)
plot(-20:20, -20:20, type = "n", xlab = "x values", ylab = "y values")
abline(b, m )

Px <- rnorm(10, 0, 4)
Py <- rnorm(10, 0, 4)

res <- pointOnLineNearPoint(Px, Py, m, b)
points(Px, Py, col = "red")
segments(Px, Py, res[,1], res[,2], col = "blue")

##========================================================
##
##  Credits:
##  Theory by Paul Bourke http://local.wasp.uwa.edu.au/~pbourke/geometry/pointline/
##  Based in part on C code by Damian Coventry Tuesday, 16 July 2002
##  Based on VBA code by Brandon Crosby 9-6-05 (2 dimensions)
##  With grateful thanks for answering our needs!
##  This is an R (http://www.r-project.org) implementation by Gregoire Thomas 7/11/08
##
##========================================================

2番目のコードチャンク（デモンストレーションをプロット）：

set.seed(55)
np <- 10 # number of data points
x <- 1:np
e <- rnorm(np, 0, 60)
y <- 12 + 5 * x + e

par(mar = c(4, 4, 4, 4)) # otherwise the plot is slightly distorted

plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals")
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")

# pca "by hand"
xyNorm <- cbind(x = x - mean(x), y = y - mean(y)) # mean centers
xyCov <- cov(xyNorm)
eigenValues <- eigen(xyCov)$values
eigenVectors <- eigen(xyCov)$vectors

# Add the first PC by denormalizing back to original coords:

new.y <- (eigenVectors[2,1]/eigenVectors[1,1] * xyNorm[x]) + mean(y)
lines(x, new.y, col = "blue", lwd = 2)

# Now add the normals

yx2.lm <- lm(new.y ~ x) # zero residuals: already a line
res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])
points(res[,1], res[,2], col = "blue", pch = 20) # segments should end here
segments(x, y, res[,1], res[,2], col = "lightblue1") # the normals

＃＃＃＃＃＃＃＃＃＃＃＃アップデート

Vincent Zoonekyndのページで、私が望んでいたものをほぼ正確に見つけました。しかし、それは完全には機能しません（明らかに機能するために使用されます）。これは、垂直軸を介して反射された最初のPCの法線をプロットするそのサイトからのコードの抜粋です。

set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
plot(y~x, asp = 1)
r <- lm(y~x)
abline(r, col='red')

r <- princomp(cbind(x,y))
b <- r$loadings[2,1] / r$loadings[1,1]
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue")
title(main='Appears to use the reflection of PC1')

u <- r$loadings
# Projection onto the first axis
p <- matrix( c(1,0,0,0), nrow=2 )
X <- rbind(x,y)
X <- r$center + solve(u, p %*% u %*% (X - r$center))
segments( x, y, X[1,], X[2,] , col = "lightblue1")

そしてここに結果があります：

ここに画像の説明を入力してください

score 7 · Accepted Answer

了解しました。自分の質問に答える必要があります。人々がインターネット上で行っている方法をさらに読んで比較した後、私は問題を解決しました。何度も繰り返したので、何を「修正」したかを明確に述べることができるかどうかはわかりません。とにかく、ここにプロットとコード（MWE）があります。わかりやすくするために、ヘルパー関数は最後にあります。

作業デモ

# Comparison of Linear Regression & PCA
# Generate sample data

set.seed(39) # gives a decent-looking example
np <- 10 # number of data points
x <- -np:np
e <- rnorm(length(x), 0, 10)
y <- rnorm(1, 0, 2) * x + 3*rnorm(1, 0, 2) + e

# Plot the main data & residuals

plot(x, y, main = "Regression minimizes the y-residuals & PCA the normals", asp = 1)
yx.lm <- lm(y ~ x)
lines(x, predict(yx.lm), col = "red", lwd = 2)
segments(x, y, x, fitted(yx.lm), col = "pink")

# Now the PCA using built-in functions
# rotation = loadings = eigenvectors

r <- prcomp(cbind(x,y), retx = TRUE)
b <- r$rotation[2,1] / r$rotation[1,1] # gets slope of loading/eigenvector 1
a <- r$center[2] - b * r$center[1]
abline(a, b, col = "blue") # Plot 1st PC

# Plot normals to 1st PC

X <- pointOnLineNearPoint(x, y, b, a)
segments( x, y, X[,1], X[,2], col = "lightblue1")

###### Needed Functions

pointOnLineNearPoint <- function(Px, Py, slope, intercept) {
    # Px, Py is the point to test, can be a vector.
    # slope, intercept is the line to check distance.

    Ax <- Px-10*diff(range(Px))
    Bx <- Px+10*diff(range(Px))
    Ay <- Ax * slope + intercept
    By <- Bx * slope + intercept
    pointOnLine(Px, Py, Ax, Ay, Bx, By)
    }

pointOnLine <- function(Px, Py, Ax, Ay, Bx, By) {

    # This approach based upon comingstorm's answer on
    # stackoverflow.com/questions/3120357/get-closest-point-to-a-line
    # Vectorized by Bryan

    PB <- data.frame(x = Px - Bx, y = Py - By)
    AB <- data.frame(x = Ax - Bx, y = Ay - By)
    PB <- as.matrix(PB)
    AB <- as.matrix(AB)
    k_raw <- k <- c()
    for (n in 1:nrow(PB)) {
        k_raw[n] <- (PB[n,] %*% AB[n,])/(AB[n,] %*% AB[n,])
        if (k_raw[n] < 0)  { k[n] <- 0
            } else { if (k_raw[n] > 1) k[n] <- 1
                else k[n] <- k_raw[n] }
        }
    x = (k * Ax + (1 - k)* Bx)
    y = (k * Ay + (1 - k)* By)
    ans <- data.frame(x, y)
    ans
    }

score 1 · Accepted Answer

コードの次の行を変更してみてください。

res <- pointOnLineNearPoint(x, y, yx2.lm$coef[2], yx2.lm$coef[1])

に

res <- pointOnLineNearPoint(x, new.y, yx2.lm$coef[2], yx2.lm$coef[1])

したがって、正しいy値を呼び出しています。

score 1 · Accepted Answer

Vincent Zoonekyndのコードで、行u <- r$loadingsをに変更しu <- solve(r$loadings)ます。の2番目のインスタンスではsolve()、最初の主軸に沿った予測コンポーネントスコア（つまり、2番目の予測コンポーネントスコアがゼロに設定された予測スコアのマトリックス）に、負荷/固有ベクトルの逆数を掛ける必要があります。データに負荷を掛けると、予測スコアが得られます。予測スコアを負荷で割るとデータが得られます。お役に立てば幸いです。

r - 回帰とPCAの視覚的比較

3 に答える 3

Related

Reference