r - A better way to construct confidence bands around the observed sample mean/median using ggplot2

Question

I have a three-column data frame containing Trial, an independent Variable, and Observation. Something like:

df1<- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(1:50))

I am trying to plot a 95% confidence interval around the mean for each trial, using the following rather inefficient method:

    b<-NULL
    b$mean<- aggregate(Observation~Variable, data=df1,mean)[,2]
    b$sd  <- aggregate(Observation~Variable, data=df1,sd)[,2]
    b$Variable<- df1$Variable
    b$Observation <- df1$Observation 
    b$ucl <- rep(qnorm(.975, mean=b$mean, sd=b$sd), each=10)
    b$lcl <- rep(qnorm(.025, mean=b$mean, sd=b$sd), each=10)
    b<- as.data.frame(b)
    c <- ggplot(b, aes(Variable, Observation))  
    c + geom_point(color="red") + 
    geom_smooth(aes(ymin = lcl, ymax = ucl), data=b, stat="summary", fun.y="mean")

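This is inefficient because it duplicates the ymin and ymax values. I have looked at the geom_ribbon approach, but I would still need to duplicate them. If I were using some kind of smoothing such as glm, however, the code would be much simpler and there would be no duplication. Is there a better way to do this?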

References: 1. R Plotting confidence bands with ggplot 2. Shading confidence intervals manually with ggplot2 3. http://docs.ggplot2.org/current/geom_smooth.html

score 10 · Accepted Answer

This method produces the same output as yours. It was inspired by the ggplot documentation. Again, it is only meaningful as long as each x value has multiple points.

library(ggplot2)

set.seed(1)
df1 <- data.frame(Trial=rep(1:10,5), Variable=rep(1:5, each=10), Observation=rnorm(50))

# Summary function for stat_summary(): returns the centre line (y)
# and the band limits (ymin, ymax) for the points at each x value.
my_ci <- function(x) data.frame(
  y=mean(x), 
  ymin=mean(x) - 2 * sd(x), 
  ymax=mean(x) + 2 * sd(x)
)

# Recent ggplot2 versions may need se=TRUE in stat_summary() to draw the band.
ggplot(df1, aes(Variable, Observation)) + geom_point(color="red") +
  stat_summary(fun.data="my_ci", geom="smooth")
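For comparison, the duplication in the question can also be avoided with geom_ribbon by computing the summary once per Variable and handing that small table to the ribbon layer. The following is a minimal sketch under that assumption; the names band, avg, and p are made up for illustration, and the band is the same mean ± 2 sd as above.

```r
library(ggplot2)

set.seed(1)
df1 <- data.frame(Trial = rep(1:10, 5), Variable = rep(1:5, each = 10),
                  Observation = rnorm(50))

# One row per Variable: mean, sd, and the mean +/- 2 sd limits.
band <- aggregate(Observation ~ Variable, data = df1, FUN = mean)
names(band)[2] <- "avg"
band$sd <- aggregate(Observation ~ Variable, data = df1, FUN = sd)$Observation
band$ymin <- band$avg - 2 * band$sd
band$ymax <- band$avg + 2 * band$sd

# The ribbon and line layers read the 5-row summary table;
# the point layer still uses the raw 50-row data.
p <- ggplot(df1, aes(Variable, Observation)) +
  geom_ribbon(data = band, aes(x = Variable, ymin = ymin, ymax = ymax),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_line(data = band, aes(y = avg), colour = "blue") +
  geom_point(colour = "red")
```

Printing p draws the plot. The summary table has five rows instead of fifty, so nothing is repeated per observation.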


score 7 · Accepted Answer

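The ggplot2 package comes with wrappers for a number of summary functions in the Hmisc package, including:

mean_cl_normal, which calculates confidence limits based on the t-distribution,
mean_cl_boot, which uses a bootstrap method that does not assume a distribution for the mean, and
mean_sdl, which uses a multiple of the standard deviation (default = 2).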

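This last method is the same as in the answer above, but it is not a 95% CL. Confidence limits based on the t-distribution are given by:

CL = x̄ ± t × s / √n

where x̄ is the sample mean, t is the appropriate quantile of the t-distribution, s is the sample standard deviation, and n is the sample size. Compare the confidence bands: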

# mean_sdl needs the Hmisc package to be installed.
# Arguments for the summary function are passed through fun.args.
ggplot(df1, aes(x=Variable, y=Observation)) + 
  stat_summary(fun.data="mean_sdl", geom="line", colour="blue")+
  stat_summary(fun.data="mean_sdl", fun.args=list(mult=2), geom="errorbar", 
               width=0.1, linetype=2, colour="blue")+
  geom_point(color="red") +
  labs(title=expression(paste(bar(x)," \u00B1 ","2 * sd")))

# mean_cl_normal also needs Hmisc; conf.int is passed through fun.args.
ggplot(df1, aes(x=Variable, y=Observation)) + 
  geom_point(color="red") +
  stat_summary(fun.data="mean_cl_normal", geom="line", colour="blue")+
  stat_summary(fun.data="mean_cl_normal", fun.args=list(conf.int=0.95), geom="errorbar", 
               width=0.1, linetype=2, colour="blue")+
  stat_summary(fun.data="mean_cl_normal", geom="point", size=3, 
               shape=1, colour="blue")+
  labs(title=expression(paste(bar(x)," \u00B1 ","t * sd / sqrt(n)")))
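To make the t-based formula concrete, the limits can be worked out by hand in base R. This is a sketch of the calculation as I understand it, not the Hmisc source; the helper t_ci and the table ci_table are hypothetical names introduced here.

```r
set.seed(1)
df1 <- data.frame(Trial = rep(1:10, 5), Variable = rep(1:5, each = 10),
                  Observation = rnorm(50))

# Hypothetical helper: mean +/- t * s / sqrt(n) for one group of observations.
t_ci <- function(x, conf.int = 0.95) {
  n <- length(x)
  # Two-sided quantile of the t-distribution with n - 1 degrees of freedom.
  t_quantile <- qt(1 - (1 - conf.int) / 2, df = n - 1)
  half_width <- t_quantile * sd(x) / sqrt(n)
  c(y = mean(x), ymin = mean(x) - half_width, ymax = mean(x) + half_width)
}

# One row of limits per value of Variable.
groups <- split(df1$Observation, df1$Variable)
ci_table <- data.frame(Variable = as.integer(names(groups)),
                       do.call(rbind, lapply(groups, t_ci)))
ci_table
```

With n = 10 per group the t quantile is about 2.26, and the half-width is further divided by √10, so these limits are much narrower than the mean ± 2 sd band, which describes the spread of the data and not the uncertainty of the mean.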

Finally, rotating this last plot with coord_flip() produces something very close to a Forest Plot, which is a standard way of summarizing data.

# conf.int is passed to mean_cl_normal through fun.args.
ggplot(df1, aes(x=Variable, y=Observation)) + 
  geom_point(color="red") +
  stat_summary(fun.data="mean_cl_normal", fun.args=list(conf.int=0.95), geom="errorbar", 
               width=0.2, colour="blue")+
  stat_summary(fun.data="mean_cl_normal", geom="point", size=3, 
               shape=1, colour="blue")+
  geom_hline(aes(yintercept=mean(Observation)), linetype=2)+
  labs(title="Forest Plot")+
  coord_flip()
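Since the question also mentions the median, here is a rough sketch of a percentile bootstrap that works for any statistic, in the spirit of the bootstrap approach mentioned above. The function boot_ci, its arguments, and the default of 1000 resamples are assumptions made for illustration and are not taken from Hmisc.

```r
set.seed(1)
df1 <- data.frame(Trial = rep(1:10, 5), Variable = rep(1:5, each = 10),
                  Observation = rnorm(50))

# Hypothetical helper: percentile bootstrap limits for an arbitrary statistic.
boot_ci <- function(x, stat = median, B = 1000, conf.int = 0.95) {
  # Recompute the statistic on B resamples drawn with replacement.
  reps <- replicate(B, stat(sample(x, replace = TRUE)))
  alpha <- 1 - conf.int
  limits <- quantile(reps, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  data.frame(y = stat(x), ymin = limits[1], ymax = limits[2])
}

# One row of median limits per value of Variable.
groups <- split(df1$Observation, df1$Variable)
median_ci <- do.call(rbind, lapply(groups, boot_ci))
median_ci$Variable <- as.integer(names(groups))
median_ci
```

Because boot_ci returns a data frame with y, ymin, and ymax, it has the same shape as my_ci in the first answer and should be usable in the same way, for example stat_summary(fun.data=boot_ci, geom="errorbar").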
