r - データのサブセットからランダムに抽出し、R で統計テストをブートストラップする方法

Question

2 つの変数を含むデータセットがあり、それらがブートストラップループで関連しているかどうかを統計的にテストしたいと考えています (つまり、Spearman のランク補正を使用してcor.test(...))。

私のデータセットの測定値のほとんどは、独立したサンプル単位 (単位を植物と呼びましょう) からのものですが、一部の測定値は同じ植物からのものです。疑似複製の問題に対処するために、テストの各実行で各プラントからの測定値を 1 つだけ使用して、統計テストを何度もブートストラップしたいと考えています。したがって、相関テストを実行する前に、プラントごとに 1 つの測定値をランダムに抽出するブートストラップループを作成する必要があります (その後、このプロセスを 99 回繰り返します)。

99 個のテストのそれぞれについて、p 値、rho、および S 統計を含む csv ファイルを作成したいと考えています。

サンプルデータ:

dput(df)

structure(list(Plant = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L, 
9L, 10L, 10L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 18L, 
19L, 20L, 21L), Length = c(170L, 232L, 123L, 190L, 112L, 207L, 
93L, 291L, 178L, 206L, 141L, 257L, 304L, 222L, 279L, 192L, 101L, 
253L, 176L, 278L, 311L, 129L, 191L, 205L, 226L), Count = c(7L, 
9L, 5L, 7L, 5L, 6L, 2L, 10L, 6L, 7L, 4L, 8L, 11L, 7L, 8L, 5L, 
5L, 9L, 7L, 6L, 9L, 4L, 5L, 7L, 6L)), .Names = c("Plant", "Length", 
"Count"), class = "data.frame", row.names = c(NA, -25L))


   Plant Length Count
1      1    170     7
2      2    232     9
3      3    123     5
4      4    190     7
5      5    112     5
6      6    207     6
7      6     93     2   
8      7    291    10  etc....

これまでのところ、複数の行で表される各植物の 1 つの行をランダムに描画することから始めて、統計テストを実行する前にこれらの値を残りのデータと組み合わせて、以下のコードをまとめました。ただし、統計テストを実行してサイクルを複数回実行するために、ブートストラップ関数 (boot()またはなど) を組み込むのに苦労しています。bootstrap()

# 1. create dataframe without plants with >1 measurement/row (in this example plant 6,10 & 18 have multiple rows)
df_uniq = df[ ! df$Plant %in% c(6,10,18), ]

# 2. create data subsets for each plant with >1 measurement/row
dup1 = df[6:7,]
dup2 = df[11:13,] 
dup3 = df[21:22,]

# 3. randomly draw one row for each plant with multiple measurements
d1_draw = dup1[sample(nrow(dup1), 1), ]
d2_draw = dup2[sample(nrow(dup2), 1), ]
d3_draw = dup3[sample(nrow(dup3), 1), ]

# 4. merge df_uniq with randomly drawn rows for each plant with multiple measurements
df_merge = rbind(df_uniq, d1_draw, d2_draw, d3_draw)

# 5. Test whether the two variables (length & Count) are related and write results to file
cor_res <- cor.test(df_merge$Length, df_merge$Count, method= "spearman")
write.csv(matrix(c(cor_res$statistic, cor_res$p.value, cor_res$estimate)), row.names=c("statistic", "p.value", "rho"), "test_output.csv")

問題を解決するための迅速かつエレガントな方法があると確信しています。どんな援助でも大歓迎です！どうもありがとう。

score 3 · Accepted Answer

そもそもなぜ一意の行を抽出するのですか? 行が 1 つしかない場合、その植物を 1 回サンプリングすると、その行は維持されますが、複数の行がある植物からはランダムにサンプリングされます。

したがって、次のことができます。

set.seed(123)
library(plyr)
ddply(df, .(Plant), function(x) { y <- x[sample(nrow(x), 1) ,] })

#   Plant Length Count howmany
#1      1    170     7       1
#2      2    232     9       1
#3      3    123     5       1
#4      4    190     7       1
#5      5    112     5       1
#6      6    207     6       2
#7      7    291    10       1
#8      8    178     6       1
#9      9    206     7       1
#10    10    257     8       3
#11    11    222     7       1
#12    12    279     8       1
#13    13    192     5       1
#14    14    101     5       1
#15    15    253     9       1
#16    16    176     7       1
#17    17    278     6       1
#18    18    311     9       2
#19    19    191     5       1
#20    20    205     7       1
#21    21    226     6       1

そしてあなたとcor.test

# first create your own function:
myrandomcors <- function(P){
ss <- ddply(P, .(Plant), function(x) { y <- x[sample(nrow(x), 1) ,] })
cor_res <- cor.test(ss$Length, ss$Count, method= "spearman")
return(c(stat = cor_res$statistic, p = cor_res$p.value, est = cor_res$estimate))
}

# then repeat it 5 times...
answer <- do.call( rbind, replicate(5, myrandomcors(df), simplify=FALSE ) )

#    > answer
#       stat.S            p   est.rho
#[1,] 352.4557 4.275291e-05 0.7711327
#[2,] 461.2733 4.060286e-04 0.7004719
#[3,] 340.2024 3.159626e-05 0.7790893
#[4,] 368.3967 6.227648e-05 0.7607814
#[5,] 342.4391 3.341956e-05 0.7776369

r - データのサブセットからランダムに抽出し、R で統計テストをブートストラップする方法

1 に答える 1

Related

Reference