r - Randomly sampling with a condition

Question

My problem is inside a loop. I have a large dataset (DF), a subset of which looks like this:

ID     Site Species
101     4   x
101     4   y
101     4   z
102     6   x
102     6   z
102     6   a
102     6   b
103     6   a
103     6   z
103     6   c
103     6   x
103     6   y
105     6   x
105     6   y
105     6   a
105     6   z
108     1   x
108     1   a
108     1   c
108     1   z

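On each iteration of the loop, I want to randomly select all rows for i individuals (IDs), with the important constraint that only one ID comes from each site. A separate function subsets the large dataset to match the number of sites, so if i=1, only one of the sites above (for example) would be present in the subset.

If i=3, as in the example posted, I would need all rows for 101, all rows for one of 102, 103 or 105, and all rows for 108.

I think something like ddply() with sample() could do this, but I can't get the selection to happen randomly.

Any suggestions are very welcome. Thanks

James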

score 1 · Accepted Answer

I think you can use unique to find all the possible IDs and sites, then sample from those unique values and subset.

For example, let's create a dataset:

# Set the RNG seed for reproducibility
set.seed(12345)
ID <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6))
site <- rep(1:6, c(8, 7, 8, 11, 4, 11))
species <- sample(letters[1:5], length(ID), replace=TRUE)

df <- data.frame(ID=ID, Site=site, Species=species)

So df looks like this:

> head(df, 15)
    ID Site Species
1  100    1       d
2  100    1       e
3  101    1       d
4  101    1       e
5  101    1       c
6  101    1       a
7  101    1       b
8  101    1       c
9  102    2       d
10 102    2       e
11 102    2       a
12 103    2       a
13 104    2       d
14 104    2       a
15 104    2       b

Summarizing the data, the IDs found at each site are:

Site 1 -> 100, 101
Site 2 -> 102, 103, 104
Site 3 -> 105
Site 4 -> 106, 107
Site 5 -> 108
Site 6 -> 109, 110
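
The site-to-ID summary above can be reproduced in base R. This is a minimal sketch, assuming the example data built earlier (rebuilt here without the Species column so the snippet runs on its own); the name ids.by.site is only illustrative.

```r
# Rebuild the ID and Site columns of the example data
ID <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6))
site <- rep(1:6, c(8, 7, 8, 11, 4, 11))
df <- data.frame(ID = ID, Site = site)

# Split the IDs by site, then keep the distinct IDs within each site.
# The result is a list with one element per site, named by site number.
ids.by.site <- lapply(split(df$ID, df$Site), unique)
ids.by.site
```

Keeping the result as a list matters because sites can hold different numbers of IDs.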

Now suppose we want to pick from 3 sites.

# The number of sites we want to sample
num.sites <- 3
# Find all the sites
all.sites <- unique(df$Site)
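# Pick the sites.
# You may also want to check that num.sites <= length(all.sites)
# Indexing via sample.int() avoids sample() treating a single number n as 1:n
sites <- all.sites[sample.int(length(all.sites), num.sites)]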

In this case, we picked

> sites
[1] 4 5 6
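
Two things can go wrong when picking sites: asking for more sites than exist, and the case from the question where only one site is present in the subset. The sketch below guards against both. The helper name safe.sample is hypothetical and not part of the original answer.

```r
# Hypothetical helper: draw `size` elements of x without replacement,
# behaving sensibly even when x has length 1
safe.sample <- function(x, size) {
  stopifnot(size <= length(x))       # cannot pick more elements than exist
  x[sample.int(length(x), size)]     # sample positions, then index into x
}

all.sites <- 6                       # e.g. a subset that contains only site 6
safe.sample(all.sites, 1)            # can only return 6

# By contrast, sample(6, 1) draws from 1:6, so it may return a site
# number that does not occur in the data at all.
```

Sampling positions rather than values is the usual way around this special case of sample().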

Now we find the IDs available in each of those sites

# Now find the IDs in each of those sites
# simplify=FALSE is VERY important to ensure we get a list even if every
# site has the same number of IDs
IDs <- sapply(sites, function(s)
    {
    unique(df$ID[df$Site==s])
    }, simplify=FALSE)

which gives us

> IDs
[[1]]
[1] 106 107

[[2]]
[1] 108

[[3]]
[1] 109 110

Then we pick one ID per site and keep all of its rows

# NOTE: this assumes the same ID is not found in multiple sites
# but it's easy to deal with the opposite case
# Again, we return a list, because sapply does not seem 
# to play well with data frames... (try it!)
res <- sapply(IDs, function(i)
  {
  # Sample a position, so a single remaining ID n is not treated as 1:n
  chosen.ID <- i[sample.int(length(i), 1)]
  df[df$ID==chosen.ID,]
  }, simplify=FALSE)

# Finally convert the list to a data frame
res <- do.call(rbind, res)


> res
    ID Site Species
24 106    4       d
25 106    4       d
26 106    4       b
27 106    4       d
28 106    4       c
29 106    4       b
30 106    4       c
31 106    4       d
32 106    4       a
35 108    5       b
36 108    5       b
37 108    5       e
38 108    5       e
44 110    6       d
45 110    6       b
46 110    6       b
47 110    6       a
48 110    6       a
49 110    6       a
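
The note in the code comments says the approach assumes an ID never appears at more than one site. If that assumption does not hold, one option is to match on Site as well as ID when subsetting. The sketch below uses a small made-up data frame (df2 and its values are hypothetical) in which ID 201 occurs at two sites.

```r
# Hypothetical data: ID 201 is recorded at both site 1 and site 2
df2 <- data.frame(ID      = c(201, 201, 202, 201, 203),
                  Site    = c(1,   1,   1,   2,   2),
                  Species = c("a", "b", "c", "d", "e"))

sites <- c(1, 2)
res <- lapply(sites, function(s) {
  ids <- unique(df2$ID[df2$Site == s])
  chosen.ID <- ids[sample.int(length(ids), 1)]
  # Match on Site too, so rows of the same ID at other sites are excluded
  df2[df2$ID == chosen.ID & df2$Site == s, ]
})
res <- do.call(rbind, res)
res
```

With the extra Site condition, each chosen site contributes rows for exactly one ID, even when that ID also exists elsewhere.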

So, putting everything into a single function:

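pickSites <- function(df, num.sites)
    {
    all.sites <- unique(df$Site)
    # Cannot pick more sites than the data contains
    stopifnot(num.sites <= length(all.sites))
    # Index via sample.int() so a single remaining site n is not read as 1:n
    chosen.sites <- all.sites[sample.int(length(all.sites), num.sites)]

    # One list element per chosen site, holding that site's unique IDs
    IDs <- sapply(chosen.sites, function(s)
        {
        unique(df$ID[df$Site==s])
        }, simplify=FALSE)

    # Pick one ID per site and keep all of its rows
    res <- sapply(IDs, function(i)
        {
        chosen.ID <- i[sample.int(length(i), 1)]
        df[df$ID==chosen.ID,]
        }, simplify=FALSE)

    # Return the combined data frame visibly
    do.call(rbind, res)
    }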
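
To connect this back to the loop in the question, here is a sketch that runs the selection for i = 1 up to the number of sites and checks the "one ID per site" condition each time. It is self-contained, so it rebuilds the example data and uses pick.one.per.site, a hypothetical condensed variant of the steps above rather than the function from the answer. The seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
ID <- rep(100:110, c(2, 6, 3, 1, 3, 8, 9, 2, 4, 5, 6))
site <- rep(1:6, c(8, 7, 8, 11, 4, 11))
species <- sample(letters[1:5], length(ID), replace = TRUE)
df <- data.frame(ID = ID, Site = site, Species = species)

# Hypothetical condensed variant: choose sites, then one ID within each
pick.one.per.site <- function(df, num.sites) {
  all.sites <- unique(df$Site)
  stopifnot(num.sites <= length(all.sites))
  chosen <- all.sites[sample.int(length(all.sites), num.sites)]
  rows <- lapply(chosen, function(s) {
    ids <- unique(df$ID[df$Site == s])
    id <- ids[sample.int(length(ids), 1)]
    df[df$ID == id & df$Site == s, ]
  })
  do.call(rbind, rows)
}

n.sites <- length(unique(df$Site))
results <- vector("list", n.sites)
for (i in seq_len(n.sites)) {
  results[[i]] <- pick.one.per.site(df, i)
  # Each result must cover i sites, with exactly one ID in each
  stopifnot(length(unique(results[[i]]$Site)) == i,
            length(unique(results[[i]]$ID)) == i)
}
```

Storing each iteration's data frame in a preallocated list keeps the loop simple and avoids growing an object inside it.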
