r - Performing operations on subsets using data.table

Question

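I have a survey data set in wide format. For a given question, the raw data contains a set of variables, one per month, recording the fact that the question was asked in that particular month.

I want to create a new set of variables with month-invariant names. The value of each new variable should be the value of the month-specific question for the month in which the row was observed.

Please see an example / fictitious data set: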

require(data.table)

data <- data.table(month = rep(c('may', 'jun', 'jul'),  each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'),  each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'),  each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'),  each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

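In this survey there are really only two questions: "q1" and "q2". Each of these questions is asked repeatedly over several months. However, an observation contains a valid response only if the month observed in the data matches the month of the survey question.

For example, "may.q1" is observed as "yes" for every observation whose month is "may". I would like a new "Q1" variable that represents "may.q1", "jun.q1", and "jul.q1". The value of "Q1" takes the value of "may.q1" when the month is "may", and the value of "jun.q1" when the month is "jun".

If I were to do this by hand with data.table, I would want something like: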

mdata <- data[month == 'may', c('month', 'may.q1', 'may.q2'), with = FALSE]
setnames(mdata, names(mdata), gsub('may\\.', '', names(mdata)))

I would like this repeated with "by = month".

If I were using the "plyr" package on a data frame, I would solve it with the following approach:

require(plyr)
data <- data.frame(data)

mdata <- ddply(data, .(month), function(dfmo) {
    dfmo <- dfmo[, c(1, grep(dfmo$month[1], names(dfmo)))]
    names(dfmo) <- gsub(paste0(dfmo$month[1], '\\.'), '', names(dfmo))
    return(dfmo)
})

My data is large, so any help with a data.table method would be appreciated. Thank you.

score 6 · Accepted Answer

Another way to illustrate:

data[, .SD[,paste0(month,c(".q1",".q2")), with=FALSE], by=month]

    month  may.q1     may.q2
 1:   may     yes       econ
 2:   may     yes       econ
 3:   may     yes       econ
 4:   may     yes       econ
 5:   may     yes       econ
 6:   jun   lunch      foggy
 7:   jun   lunch      foggy
 8:   jun   lunch      foggy
 9:   jun   lunch      foggy
10:   jun   lunch      foggy
11:   jul oranges heavy rain
12:   jul oranges heavy rain
13:   jul oranges heavy rain
14:   jul oranges heavy rain
15:   jul oranges heavy rain

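Note, however, that the column names come from the first group (you can rename them afterwards with setnames). Also, if there are many columns and you only need a few of them, this may not be the most efficient approach. In that case, Arun's solution, which melts to long format, should be faster.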
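As a small sketch of the renaming step, assuming the example data from the question and that renaming by position is acceptable (the names q1 and q2 are illustrative choices, not something the answer prescribes):

```r
library(data.table)

# Example data from the question
data <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# For each month, keep only the two columns whose prefix matches that month
res <- data[, .SD[, paste0(month, c(".q1", ".q2")), with = FALSE], by = month]

# The result carries the first group's column names, so rename by position.
# "q1" and "q2" are assumed, month-invariant names chosen for illustration.
setnames(res, c("month", "q1", "q2"))
```

Renaming by position avoids depending on which month happens to be the first group.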

score 1 · Accepted Answer

How about something like this?

data <- data.table(
                   may.q1 = rep(c('yes', 'no', 'yes'),  each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'),  each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'),  each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5)
                   )


tmp <- reshape(data, direction = "long", varying = 1:6, sep = ".", timevar = "question")

str(tmp)
## Classes ‘data.table’ and 'data.frame':   30 obs. of  5 variables:
##  $ question: chr  "q1" "q1" "q1" "q1" ...
##  $ may     : chr  "yes" "yes" "yes" "yes" ...
##  $ jun     : chr  "breakfast" "breakfast" "breakfast" "breakfast" ...
##  $ jul     : chr  "oranges" "oranges" "oranges" "oranges" ...
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...

If you want to go a step further and melt this data again, you can use melt() from the reshape2 package.

require(reshape2)
## remove the id column if you want (id is the last col so ncol(tmp))
res <- melt(tmp[,-ncol(tmp), with = FALSE], measure.vars = c("may", "jun", "jul"), value.name = "response", variable.name = "month")

str(res)
## 'data.frame':    90 obs. of  3 variables:
##  $ question: chr  "q1" "q1" "q1" "q1" ...
##  $ month   : Factor w/ 3 levels "may","jun","jul": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response: chr  "yes" "yes" "yes" "yes" ...
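The reshaped output above no longer carries the observed month, so on its own it cannot pick the answer that matches each row. A minimal sketch of one way to close that gap follows. It uses data.table's own melt() and dcast() methods rather than reshape2, and it adds a hypothetical row identifier column named id; neither comes from the answer itself.

```r
library(data.table)

# Example data from the question, including the observed month
wide <- data.table(month = rep(c('may', 'jun', 'jul'), each = 5),
                   may.q1 = rep(c('yes', 'no', 'yes'), each = 5),
                   jun.q1 = rep(c('breakfast', 'lunch', 'dinner'), each = 5),
                   jul.q1 = rep(c('oranges', 'apples', 'oranges'), each = 5),
                   may.q2 = rep(c('econ', 'math', 'science'), each = 5),
                   jun.q2 = rep(c('sunny', 'foggy', 'cloudy'), each = 5),
                   jul.q2 = rep(c('no rain', 'light mist', 'heavy rain'), each = 5))

# Assumed helper column: a row id so each observation can be rebuilt later
wide[, id := .I]

# Melt every month.question column into long format
long <- melt(wide, id.vars = c("id", "month"),
             variable.name = "variable", value.name = "response",
             variable.factor = FALSE)

# Split names such as "may.q1" into the question month and the question name
long[, c("var_month", "question") := tstrsplit(variable, ".", fixed = TRUE)]

# Keep only the answers whose question month matches the observed month
long <- long[month == var_month]

# Back to one row per observation, with month-invariant columns q1 and q2
res <- dcast(long, id + month ~ question, value.var = "response")
```

Melting touches every month column once, which is why this style tends to scale better than selecting columns group by group when there are many of them.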
