r - Creating a large, multi-column frequency table in R

Question

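I'm struggling to do this efficiently, and I apologize if this is a basic question. I need to build a contingency table of N and percent that summarizes the relationships among a large number of binary variables, simply as frequencies and percentages, with no other summary statistics.

Specifically, I am summarizing the number of patients who have sample type X and clinical outcome Y. A patient can have any number of outcomes and any number of samples; that is, the variables are not mutually exclusive and are independent of one another.

I would like all outcomes (death, ICU admission, loss of a leg, etc.) as columns and all sample types (serum, urine, etc.) as rows. I only need to list the frequency and percentage of "positive" responses, e.g. the N and percentage of patients who died and had a urine sample taken.

Is there a package that helps with this kind of table? Everything I have found is good at producing nice 1xN variable contingency tables. I would not mind creating a separate table for each outcome if I could somehow extract the columns from that output and bind them into one master table to rule them all. Another idea is to build a frequency table of two mChoice (Hmisc package) variables. I don't know whether either of these two strategies is feasible.

Any ideas?

Here is the sort of thing I am looking for: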

+-------------+--------+---------+
|             | Death  | ICU     |
|             | (N=10) | (N=50)  |
+-------------+--------+---------+
|Serum (N=50) |5 (50%) | 30 (60%)|
+-------------+--------+---------+
|Urine (N=40) |10(100%)| 7 (14%) |
+-------------+--------+---------+
|Brain (N=25) |6 (60%) | 15 (30%)|
+-------------+--------+---------+
|Kidney (N=50)|7 (70%) | 40 (80%)|
+-------------+--------+---------+

Edited to include sample data:

set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

score 2 · Accepted Answer

Here is a simple and fast solution using the data.table package.

library(data.table)

# convert your data frame to data.table
  setDT(df)


# create the output for serum
  serum <- df[serum==1, .(test="serum",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(serum)]

# create the output for kidney
  kidney<- df[kidney==1, .(test="kidney",
                          test.N = .N, 
                          death.count = sum(death),
                          death.N = sum(df$death),
                          death.prop=(sum(death)/sum(df$death))*100,
                          icu.count = sum(ICU),
                          icu.N = sum(df$ICU),
                          icu.prop=(sum(ICU)/sum(df$ICU))*100),
                          by=.(kidney)]

# Bind outputs into a table
  table <- rbind( serum[,2:9,with = FALSE],
                  kidney[,2:9,with = FALSE])

table
>      test test.N death.count death.N death.prop icu.count icu.N icu.prop
> 1:  serum    806         602     752   80.05319       511   632 80.85443
> 2: kidney    190         141     752   18.75000       128   632 20.25316
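The answer above repeats the same block once per sample type. As a rough sketch of how that could be generalized to all four sample types, the block can be wrapped in a function and applied over a vector of column names. The names `dt`, `tests`, `summarise_test`, and `result` are made up for this illustration and are not part of the original answer; the data is rebuilt from the question's sample code so the block runs on its own.

```r
library(data.table)

# Sample data from the question, stored directly as a data.table
set.seed(1)
death  <- runif(1000) <= .75
ICU    <- runif(1000) <= .63
serum  <- runif(1000) <= .80
urine  <- runif(1000) <= .77
brain  <- runif(1000) <= .92
kidney <- runif(1000) <= .22
dt <- data.table(id = 1:1000, death, ICU, serum, urine, brain, kidney)

# Row variables (sample types) to summarise
tests <- c("serum", "urine", "brain", "kidney")

# Hypothetical helper: one summary row for a single sample type
summarise_test <- function(test_name) {
  dt[get(test_name) == 1,
     .(test        = test_name,
       test.N      = .N,
       death.count = sum(death),
       death.N     = sum(dt$death),
       death.prop  = 100 * sum(death) / sum(dt$death),
       icu.count   = sum(ICU),
       icu.N       = sum(dt$ICU),
       icu.prop    = 100 * sum(ICU) / sum(dt$ICU))]
}

# Stack the per-sample rows into one table
result <- rbindlist(lapply(tests, summarise_test))
result
```

Adding another sample type then only means adding its column name to `tests`. Outcomes are still hard-coded as `death` and `ICU`, as in the original answer.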

score 1 · Accepted Answer

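Edit: this is a revised answer provided after discussing the problem with the original poster. The old answer, which does not solve the problem at hand, is kept below for posterity.

This answer is neither short nor concise, and I hope there is a cleaner way. However, the following works.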

## generate example data
set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

## load up our data manipulation workhorses
library(reshape2)
library(plyr)

## save typing by saving row and column var names
row.vars <- c("serum", "urine", "brain", "kidney")
col.vars <- c("death", "ICU")

## melt data so the sample types (row vars) are stacked in one column
dat.m <- melt(df, measure.vars = row.vars)

## keep only rows where the sample was actually taken (value == 1)
dat.m <- dat.m[dat.m$value == 1, ]

## for each sample type, count the 1's in death and ICU
tab <- ddply(dat.m, "variable", function(DF) {
  colwise(function(x) length(x[x==1]))(DF[col.vars])
})

## calculate overall counts (N) for row and column vars
row.nums <- sapply(df[row.vars], function(x) length(x[x==1]))
col.nums <- sapply(df[col.vars], function(x) length(x[x==1]))

## paste row and column counts into row and column names
rownames(tab) <- paste(tab$variable, " (N=", row.nums, ")", sep="")
tab$variable <- NULL
colnames(tab) <- paste(names(tab), " (N=", col.nums, ")", sep="")

## calculate cell proportions and paste them in one column at a time
tab[[1]] <- paste(tab[[1]],
                  " (",
                  round(100*(tab[[1]]/col.nums[[1]]), digits=2),
                  "%)",
                  sep="")
tab[[2]] <- paste(tab[[2]],
                  " (",
                  round(100*(tab[[2]]/col.nums[[2]]),
                        digits=2),
                  "%)",
                  sep="")

Now we can

## behold the fruits of our labor
tab
               death (N=752)  ICU (N=632)
serum (N=806)   602 (80.05%) 511 (80.85%)
urine (N=739)   556 (73.94%)  462 (73.1%)
brain (N=910)   684 (90.96%) 576 (91.14%)
kidney (N=190)  141 (18.75%) 128 (20.25%)
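Because every variable here is coded 0/1, the cell counts are also the cross-product of the sample-type columns with the outcome columns. The following is a minimal base R sketch of that idea, not the answer's method; the names `counts`, `row.n`, `col.n`, `pct`, and `tab2` are introduced only for this illustration. Percentages are computed against the column totals, as in the answer above.

```r
# Sample data from the question
set.seed(1)
death<-runif(1000)<=.75
ICU<-runif(1000)<=.63
serum<-runif(1000)<=.80
urine<-runif(1000)<=.77
brain<-runif(1000)<=.92
kidney<-runif(1000)<=.22
df<-as.data.frame(cbind((1:1000),death,ICU,serum,urine,brain,kidney))

row.vars <- c("serum", "urine", "brain", "kidney")
col.vars <- c("death", "ICU")

# t(X) %*% Y on 0/1 columns counts patients positive for both variables
counts <- crossprod(as.matrix(df[row.vars]), as.matrix(df[col.vars]))

# marginal counts for the row and column labels
row.n <- colSums(df[row.vars])
col.n <- colSums(df[col.vars])

# percentage of each outcome's patients who had each sample type
pct <- round(100 * sweep(counts, 2, col.n, "/"), 2)

# combine counts and percentages into one labelled character matrix
tab2 <- matrix(paste0(counts, " (", pct, "%)"),
               nrow = nrow(counts),
               dimnames = list(paste0(row.vars, " (N=", row.n, ")"),
                               paste0(col.vars, " (N=", col.n, ")")))
noquote(tab2)
```

This needs no extra packages and scales to more row or column variables by extending `row.vars` and `col.vars`, but it does assume that every column is strictly 0/1 with no missing values.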

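OLD ANSWER (does not solve the problem at hand, but may be useful for related tasks)

This is one of those things that should be easy, but somehow isn't.

There is an existing question that deals with this once you have two columns ready to tabulate. Getting to that point is the easy part: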

# function to generate example data
mkdat <- function() factor(sample(letters[1:4], 10, replace=TRUE), levels=letters[1:4])

# make example data
set.seed(10)
dat <- data.frame(id = 1:10, var1 = mkdat(), var2=mkdat(), var3=mkdat())

# use reshape2 package to reshape from wide to long form
library(reshape2)
dat.m <- melt(dat, id.vars="id")
dat.m$value <- factor(dat.m$value)

A crosstab of dat.m$variable and dat.m$value then gives the correct cells. For how to get both counts and percentages from that table, see the existing question mentioned above, or use the following:

# tabulate
library(plyr)
tab <- ddply(dat.m, "variable",
             function(DF) {
               # get counts with table
               count <- table(DF$value)
               # convert counts to percent
               prop <- paste(prop.table(count)*100, "%", sep="")
               # combine count and percent
               cp <- paste(count, " (", prop, ")", sep="")
               # re-attach the names
               names(cp) <- levels(DF$value)
               return(cp)
             })

# get row n
tab.r <- table(dat.m$variable)
# get column n
tab.c <- table(dat.m$value)
# paste row and column n into row and column names
# (drop the "variable" column first so colnames(tab) lines up with tab.c)
rownames(tab) <- paste(tab$variable, " (n = ", tab.r, ")", sep="")
tab$variable <- NULL
colnames(tab) <- paste(colnames(tab), " (n = ", tab.c, ")", sep="")

# works, but that was way too much effort.
print(tab)

You have to admit that this is a lot of work for a simple table of counts and proportions. I would be glad if someone came up with an easier way.
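For comparison, a similar count-and-percent table can be sketched with base R's table() and prop.table() alone. This is only an illustration under the same example data; `vars`, `long`, `counts`, `pct`, and `out` are names made up here, and percentages are taken within each row, as in the ddply version above.

```r
# example data, generated the same way as above
mkdat <- function() factor(sample(letters[1:4], 10, replace=TRUE), levels=letters[1:4])
set.seed(10)
dat <- data.frame(id = 1:10, var1 = mkdat(), var2 = mkdat(), var3 = mkdat())

# stack the factor columns into long form without extra packages
vars <- c("var1", "var2", "var3")
long <- data.frame(
  variable = rep(vars, each = nrow(dat)),
  value    = factor(unlist(lapply(dat[vars], as.character), use.names = FALSE),
                    levels = letters[1:4])
)

# counts per variable/value pair, and row percentages
counts <- table(long$variable, long$value)
pct    <- round(100 * prop.table(counts, margin = 1), 1)

# combine into a labelled character matrix
out <- matrix(paste0(counts, " (", pct, "%)"),
              nrow = nrow(counts),
              dimnames = list(paste0(rownames(counts), " (n = ", rowSums(counts), ")"),
                              paste0(colnames(counts), " (n = ", colSums(counts), ")")))
noquote(out)
```

Here prop.table(counts, margin = 1) divides each row by its own total; use margin = 2 instead if column percentages are wanted.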
