r - Frequency table with several variables in R

Question

I am trying to replicate a table that is often used in official statistics, but so far without success. Given a data frame like this one:

d1 <- data.frame( StudentID = c("x1", "x10", "x2", 
                          "x3", "x4", "x5", "x6", "x7", "x8", "x9"),
             StudentGender = c('F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'),
             ExamenYear    = c('2007','2007','2007','2008','2008','2008','2008','2009','2009','2009'),
             Exam          = c('algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'),
             participated  = c('no','yes','yes','yes','no','yes','yes','yes','yes','yes'),  
             passed      = c('no','yes','yes','yes','no','yes','yes','yes','no','yes'),
             stringsAsFactors = FALSE)

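I would like to produce a table showing, PER YEAR, the number of all students (All) and, of those, how many are female, how many participated, and how many passed. Note that "of which" below always refers to all students.

The table I have in mind would look like this: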

cbind(All = table(d1$ExamenYear),
  participated      = table(d1$ExamenYear, d1$participated)[,2],
  ofwhichFemale     = table(d1$ExamenYear, d1$StudentGender)[,1],
  ofwhichpassed     = table(d1$ExamenYear, d1$passed)[,2])

I am sure there is a better way to do this kind of thing in R.

Note: I have seen LaTeX solutions, but they will not work for me because I need to export the table to Excel.

Thanks in advance

score 9 · Accepted Answer

Using plyr:

require(plyr)
ddply(d1, .(ExamenYear), summarize,
      All=length(ExamenYear),
      participated=sum(participated=="yes"),
      ofwhichFemale=sum(StudentGender=="F"),
      ofWhichPassed=sum(passed=="yes"))

This gives:

  ExamenYear All participated ofwhichFemale ofWhichPassed
1       2007   3            2             2             2
2       2008   4            3             2             3
3       2009   3            3             0             2
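
Since the question mentions exporting to Excel, here is a minimal sketch of one way to do that with base R only: write the summary as a CSV file, which Excel can open. The data frame `res` simply retypes the output shown above, and the file name `exam_summary.csv` is a hypothetical choice for illustration.

```r
# Summary table retyped from the output above (illustration only).
res <- data.frame(
  ExamenYear    = c("2007", "2008", "2009"),
  All           = c(3, 4, 3),
  participated  = c(2, 3, 3),
  ofwhichFemale = c(2, 2, 0),
  ofWhichPassed = c(2, 3, 2),
  stringsAsFactors = FALSE
)

# Hypothetical output location; a temporary directory keeps the sketch tidy.
out_file <- file.path(tempdir(), "exam_summary.csv")

# row.names = FALSE avoids an extra unnamed index column in the spreadsheet.
write.csv(res, out_file, row.names = FALSE)

# Read the file back to confirm what was written.
check <- read.csv(out_file)
print(check)
```

A native .xlsx file would need an add-on package; CSV is used here only because it requires nothing beyond base R.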

score 4 · Accepted Answer

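A couple of modifications to your code (using with() to reduce the number of df$ references, and using character indices to make the code more self-documenting) would have made it easier to read and a worthy competitor to the ddply solutions.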

with( d1, cbind(All = table(ExamenYear),
  participated      = table(ExamenYear, participated)[,"yes"],
  ofwhichFemale     = table(ExamenYear, StudentGender)[,"F"],
  ofwhichpassed     = table(ExamenYear, passed)[,"yes"])
     )

     All participated ofwhichFemale ofwhichpassed
2007   3            2             2             2
2008   4            3             2             3
2009   3            3             0             2

I would expect this to be much faster than the ddply solution, although the difference will only become apparent on larger datasets.
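
The speed expectation above can be checked with a rough timing sketch such as the one below. The synthetic data frame `big`, its size `n`, and the random values are assumptions made for illustration; the plyr timing only runs if plyr is installed. No particular result is implied, since timings depend on the machine and package versions.

```r
set.seed(1)
n <- 1e5  # assumed size, large enough for timings to be measurable

# Synthetic data with the same column names as d1 (values are random).
big <- data.frame(
  ExamenYear    = sample(c("2007", "2008", "2009"), n, replace = TRUE),
  StudentGender = sample(c("F", "M"), n, replace = TRUE),
  participated  = sample(c("no", "yes"), n, replace = TRUE),
  passed        = sample(c("no", "yes"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Time the with() + table() approach from this answer.
base_time <- system.time(
  base_res <- with(big, cbind(
    All           = table(ExamenYear),
    participated  = table(ExamenYear, participated)[, "yes"],
    ofwhichFemale = table(ExamenYear, StudentGender)[, "F"],
    ofwhichpassed = table(ExamenYear, passed)[, "yes"]
  ))
)
print(base_time)

# Time the ddply approach, but only when plyr is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  plyr_time <- system.time(
    plyr_res <- plyr::ddply(big, "ExamenYear", plyr::summarise,
                            All           = length(ExamenYear),
                            participated  = sum(participated == "yes"),
                            ofwhichFemale = sum(StudentGender == "F"),
                            ofwhichpassed = sum(passed == "yes"))
  )
  print(plyr_time)
}
```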

score 4 · Accepted Answer

The plyr package is great for this sort of thing. First load the package:

library(plyr)

Then use the ddply function:

ddply(d1, "ExamenYear", summarise, 
      All = length(passed),  # any column can be used for this count
      participated = sum(participated=="yes"),
      ofwhichFemale = sum(StudentGender=="F"),
      ofwhichpassed = sum(passed=="yes"))

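Basically, ddply expects a data frame as input and returns a data frame. It splits the input data frame by ExamenYear and computes a few summary statistics on each sub-table. Note that inside ddply you do not have to use the $ notation when referring to columns.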
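
To make the split, summarise, and recombine steps described above concrete, here is a rough base R sketch of the same idea. It is not how ddply is implemented internally, only an illustration of the pattern; the names `pieces`, `summaries`, and `result` are made up for this example, and `d1` repeats only the columns of the question's data that the summary needs.

```r
# Columns of the question's data frame that the summary uses.
d1 <- data.frame(
  StudentGender = c("F", "M", "F", "M", "F", "M", "F", "M", "M", "M"),
  ExamenYear    = c("2007", "2007", "2007", "2008", "2008",
                    "2008", "2008", "2009", "2009", "2009"),
  participated  = c("no", "yes", "yes", "yes", "no",
                    "yes", "yes", "yes", "yes", "yes"),
  passed        = c("no", "yes", "yes", "yes", "no",
                    "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# 1. Split: one sub-data-frame per exam year.
pieces <- split(d1, d1$ExamenYear)

# 2. Apply: compute a one-row summary for each piece.
summaries <- lapply(pieces, function(piece) {
  data.frame(
    ExamenYear    = piece$ExamenYear[1],
    All           = nrow(piece),
    participated  = sum(piece$participated == "yes"),
    ofwhichFemale = sum(piece$StudentGender == "F"),
    ofwhichpassed = sum(piece$passed == "yes"),
    stringsAsFactors = FALSE
  )
})

# 3. Combine: stack the one-row summaries into a single data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```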

score 1 · Accepted Answer

You may also want to look at dplyr, the next iteration of plyr.

It uses a ggplot-like syntax and delivers fast performance by implementing key pieces in C++.

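library(dplyr)

# %>% replaces the old %.% operator, which later dplyr releases removed
d1 %>%
  group_by(ExamenYear) %>%
  summarise(ALL = length(ExamenYear),
            participated = sum(participated == "yes"),
            ofwhichFemale = sum(StudentGender == "F"),
            ofWhichPassed = sum(passed == "yes"))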
