r - Understanding the difference between the results of dplyr group_by and tapply

Question

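I expected these two runs to give the same result, but the results differ. This makes me doubt whether I really understand how the dplyr code works (I have read almost everything I could find about dplyr in the package and online). Can anyone explain why the results differ, or how to get matching results?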

library(dplyr)
x <- iris
x <- x %>%
    group_by(Species, Sepal.Width) %>%
    summarise(freq = n()) %>%
    summarise(mean_by_group = mean(Sepal.Width))
print(x)

x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)

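Update: I don't think this is the most efficient way, but the following code gives results that match the tapply approach. Following Hadley's suggestion, I went through the results line by line, and this is the best I could come up with using dplyr.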

library(dplyr)
x <- iris
x <- x %>%
    group_by(Species, Sepal.Width) %>%
    summarise(freq = n()) %>%
    mutate(mean_by_group = sum(Sepal.Width * freq) / sum(freq))
print(x)
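
The weighted-mean formula above can be checked against the unweighted mean that the first attempt computes. The following is a minimal sketch, assuming a dplyr version that supports the `.groups` argument of `summarise()` (1.0.0 or later); the names `counts`, `per_species`, `unweighted`, and `weighted` are illustrative and do not come from the original post.

```r
library(dplyr)

# One row per distinct (Species, Sepal.Width) pair, with its frequency.
# "drop_last" keeps the result grouped by Species only.
counts <- iris %>%
  group_by(Species, Sepal.Width) %>%
  summarise(freq = n(), .groups = "drop_last")

# Per species: the plain mean of the distinct widths versus the
# frequency-weighted mean of those same widths.
per_species <- counts %>%
  summarise(
    unweighted = mean(Sepal.Width),
    weighted   = sum(Sepal.Width * freq) / sum(freq)
  )

print(per_species)

# Reference values from tapply, as a plain numeric vector
reference <- as.vector(tapply(iris$Sepal.Width, iris$Species, mean))
print(all.equal(per_species$weighted, reference))
```

The `weighted` column should agree with the `tapply` result, while `unweighted` averages only the distinct widths within each species, which is why the first attempt produced different numbers.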

Update: For some reason I thought I had to group by every variable I wanted to analyse. This is all I needed, and it is closer to the examples in the package.

x <- iris %>%
    group_by(Species) %>%
    summarise(Sepal.Width = mean(Sepal.Width))
print(x)

score 3 · Accepted Answer

Maybe this...

- `dplyr`:

require(dplyr)

iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))

  # Source: local data frame [3 x 2]
  #
  #      Species        mean_width
  # 1     setosa             3.428
  # 2 versicolor             2.770
  # 3  virginica             2.974

- `tapply`:

tapply(iris$Sepal.Width, iris$Species, mean)

  # setosa versicolor  virginica 
  # 3.428      2.770      2.974

NOTE: `tapply()` simplifies output by default whereas `summarise()` does not:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))

  # [1] "double"

With `simplify=FALSE` it returns a list instead:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))

  # [1] "list"

So to actually get the same type of output from `tapply()` you would need:

tbl_df( 
  data.frame( 
    mean_width = tapply( iris$Sepal.Width, 
                         iris$Species, 
                         mean )))

  # Source: local data frame [3 x 1]
  #
  #            mean_width
  # setosa          3.428
  # versicolor      2.770
  # virginica       2.974

And this still isn't the same! The species names are stored as row names (an attribute) here, not as a column of the data frame...
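
One way to close that remaining gap is to move the names of the `tapply()` result into a real column. This is a minimal sketch that uses only base R for the conversion; the names `by_dplyr`, `means`, and `by_tapply` are illustrative.

```r
library(dplyr)

# dplyr: one row per species, with Species as an ordinary column
by_dplyr <- iris %>%
  group_by(Species) %>%
  summarise(mean_width = mean(Sepal.Width))

# tapply: a named numeric array, so copy the names into a column
means <- tapply(iris$Sepal.Width, iris$Species, mean)
by_tapply <- data.frame(
  Species    = factor(names(means), levels = levels(iris$Species)),
  mean_width = as.vector(means)  # drop the names and dim attributes
)

print(by_tapply)

# Compare the two results column by column
print(identical(as.character(by_dplyr$Species), as.character(by_tapply$Species)))
print(all.equal(by_dplyr$mean_width, by_tapply$mean_width))
```

The comparison is done per column because `by_dplyr` is a tibble and `by_tapply` is a plain data frame, so the two objects differ in class even when their contents match.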
