r - data.table が 2 つの列で適切に要約されていない

Question

data.table私は最近私を夢中にさせているこの問題を抱えています。バグのように見えますが、ここで明らかな何かが欠けている可能性があります。

次のデータフレームがあります。

# First some data
data <- data.table(structure(list(
  month = structure(c(1356998400, 1356998400, 1356998400, 
                      1359676800, 1354320000, 1359676800, 1359676800, 1356998400, 1356998400, 
                      1354320000, 1354320000, 1354320000, 1359676800, 1359676800, 1359676800, 
                      1356998400, 1359676800, 1359676800, 1356998400, 1359676800, 1359676800, 
                      1359676800, 1359676800, 1354320000, 1354320000), class = c("POSIXct", 
                                                                                 "POSIXt"), tzone = "UTC"), 
  portal = c(TRUE, TRUE, FALSE, TRUE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
             TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE
  ), 
  satisfaction = c(10L, 10L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 
                   9L, 2L, 8L, 10L, 9L, 10L, 10L, 9L, 10L, 10L, 10L, 9L, 10L, 9L, 
                   10L, 10L)), 
                  .Names = c("month", "portal", "satisfaction"), 
                  row.names = c(NA, -25L), class = "data.frame"))

とでまとめたいと思いportalますmonth。古き良きtapply作品を期待どおりに要約すると、2012 年 12 月と 2013 年 1 月から 2 月の結果を含む 3x2 マトリックスが得られます。

> tapply(data$satisfaction, list(data$month, data$portal), mean)
           FALSE      TRUE
2012-12-01   8.5  8.000000
2013-01-01  10.0 10.000000
2013-02-01   9.0  9.545455

byの引数を使用した要約は、次のことをdata.table行いません。

> data[, mean(satisfaction), by = 'month,portal']
   month      portal        V1
1: 2013-01-01  FALSE 10.000000
2: 2013-02-01   TRUE  9.000000
3: 2013-01-01   TRUE 10.000000
4: 2012-12-01  FALSE  8.500000
5: 2012-12-01   TRUE  7.333333
6: 2013-02-01   TRUE  9.666667
7: 2013-02-01  FALSE  9.000000
8: 2012-12-01   TRUE 10.000000

ご覧のとおり、期待どおりの6 個ではなく、 8 個の値を持つデータテーブルが返されます。たとえば、とが重複している値。portal == TRUEmonth == 2012-02-01

興味深いことに、これを 2013 年のデータだけに制限すると、すべてが期待どおりに機能します。

> data[month >= ymd(20130101), mean(satisfaction), by = 'month,portal']
        month portal        V1
1: 2013-01-01   TRUE 10.000000
2: 2013-01-01  FALSE 10.000000
3: 2013-02-01   TRUE  9.545455
4: 2013-02-01  FALSE  9.000000

私は信じられないほど困惑しています:)。誰か助けてくれませんか？

score 8 · Accepted Answer

これは、data.table 1.8.7 で解決された既知の問題です (この記事の執筆時点ではまだ CRAN では解決されていません)。

data.table NEWSから:

BUG FIXES

    <...>

o   setkey could sort 'double' columns (such as POSIXct) incorrectly when not the
    last column of the key, #2484. In data.table's C code :
        x[a] > x[b]-tol
    should have been :
        x[a]-x[b] > -tol  [or  x[b]-x[a] < tol ]
    The difference may have been machine/compiler dependent. Many thanks to statquant
    for the short reproducible example. Test added.

を使用して 1.8.7 に更新するとinstall.packages("data.table", repos="http://R-Forge.R-project.org")、すべてが期待どおりに機能します。

score 5 · Accepted Answer

問題は並べ替えにあるようです。ロードdataして実行するとsetkey：

setkey(data, "month", "portal")

# > data
#          month portal satisfaction
#  1: 2012-12-01   TRUE           10
#  2: 2012-12-01  FALSE            9
#  3: 2012-12-01  FALSE            8
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01   TRUE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01  FALSE           10
# 14: 2013-02-01   TRUE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01  FALSE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

portal列が正しくソートされていないことがわかります。setkeyもう一度やると、

setkey(data, "month", "portal")

# I get this warning message:
Warning message:
In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. 
  If you didn't go under the hood please let datatable-help know so 
  the root cause can be fixed.

これで、data列はキー列によって適切にソートされたように見えます。

# > data
#          month portal satisfaction
#  1: 2012-12-01  FALSE            9
#  2: 2012-12-01  FALSE            8
#  3: 2012-12-01   TRUE           10
#  4: 2012-12-01   TRUE            2
#  5: 2012-12-01   TRUE           10
#  6: 2012-12-01   TRUE           10
#  7: 2013-01-01  FALSE           10
#  8: 2013-01-01   TRUE           10
#  9: 2013-01-01   TRUE           10
# 10: 2013-01-01   TRUE           10
# 11: 2013-01-01   TRUE           10
# 12: 2013-01-01   TRUE           10
# 13: 2013-01-01   TRUE           10
# 14: 2013-02-01  FALSE            9
# 15: 2013-02-01   TRUE            9
# 16: 2013-02-01   TRUE            9
# 17: 2013-02-01   TRUE           10
# 18: 2013-02-01   TRUE           10
# 19: 2013-02-01   TRUE           10
# 20: 2013-02-01   TRUE           10
# 21: 2013-02-01   TRUE           10
# 22: 2013-02-01   TRUE            9
# 23: 2013-02-01   TRUE           10
# 24: 2013-02-01   TRUE            9
# 25: 2013-02-01   TRUE            9
#          month portal satisfaction

それで、それはPOSIXct + logicalタイプのソートに問題があるようですか？

data[, mean(satisfaction), by=list(month, portal)]

#         month portal        V1
# 1: 2012-12-01  FALSE  8.500000
# 2: 2012-12-01   TRUE  8.000000
# 3: 2013-01-01  FALSE 10.000000
# 4: 2013-01-01   TRUE 10.000000
# 5: 2013-02-01  FALSE  9.000000
# 6: 2013-02-01   TRUE  9.545455

したがって、バグがあると思います。

r - data.table が 2 つの列で適切に要約されていない

2 に答える 2

Related

Reference