r - Aggregate by ID and find the min() and max() of time

Question

I have a transaction database that looks like this:

   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68

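PaymentDate is in POSIX format. With this transaction data, I want to aggregate by ID rather than by time interval (which is well documented).

Applying min() to the POSIX times would give the first day and max() the last day, which is the information I need for every ID.

OK, here is what I tried: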

# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length

# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : number of items to replace is not a multiple of replacement length

# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Error in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : invalid type (list) for variable 'PaymentDate'

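Apparently aggregation does not work on POSIX times when I need to aggregate the times themselves rather than aggregate by time.

But surely it must be possible to get the first and last transaction date?!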

Answer

OK, I cannot answer my own question yet, so I am posting here:

Interesting. Thank you!

I usually use read.csv with the as.is=T option and then convert the times with strptime. So when I look at the structure of the data, I get:

$ PaymentDate    : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...

To me, that does not look like a factor. I can use min() and max() on the whole column and it works. Apparently POSIXlt is more troublesome than I thought. Coming from POSIXlt, I did

data$PaymentDate=as.Date(data$PaymentDate)

Looking at the structure, the class is now correctly set to Date:

$ PaymentDate    :Class 'Date'  num [1:10000] 18296 18296 18296 18297 18297 ...

Now it seems to work. However, only ddply returns the proper format "2020-01-25"; aggregate and summaryBy both return it in the form "18286". Is that the number of days since January 1, 1970? Well, I guess I can convert it back:

foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")

Some explanation would still be welcome, though. Also, ddply is very slow.
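On the "18286" question: a Date is stored internally as the number of days since 1970-01-01, so the number and the printed date are the same value with and without its class attribute. The following is only a small sketch of that round trip; the variable names d, n and back are made up for illustration.

```r
# A Date is a number of days since 1970-01-01 with a class attribute on top.
d <- as.Date("2020-01-25")

# Dropping the class exposes the underlying day count.
n <- as.numeric(d)

# Supplying the epoch as origin turns the bare number back into a Date.
back <- as.Date(n, origin = "1970-01-01")

print(n)
print(back)
```

Whether an aggregation function keeps the Date class or returns the bare number depends on how it combines the per-group results, so the behaviour may differ between functions and R versions.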

Oh, and why do I use strptime in the first place? The dates in the original file are in a different format, "%d-%m-%y". Using as.Date directly on that does not seem to work.
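For what it is worth, as.Date accepts a format argument, so a detour through strptime may not be needed. The sketch below assumes day-month-year strings with a four-digit year, as in the raw dput further down, which calls for "%d-%m-%Y" (capital Y); the sample values are taken from that dput.

```r
# Day-first strings with four-digit years, as they appear in the raw data.
raw <- c("04-02-2010", "25-01-2010")

# Passing the layout explicitly lets as.Date parse them directly.
# %Y is a four-digit year; %y would read only two digits.
dates <- as.Date(raw, format = "%d-%m-%Y")

print(dates)
class(dates)
```

Without a format, as.Date only tries ISO-style layouts, so a day-first string is likely to be misread or rejected.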

Edit

dput of my data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L), 
    mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L, 
    3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L, 
    7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L), 
    year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 
    110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L, 
    109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L, 
    109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L, 
    109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L, 
    4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L, 
    1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L, 
    2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L, 
    0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L, 
    273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L, 
    122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L), 
    isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID", 
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")
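The dput above shows PaymentDate stored as POSIXlt, which is a list of fields (sec, min, hour and so on) rather than a plain vector. That is consistent with the "invalid type (list)" error from aggregate. A minimal sketch, assuming this is the cause, converts the column to POSIXct (a plain numeric vector of seconds) before aggregating; the three sample rows are taken from the data above.

```r
# strptime returns POSIXlt, which is a list of date-time components.
raw <- c("04-02-2010", "25-01-2010", "01-01-2009")
lt <- strptime(raw, format = "%d-%m-%Y", tz = "GMT")
typeof(lt)   # "list"

# POSIXct stores the same instants as seconds since the epoch.
ct <- as.POSIXct(lt)
typeof(ct)   # "double"

df <- data.frame(AccountID = c(17L, 17L, 359L))
df$PaymentDate <- ct

# With an atomic column, the formula interface of aggregate can be used.
res <- aggregate(PaymentDate ~ AccountID, df, min)
print(res)
```

Converting with as.Date instead of as.POSIXct works the same way when only the day matters.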

dput after doing what you suggested:

structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277, 
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550, 
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823, 
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386, 
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540, 
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L, 
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L, 
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L, 
52L, 53L, 54L, 55L, 40L), class = "data.frame")

dput of the raw data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010", 
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010", 
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009", 
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009", 
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009", 
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010", 
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009", 
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010", 
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")

score 1 · Accepted Answer

The problem is that your data, specifically the PaymentDate column, is a factor. If you convert the PaymentDate column first, both the ddply solution and the aggregate solution work as written.

#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09", 
            "2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1, 
            4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
    ), class = "data.frame", row.names = c("8", "9", "10", "11"))

Change the variable class to Date:

df$PaymentDate <- as.Date(df$PaymentDate)

Then run your original code. Using ddply:

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

Using aggregate:

aggregate(PaymentDate ~ AccountID, df, min)
  AccountID PaymentDate
1        13  2020-01-25
2        14  2020-01-01

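There is another, more general way to avoid this problem. By default (in R versions before 4.0.0), when you create a data.frame with read.table (or one of its variants such as read.csv), the stringsAsFactors parameter is set to TRUE. If you re-create the data with stringsAsFactors=FALSE, the intermediate step of converting PaymentDate is not needed, and the code works as follows: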

dat <- "   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68 "

df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01
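The calls above return only the first date per account. To get the first and last date in one table, one possible base-R sketch splits the data by AccountID and builds one row per group. The column names FirstPayment and LastPayment are made up for illustration, and PaymentDate is assumed to have been converted to Date already.

```r
# Sample data from the question, with PaymentDate already of class Date.
df <- data.frame(
  AccountID   = c(13L, 13L, 14L, 14L),
  PaymentDate = as.Date(c("2020-02-09", "2020-01-25", "2020-01-01", "2020-02-01"))
)

# One small data.frame per account holding its earliest and latest date.
per_id <- lapply(split(df, df$AccountID), function(g) {
  data.frame(
    AccountID    = g$AccountID[1],
    FirstPayment = min(g$PaymentDate),
    LastPayment  = max(g$PaymentDate)
  )
})

# Stack the per-account rows; rbind keeps the Date class of the columns.
result <- do.call(rbind, per_id)
rownames(result) <- NULL
print(result)
```

Building the rows as data.frames and binding them avoids the step where some aggregation helpers flatten dates into plain numbers.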
