r - Summarizing and ranking a data frame

Question

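Using R, I need to build a report of the top 2 employees with the most expenses in each department, plus an "Others" row covering the rest of the employees in that department. For example, I need a report like this: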

Dept.       EmployeeId    Expense
Marketing        12345        100
Marketing        12346         90
Marketing       Others        200
Sales            12347         50 <-- There's just one employee with expenses
Research         12348       2000
Research         12349        900
Research        Others      10000

In other words, I need to summarize the data, focusing on the top 2 employees with the most expenses. The Expense column must add up to the company's total expenses.

employeIds <- sample(1000:9999, 20)
depts <- sample(c('Sales', 'Marketing', 'Research'), 20, replace = TRUE)
expenses <- sample(1:1000, 20, replace = TRUE)

df <- data.frame(employeIds, depts, expenses)

# Based on that data, how do I build a table with the top 2 employees with the most expenses in each department, including an "Others" row per department?

I'm new to R and don't know how to approach this. In SQL I could have used the RANK() function and a JOIN, but that's not an option here.

score 2 · Accepted Answer

It may not be the most elegant, but it is a solution:

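func <- function(data) {
 # Total expenses per employee within this department
 data1 <- aggregate(data$expenses, list(employeIds=data$employeIds), sum)
 # Rank the negated totals so that ranks 1 and 2 are the two largest spenders.
 # ties.method = "first" is required: by default tied values share an average
 # rank such as 1.5, which never matches 1:2
 data1$employeIds[!(rank(-data1$x, ties.method="first") %in% 1:2)] <- 'Others'
 # Collapse all 'Others' rows into one and return the result
 aggregate(data.frame(expenses=data1$x), list(employeIds=data1$employeIds), sum)
}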

do.call(rbind, by(df, df$depts, func))
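The tie problem mentioned in the code comment can be seen on a small vector. This is only an illustrative sketch with made-up values; the names `x`, `avg_ranks` and `first_ranks` are not part of the answer. The vector is negated so that the largest value gets rank 1.

```r
# Three hypothetical expense totals in one department; two of them are tied.
x <- c(500, 500, 100)

# Default ties.method = "average": the tied values share rank 1.5.
avg_ranks <- rank(-x)

# ties.method = "first": ties are broken by position, giving distinct ranks.
first_ranks <- rank(-x, ties.method = "first")

print(avg_ranks)             # 1.5 1.5 3.0
print(first_ranks)           # 1 2 3

# With average ranks no employee matches 1:2, so everyone would become 'Others'.
print(avg_ranks %in% 1:2)    # FALSE FALSE FALSE
print(first_ranks %in% 1:2)  # TRUE TRUE FALSE
```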

score 2 · Accepted Answer

Another approach, using data.table (perhaps closer to the SQL style you know):

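library(data.table)

dt <- data.table(employeIds, depts, expenses)
# Rank within each department (largest expense first), then sum by department
# and by employee ID for the top 2 or "Other" for everyone else.
# ties.method="first" keeps ranks distinct when two expenses are equal.
dt[, rank:=rank(-expenses, ties.method="first"), by=depts][,
    list("Expenses"=sum(expenses)),
    keyby=list(depts, "Employee"=ifelse(rank<=2,employeIds,"Other"))
]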
       depts Employee Expenses
1: Marketing     6988      986
2: Marketing     7011      940
3: Marketing    Other     2614
4:  Research     2434      763
5:  Research     9852      731
6:  Research    Other     3397
7:     Sales     3120      581
8:     Sales     6069      868
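For readers coming from SQL, the per-department ranking step can also be sketched in base R with `ave()`, which plays roughly the role of `OVER (PARTITION BY depts ORDER BY expenses DESC)`. The data below is made up for illustration and the column name `rnk` is an assumption, not something from the answer. Note that `ties.method = "first"` behaves like SQL's `ROW_NUMBER()`; `ties.method = "min"` would be the closer match to `RANK()`.

```r
# Hypothetical fixed data, not the question's random sample.
df <- data.frame(
  employeIds = c(1001, 1002, 1003, 1004, 1005),
  depts      = c("Marketing", "Marketing", "Marketing", "Sales", "Sales"),
  expenses   = c(100, 250, 100, 50, 70)
)

# ave() applies FUN to each department's values and returns the results
# in the original row order. Negating puts the largest expense at rank 1.
df$rnk <- ave(-df$expenses, df$depts,
              FUN = function(v) rank(v, ties.method = "first"))

print(df)
```

Rows with `rnk <= 2` are the ones that keep their own employee ID; the remaining rows would be pooled into the "Other" row.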

score 1 · Accepted Answer

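# Split into one data frame per department
df <- split(df, df$depts)
df <- lapply(df, FUN=function(x){
  # Sort by expenses, largest first
  x <- x[order(x$expenses, decreasing=TRUE), ]
  # Department total, repeated on every row
  x$total.expenses <- sum(x$expenses)
  # Position within the department: 1 and 2 for the top two, "Other" for the rest
  x$group <- seq_len(nrow(x))
  x$group <- ifelse(x$group <= 2, x$group, "Other")
  x
})
# Recombine; this labels each row but does not yet sum the "Other" rows
df <- do.call(rbind, df)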
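The code above stops at labelling the rows. One possible way to get from there to the requested report is sketched below, under the assumption that each employee appears on exactly one row. The data is made up, and the names `labelled`, `report` and the `Employee` column are illustrative only.

```r
# Hypothetical fixed data (one row per employee), not the question's random sample.
df <- data.frame(
  employeIds = c(1001, 1002, 1003, 1004, 1005, 1006),
  depts      = c("Marketing", "Marketing", "Marketing", "Marketing",
                 "Sales", "Research"),
  expenses   = c(100, 90, 120, 80, 50, 2000)
)

# Label each row with its own employee ID (top 2 per department) or "Others".
labelled <- do.call(rbind, lapply(split(df, df$depts), function(x) {
  x <- x[order(x$expenses, decreasing = TRUE), ]
  x$Employee <- ifelse(seq_len(nrow(x)) <= 2,
                       as.character(x$employeIds), "Others")
  x
}))

# One row per department and label; the "Others" rows are summed together.
report <- aggregate(expenses ~ depts + Employee, data = labelled, FUN = sum)
report <- report[order(report$depts, report$Employee), ]

# The report must account for every expense in the original data.
stopifnot(sum(report$expenses) == sum(df$expenses))

print(report, row.names = FALSE)
```

The final `stopifnot()` checks the requirement from the question that the expense column adds up to the company total.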
