r - Aggregate to multiple columns using reshape + cast

Question

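I have a data frame in R with columns for seat (factor), party (factor) and votes (numeric). I want to create a summary data frame with columns for seat, winner and vote share. For example, from the data frame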

df <- data.frame(party=rep(c('Lab','C','LD'),times=4),
                 votes=c(1,12,2,11,3,10,4,9,5,8,6,15),
                 seat=rep(c('A','B','C','D'),each=3))

I want to get the output

  seat winner voteshare
1    A      C 0.8000000
2    B    Lab 0.4583333
3    C      C 0.5000000
4    D     LD 0.5172414

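I can work out how to achieve this, but I am sure there must be a better way, probably a cunning one-liner using Hadley Wickham's reshape package. Any suggestions?

For what it's worth, my solution uses a function from my own package, djwutils_2.10.zip, and is called as shown below. There are all sorts of special cases it does not handle, though, so I would rather rely on someone else's code.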

aggregateList(df, by=list(seat=seat),
              FUN=list(winner=function(x) x$party[which.max(x$votes)],
                       voteshare=function(x) max(x$votes)/sum(x$votes)))

score 11 · Accepted Answer

Hadley's plyr package may help you:

library(plyr)
ddply(df, .(seat), function(x) data.frame(winner = x$party[which.max(x$votes)],
                                          voteshare = max(x$votes) / sum(x$votes)))
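For comparison, the same split-apply-combine idea can be sketched in base R without plyr. This is only an illustration under the question's example data; the names `pieces` and `summary_df` are made up for this sketch and do not come from the original answer.

```r
# Example data from the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Split the data frame by seat and summarise each piece as a one-row data frame
pieces <- lapply(split(df, df$seat), function(x) {
  data.frame(seat      = as.character(x$seat[1]),
             winner    = as.character(x$party[which.max(x$votes)]),
             voteshare = max(x$votes) / sum(x$votes))
})

# Bind the one-row pieces back into a single data frame
summary_df <- do.call(rbind, pieces)
rownames(summary_df) <- NULL
print(summary_df)
```

As with the ddply call, which.max picks the first maximum, so a tied seat would report only one of the tied parties.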

score 3 · Accepted Answer

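You may well be right that there is a cunning one-liner. I tend to favour the approach that understandable beats clever, especially when you are looking at something for the first time. Here is a more verbose alternative.

library(reshape)
votes_by_seat_and_party <- as.matrix(cast(df, seat ~ party, value="votes"))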

   C Lab LD
A 12   1  2
B  3  11 10
C  9   4  5
D  6   8 15

seats <- rownames(votes_by_seat_and_party)
parties <- colnames(votes_by_seat_and_party)

winner_col <- apply(votes_by_seat_and_party, 1, which.max)
winners <- parties[winner_col]
voteshare_of_winner_by_seat <- apply(votes_by_seat_and_party, 1, function(x) max(x) / sum(x))

results <- data.frame(seat = seats, winner = winners, voteshare = voteshare_of_winner_by_seat)

  seat winner voteshare
1    A      C 0.8000000
2    B    Lab 0.4583333
3    C      C 0.5000000
4    D     LD 0.5172414

# Full voteshare matrix, if you're interested
total_votes_by_seat <- rowSums(votes_by_seat_and_party)
voteshare_by_seat_and_party <- votes_by_seat_and_party / total_votes_by_seat
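If the reshape package is not available, a similar seat-by-party matrix can probably be built with base R alone. The sketch below swaps in `xtabs` and `prop.table`, which the answer itself does not use; the names `votes_matrix`, `share_matrix`, `winners` and `top_share` are assumptions made for this illustration.

```r
# Example data from the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Seats in rows, parties in columns; votes are summed within each cell
votes_matrix <- xtabs(votes ~ seat + party, data = df)

# Divide every row by its row total to get the full vote share matrix
share_matrix <- prop.table(votes_matrix, margin = 1)

# Winning party and winning share per seat
winners   <- colnames(share_matrix)[apply(share_matrix, 1, which.max)]
top_share <- unname(apply(share_matrix, 1, max))

print(data.frame(seat = rownames(share_matrix), winner = winners,
                 voteshare = top_share))
```

Because `xtabs` sums votes per cell, a seat and party combination that is absent from the data shows up as 0 rather than as a missing value.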

score 2 · Accepted Answer

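OK, so that makes 3 solutions... here is another, more compact one using raw R. It is 4 sparse lines of code. I am assuming missing values are either 0 or simply absent, because it will not matter. My guess is that this would be the fastest code for a large data set.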

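# total votes per seat, used as the denominator
s <- aggregate(df$votes, list(seat = df$seat), sum)
# highest vote count per seat
temp <- aggregate(df$votes, list(seat = df$seat), max)
# keep the rows whose votes equal the maximum of their own seat
res <- df[df$votes == temp$x[match(df$seat, temp$seat)], ]
# divide each winner's votes by the total for the same seat
res$votes <- res$votes / s$x[match(res$seat, s$seat)]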

Rename the columns if you like...

names(res) <- c('winner', 'voteshare', 'seat')

(This does not resolve ties: if two parties tie for the top vote in a seat, both rows end up in res.)
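To make the tie case concrete, here is a small sketch on made-up data (the data frame `tied` and the names `all_top` and `first_top` are hypothetical, not from the thread). It contrasts filtering against the per-seat maximum, which keeps every tied row, with `which.max`, which keeps only the first.

```r
# Hypothetical data with a tie between Lab and C in seat A
tied <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 2),
                   votes = c(5, 5, 2, 11, 3, 10),
                   seat  = rep(c('A', 'B'), each = 3))

# Comparing against the per-seat maximum keeps all tied rows
seat_max <- ave(tied$votes, tied$seat, FUN = max)
all_top  <- tied[tied$votes == seat_max, ]
print(all_top)

# which.max returns the first maximum only, so each seat yields exactly one row
first_top <- do.call(rbind, lapply(split(tied, tied$seat),
                                   function(x) x[which.max(x$votes), ]))
print(first_top)
```

Which behaviour is preferable depends on how ties should be reported; keeping all tied rows at least makes the tie visible.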
