In SparkR I have a DataFrame, data, that contains the columns id, amount_spent and amount_won.

For example, for id=1 we have

head(filter(data, data$id==1))

and the output is

1 30 10
1 40 100
1 22 80
1 14 2

I want to know whether a given id has more wins than losses. The amounts themselves can be ignored.

In R I can make it run, but it takes time. Say we have 100 ids. In R I have done this:

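w <- c()
for (j in 1:100) {
  # Collect the rows for a fixed id into a local data.frame
  q <- collect(filter(data, data$id == j))
  # Per-row difference: positive means a win, negative means a loss
  d <- as.numeric(q$amount_won) - as.numeric(q$amount_spent)
  # 1 if the id has more wins than losses, 0 otherwise
  if (sum(d > 0) > sum(d < 0)) {
    w[j] <- 1
  } else {
    w[j] <- 0
  }
}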

Now w simply gives me 1s and 0s for all the ids. In SparkR I want to do this in a faster way.

1 Answer

I'm not sure whether this is exactly what you want, so feel free to ask for adjustments.

df <- data.frame(id = c(1,1,1,1),
                 amount_spent = c(30,40,22,14),
                 amount_won = c(10,100,80,2))

DF <- createDataFrame(sqlContext, df)
DF <- withColumn(DF, "won", DF$amount_won > DF$amount_spent)
DF$won <- cast(DF$won, "integer")

grouped <- groupBy(DF, DF$id)
aggregated <- agg(grouped, total_won = sum(DF$won), total_games = n(DF$won))

result <- withColumn(aggregated, "percentage_won" , aggregated$total_won/aggregated$total_games)

collect(result)

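I added a column to DF that indicates whether the id won more than it spent on that row. For each id, the result gives the number of games played, the number of games won, and the percentage of games won.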
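To get the 1/0 flag per id that the question asks for, the share of winning rows can be compared against 0.5. The following is a minimal local sketch in base R (no Spark needed) that can be used to sanity-check the logic on a small sample. The rows for id = 2 are hypothetical and added only for contrast, and treating a tie as 0 is an assumption.

```r
# Rows for id = 1 come from the question; rows for id = 2 are hypothetical
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2),
                 amount_spent = c(30, 40, 22, 14, 10, 20, 30),
                 amount_won = c(10, 100, 80, 2, 15, 50, 1))

# 1 if the row is a win (won more than spent), 0 otherwise
df$won <- as.integer(df$amount_won > df$amount_spent)

# Fraction of winning rows per id
percentage_won <- tapply(df$won, df$id, mean)

# 1 if an id has strictly more wins than losses, 0 otherwise (ties give 0)
w <- as.integer(percentage_won > 0.5)
names(w) <- names(percentage_won)
print(w)
```

Here id = 1 has two wins and two losses, so its flag is 0, while the hypothetical id = 2 has two wins and one loss, so its flag is 1. The same threshold could presumably be applied to the percentage_won column of the SparkR result before collecting it.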

answered 2015-09-08T14:21:15.670