scala - Apache Spark: update rows of an RDD or Dataset based on another row

Question

I am trying to figure out how to update some rows based on another row.

For example, I have data like this:

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update the users in the same city so that they share the same groupId (either 1 or 2):

Id | username | ratings | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

How can I achieve this with an RDD or a Dataset?

For completeness, what if the Id is a String, so that dense rank will not work?

For example:

Id | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

So the result would look like this:

grade | username | ratings | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

score 2 · Accepted Answer

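A clean way to do this is to use dense_rank() from the Window functions. It enumerates the unique values in the column the Window is ordered by. Because city is a String column, the values increase in alphabetical order.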

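import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"city" column syntax (already in scope in spark-shell)

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Order the window by city; dense_rank() numbers the distinct cities 1, 2, ... without gaps.
// rank() would leave a gap after the tie (1, 1, 3 for this data).
val w = Window.orderBy($"city")
df.withColumn("id", dense_rank().over(w)).show()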

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+
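
For the String Id follow-up in the question, one possible approach (a sketch, not part of the accepted answer) is to partition the window by city and give every row the smallest Id in its city. The helper name withGroupId and the local SparkSession setup are assumptions made for illustration; in spark-shell the spark value already exists. The snippet is written in script style, as for spark-shell.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-id-sketch")
  .getOrCreate()
import spark.implicits._

val users = Seq(
  ("a", "philip", 2.0, "montreal"),
  ("b", "john", 4.0, "montreal"),
  ("c", "charles", 2.0, "texas")
).toDF("Id", "username", "rating", "city")

// Hypothetical helper: replace each row's Id with the smallest Id found in the same city.
// min() compares strings lexicographically, so this works for String and numeric Ids alike.
def withGroupId(df: DataFrame): DataFrame =
  df.withColumn("Id", min(col("Id")).over(Window.partitionBy(col("city"))))

withGroupId(users).show()
```

With the sample data, philip and john both end up with Id a and charles keeps c, which matches the result asked for in the question. Partitioning by city also avoids the unpartitioned window used with dense_rank(), which pulls all rows into a single partition.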
