json - Removing unwanted JSON fields using SPARK (SQL)

Question

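I'm a new Spark user, currently playing around with Spark and some big data, and I have a question about Spark SQL, or more formally the SchemaRDD. I'm reading a JSON file containing weather forecast data, and I'm not interested in all of the fields it has: I only need 10 of the 50+ fields returned for each record. Is there a way (similar to filter) to specify the names of the fields I want Spark to remove?

A small descriptive example: suppose I have a schema "Person" with 3 fields, "Name", "Age" and "Gender". I'm not interested in the "Age" field and would like to remove it. Can I use Spark to do that somehow? Thanks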

score 1 · Accepted Answer

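If you are using Spark 1.2, you can do the following (using Scala)...

If you already know which fields you will use, you can construct a schema for just those fields and apply it to the JSON dataset. Spark SQL will return a SchemaRDD, which you can then register and query as a table. Here is a snippet...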

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")

If you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will see only the name and gender fields.
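For readers on a later Spark release, the same schema-on-read idea can be sketched with the DataFrame reader. This is a rough sketch and not part of the original answer: it assumes Spark 2.2 or later run spark-shell style, a local master, and a hypothetical in-memory `jsonLines` dataset standing in for the JSON file. In those versions the type classes live in `org.apache.spark.sql.types`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: Spark 2.2+; local master chosen only so the sketch runs standalone.
val spark = SparkSession.builder().appName("json-schema-on-read").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the JSON file: one JSON object per line.
val jsonLines = Seq(
  """{"name": "foo", "age": 25, "gender": "M"}""",
  """{"name": "bar", "age": 24, "gender": "F"}"""
).toDS()

// Build a schema that lists only the wanted fields.
val schema = StructType(
  "name gender".split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))

// Fields that are absent from the schema (here: age) are skipped while reading.
val people = spark.read.schema(schema).json(jsonLines)
people.printSchema()
```

With a real file, the path would be passed to `json(...)` in place of `jsonLines`.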

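Alternatively, if you want to look at the dataset and decide which fields you need after seeing all of them, you can ask Spark SQL to infer the schema for you. You can then register the SchemaRDD as a table and use a projection to remove the unwanted fields. Here is a snippet...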

// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
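The projection above keeps fields by name. To remove a field by name instead, which is closer to what the question asks, later Spark versions offer `drop` on DataFrames. The following is a minimal sketch under assumptions that are not in the original answer: Spark 2.2 or later run spark-shell style, a local master, and a hypothetical in-memory `jsonLines` dataset in place of the JSON file.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+; local master chosen only so the sketch runs standalone.
val spark = SparkSession.builder().appName("drop-json-fields").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the JSON file: one JSON object per line.
val jsonLines = Seq(
  """{"name": "foo", "age": 25, "gender": "M"}""",
  """{"name": "bar", "age": 24, "gender": "F"}"""
).toDS()

// Let Spark infer the full schema (age, gender, name).
val people = spark.read.json(jsonLines)

// Option 1: name the unwanted field and drop it.
val withoutAge = people.drop("age")

// Option 2: keep only the wanted fields with a projection.
val nameAndGender = people.select("name", "gender")

withoutAge.printSchema()
nameAndGender.show()
```

`drop` is convenient when only a few fields are unwanted; with 10 wanted fields out of 50+, listing the wanted ones in `select` is probably shorter.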

score 0 · Accepted Answer

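You can specify which fields you want to have in the SchemaRDD. Below is an example. Create a case class with only the fields that you need. Read the data into an RDD, then pick only the fields that you need (in the same order as they are declared in the case class).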

Sample Data: People.txt
foo,25,M
bar,24,F

Code:

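// sc is an existing SparkContext; the import enables the implicit RDD-to-SchemaRDD conversion (Spark 1.2).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(name: String, gender: String)
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")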
