apache-spark - How to serialize a Dataset to a binary file/Parquet?

Question

How can I serialize a DataSet? Is there a way to use an Encoder to create a binary file, or do I have to convert it to a DataFrame and then save it as Parquet?

score 2 · Accepted Answer

How can I serialize a DataSet?

dataset.toDF().write.parquet("")

I believe it will automatically follow the schema used by the Dataset.
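As a rough illustration of that round trip, here is a minimal sketch. It assumes Spark 2.x or later (where `SparkSession` exists and `Dataset` also exposes `write` directly), a local master, a hypothetical `Person` case class, and a hypothetical output path; none of these come from the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Long)

object DatasetParquetSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the example.
    val spark = SparkSession.builder()
      .appName("dataset-parquet-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people: Dataset[Person] = Seq(Person("Ann", 31L), Person("Bob", 45L)).toDS()

    // Hypothetical output location; replace with your own path.
    val path = "/tmp/people.parquet"

    // Convert to a DataFrame and write it out; the Parquet schema is
    // derived from the fields of the case class.
    people.toDF().write.mode("overwrite").parquet(path)

    // Read the file back and recover the typed Dataset.
    val restored: Dataset[Person] = spark.read.parquet(path).as[Person]
    restored.show()

    spark.stop()
  }
}
```

Reading back with `.as[Person]` only works if the column names and types in the file line up with the case class, which is the case here because the same class produced the file.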

Is there a way to use an Encoder to create a binary file?

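Based on the source code of Encoder (for 1.6.0), it is designed to convert an input data source into a Dataset (to and from InternalRow, to be precise, but that is a very low-level detail). The default implementation matches every column of a DataFrame to a case class (in Scala), a tuple, or a primitive in order to produce the Dataset.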
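To make the column matching concrete, the sketch below derives an Encoder for a hypothetical `Event` case class and prints the schema it implies. It uses only the public `Encoders.product` factory and `Encoder.schema`, and it does not touch InternalRow, which is internal API.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical record type, used only for this illustration.
case class Event(id: Long, label: String)

object EncoderSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Derive an Encoder from the case class; no running session is needed.
    val encoder: Encoder[Event] = Encoders.product[Event]

    // Each case class field corresponds to one column in this schema.
    encoder.schema.printTreeString()
  }
}
```

The Encoder describes how objects map to columns and back; it is not a file writer, which is why saving still goes through `write`.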

score 1 · Accepted Answer

I think you are using Java or Scala, right? PySpark doesn't support Dataset yet. In my experience, the best you can do is save your data as a Parquet file in HDFS: I have noticed that reading it takes less time than reading other formats such as CSV.

Sorry for the digression, but I thought it was important. As you can see in the documentation of the Dataset class, there is no method for saving the data, so my suggestion is to call the toDF method on the Dataset and then use the write method of the resulting DataFrame. You can also use the DataFrameWriter final class and its parquet method.
