python - Writing Parquet output from a Hadoop Streaming job

Question

Is there a way to write text data to Parquet files from Hadoop Streaming using Python?

Basically, I have strings coming out of an IdentityMapper that I want to store as Parquet files.

Any input or examples would be really helpful.

score 1 · Accepted Answer

I suspect there's no built-in way of doing this with stock Hadoop Streaming (I couldn't find one). However, depending on your data sets, you may be able to use a third-party package such as

https://github.com/whale2/iow-hadoop-streaming

To generate Parquet from JSON, your streaming code would emit JSON, and together with an Avro schema you could write your Parquet output using ParquetAsJsonOutputFormat.
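As a rough illustration of the "streaming code emits JSON" half of that approach, here is a minimal mapper sketch. It assumes tab-separated input lines, and the field names `id` and `text` are hypothetical placeholders, not something the package requires. It only covers producing one JSON object per line on stdout; configuring the output format and schema on the job is left to the package's own documentation.

```python
#!/usr/bin/env python
import json
import sys


def line_to_json(line, field_names=("id", "text")):
    """Turn one tab-separated input line into a JSON object string.

    The field names are hypothetical; use whatever matches your schema.
    All values are kept as plain strings (a primitive type).
    """
    values = line.rstrip("\n").split("\t")
    record = dict(zip(field_names, values))
    return json.dumps(record, sort_keys=True)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds records on stdin and collects stdout.
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(line_to_json(line) + "\n")


if __name__ == "__main__":
    main()
```

You can check it locally before submitting a job by piping a sample file through the script and inspecting the JSON lines it prints.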

Please note that at this stage the package above has some limitations (for example, it can only handle primitive types).

Depending on the nature of your data, you may also want to experiment with the Kite SDK, as briefly explained here:

https://dwbigdata.wordpress.com/2016/01/31/json-to-parquet-conversion/

and here:

https://community.cloudera.com/t5/Kite-SDK-includes-Morphlines/JSON-to-Parquet/td-p/20630

Cheers
