As per my requirements, I want to save files that are on HDFS into a Hive table stored in ORC format. I am using Spark 1.2.1 with Hive 0.14.0.
I am following this document: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_spark-quickstart/content/ch_orc-spark-quickstart.html
Everything went fine; I don't see any exceptions in the spark shell.
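For reference, the shell setup follows the quickstart. A minimal sketch of what is run first (assuming the standard spark-shell, where sc is already defined):

// HiveContext setup in the Spark 1.2.1 shell (sketch, per the quickstart)
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)  // sc is the SparkContext provided by spark-shell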
I created one ORC table in Hive as follows:
hiveContext.sql("create table person_orc_table (name STRING, age INT) stored as orc")
I can see the results of the SELECT query as shown below:
scala> hiveContext.sql("SELECT * from morePeople").collect.foreach(println)
15/08/14 09:25:06 INFO ParseDriver: Parsing command: SELECT * from morePeople
15/08/14 09:25:06 INFO ParseDriver: Parse Completed
15/08/14 09:25:06 INFO OrcFileOperator: Qualified file list:
15/08/14 09:25:06 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-0-1439544199994.orc
15/08/14 09:25:06 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-1-1439544200299.orc
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(278167) called with curMem=965233, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 271.6 KB, free 264.2 MB)
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(42885) called with curMem=1243400, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 41.9 KB, free 264.2 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on sandbox.hortonworks.com:43599 (size: 41.9 KB, free: 265.2 MB)
15/08/14 09:25:06 INFO BlockManagerMaster: Updated info of block broadcast_6_piece0
15/08/14 09:25:06 INFO DefaultExecutionContext: Created broadcast 6 from hadoopRDD at OrcTableOperations.scala:228
15/08/14 09:25:06 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:25:06 INFO deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
15/08/14 09:25:06 INFO OrcInputFormat: FooterCacheHitRatio: 0/2
15/08/14 09:25:06 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1439544306469 end=1439544306486 duration=17 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:25:06 INFO DefaultExecutionContext: Starting job: collect at SparkPlan.scala:84
15/08/14 09:25:06 INFO DAGScheduler: Got job 3 (collect at SparkPlan.scala:84) with 2 output partitions (allowLocal=false)
15/08/14 09:25:06 INFO DAGScheduler: Final stage: Stage 3(collect at SparkPlan.scala:84)
15/08/14 09:25:06 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:25:06 INFO DAGScheduler: Missing parents: List()
15/08/14 09:25:06 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[32] at map at SparkPlan.scala:84), which has no missing parents
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(72088) called with curMem=1286285, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 70.4 KB, free 264.1 MB)
15/08/14 09:25:06 INFO MemoryStore: ensureFreeSpace(46036) called with curMem=1358373, maxMem=278302556
15/08/14 09:25:06 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.1 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:25:06 INFO BlockManagerMaster: Updated info of block broadcast_7_piece0
15/08/14 09:25:06 INFO DefaultExecutionContext: Created broadcast 7 from broadcast at DAGScheduler.scala:838
15/08/14 09:25:06 INFO DAGScheduler: Submitting 2 missing tasks from Stage 3 (MappedRDD[32] at map at SparkPlan.scala:84)
15/08/14 09:25:06 INFO YarnClientClusterScheduler: Adding task set 3.0 with 2 tasks
15/08/14 09:25:06 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:25:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on sandbox.hortonworks.com:59036 (size: 41.9 KB, free: 265.3 MB)
15/08/14 09:25:06 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 7, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:25:06 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 6) in 311 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:25:07 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 7) in 119 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:25:07 INFO YarnClientClusterScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
[Michael,29]
[Andy,30]
[Justin,19]
scala> 15/08/14 09:25:07 INFO DAGScheduler: Stage 3 (collect at SparkPlan.scala:84) finished in 0.427 s
15/08/14 09:25:07 INFO DAGScheduler: Job 3 finished: collect at SparkPlan.scala:84, took 0.504132 s
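For context, the peopleSchemaRDD used below was built from a small comma-separated people.txt in HDFS, roughly as in the quickstart. This is a sketch; the case class, file path, and parsing are my assumptions from the tutorial, not a verbatim copy of my session:

// Sketch: build a SchemaRDD from a people.txt with one "name,age" record per line
import hiveContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion (Spark 1.2.x)

case class Person(name: String, age: Int)

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

val peopleSchemaRDD: org.apache.spark.sql.SchemaRDD = people  // implicit conversion kicks in here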
Saving to the ORC table also worked fine:
scala> peopleSchemaRDD.saveAsOrcFile("person_orc_table")
15/08/14 09:28:20 INFO DefaultExecutionContext: Starting job: runJob at OrcTableOperations.scala:154
15/08/14 09:28:20 INFO DAGScheduler: Got job 4 (runJob at OrcTableOperations.scala:154) with 2 output partitions (allowLocal=false)
15/08/14 09:28:20 INFO DAGScheduler: Final stage: Stage 4(runJob at OrcTableOperations.scala:154)
15/08/14 09:28:20 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:28:20 INFO DAGScheduler: Missing parents: List()
15/08/14 09:28:20 INFO DAGScheduler: Submitting Stage 4 (MapPartitionsRDD[35] at mapPartitions at OrcTableOperations.scala:70), which has no missing parents
15/08/14 09:28:20 INFO MemoryStore: ensureFreeSpace(72048) called with curMem=965233, maxMem=278302556
15/08/14 09:28:20 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 70.4 KB, free 264.4 MB)
15/08/14 09:28:20 INFO MemoryStore: ensureFreeSpace(46093) called with curMem=1037281, maxMem=278302556
15/08/14 09:28:20 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.4 MB)
15/08/14 09:28:20 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:28:20 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/08/14 09:28:20 INFO DefaultExecutionContext: Created broadcast 8 from broadcast at DAGScheduler.scala:838
15/08/14 09:28:20 INFO DAGScheduler: Submitting 2 missing tasks from Stage 4 (MapPartitionsRDD[35] at mapPartitions at OrcTableOperations.scala:70)
15/08/14 09:28:20 INFO YarnClientClusterScheduler: Adding task set 4.0 with 2 tasks
15/08/14 09:28:20 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 8, sandbox.hortonworks.com, NODE_LOCAL, 1314 bytes)
15/08/14 09:28:20 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:28:21 INFO TaskSetManager: Starting task 1.0 in stage 4.0 (TID 9, sandbox.hortonworks.com, NODE_LOCAL, 1314 bytes)
15/08/14 09:28:21 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 8) in 503 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:28:21 INFO TaskSetManager: Finished task 1.0 in stage 4.0 (TID 9) in 69 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:28:21 INFO DAGScheduler: Stage 4 (runJob at OrcTableOperations.scala:154) finished in 0.570 s
15/08/14 09:28:21 INFO YarnClientClusterScheduler: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/08/14 09:28:21 INFO DAGScheduler: Job 4 finished: runJob at OrcTableOperations.scala:154, took 0.615483 s
scala> 15/08/14 09:28:35 INFO BlockManager: Removing broadcast 8
15/08/14 09:28:35 INFO BlockManager: Removing block broadcast_8
15/08/14 09:28:35 INFO MemoryStore: Block broadcast_8 of size 72048 dropped from memory (free 277291230)
15/08/14 09:28:35 INFO BlockManager: Removing block broadcast_8_piece0
15/08/14 09:28:35 INFO MemoryStore: Block broadcast_8_piece0 of size 46093 dropped from memory (free 277337323)
15/08/14 09:28:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on sandbox.hortonworks.com:43599 in memory (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:28:35 INFO BlockManagerMaster: Updated info of block broadcast_8_piece0
15/08/14 09:28:35 INFO BlockManagerInfo: Removed broadcast_8_piece0 on sandbox.hortonworks.com:59036 in memory (size: 45.0 KB, free: 265.4 MB)
15/08/14 09:28:35 INFO ContextCleaner: Cleaned broadcast 8
I can also read the ORC data back as follows:
val morePeople = hiveContext.orcFile("person_orc_table")
morePeople.registerTempTable("morePeople")
scala> hiveContext.sql("SELECT * from morePeople").collect.foreach(println)
15/08/14 09:33:32 INFO ParseDriver: Parsing command: SELECT * from morePeople
15/08/14 09:33:32 INFO ParseDriver: Parse Completed
15/08/14 09:33:32 INFO OrcFileOperator: Qualified file list:
15/08/14 09:33:32 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-0-1439544199994.orc
15/08/14 09:33:32 INFO OrcFileOperator: hdfs://sandbox.hortonworks.com:8020/user/root/people.orc/part-r-1-1439544200299.orc
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(278167) called with curMem=965233, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_11 stored as values in memory (estimated size 271.6 KB, free 264.2 MB)
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(42885) called with curMem=1243400, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_11_piece0 stored as bytes in memory (estimated size 41.9 KB, free 264.2 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on sandbox.hortonworks.com:43599 (size: 41.9 KB, free: 265.2 MB)
15/08/14 09:33:32 INFO BlockManagerMaster: Updated info of block broadcast_11_piece0
15/08/14 09:33:32 INFO DefaultExecutionContext: Created broadcast 11 from hadoopRDD at OrcTableOperations.scala:228
15/08/14 09:33:32 INFO PerfLogger: <PERFLOG method=OrcGetSplits from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:33:32 INFO OrcInputFormat: FooterCacheHitRatio: 0/2
15/08/14 09:33:32 INFO PerfLogger: </PERFLOG method=OrcGetSplits start=1439544812311 end=1439544812318 duration=7 from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl>
15/08/14 09:33:32 INFO DefaultExecutionContext: Starting job: collect at SparkPlan.scala:84
15/08/14 09:33:32 INFO DAGScheduler: Got job 6 (collect at SparkPlan.scala:84) with 2 output partitions (allowLocal=false)
15/08/14 09:33:32 INFO DAGScheduler: Final stage: Stage 6(collect at SparkPlan.scala:84)
15/08/14 09:33:32 INFO DAGScheduler: Parents of final stage: List()
15/08/14 09:33:32 INFO DAGScheduler: Missing parents: List()
15/08/14 09:33:32 INFO DAGScheduler: Submitting Stage 6 (MappedRDD[48] at map at SparkPlan.scala:84), which has no missing parents
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(72088) called with curMem=1286285, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 70.4 KB, free 264.1 MB)
15/08/14 09:33:32 INFO MemoryStore: ensureFreeSpace(46036) called with curMem=1358373, maxMem=278302556
15/08/14 09:33:32 INFO MemoryStore: Block broadcast_12_piece0 stored as bytes in memory (estimated size 45.0 KB, free 264.1 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on sandbox.hortonworks.com:43599 (size: 45.0 KB, free: 265.2 MB)
15/08/14 09:33:32 INFO BlockManagerMaster: Updated info of block broadcast_12_piece0
15/08/14 09:33:32 INFO DefaultExecutionContext: Created broadcast 12 from broadcast at DAGScheduler.scala:838
15/08/14 09:33:32 INFO DAGScheduler: Submitting 2 missing tasks from Stage 6 (MappedRDD[48] at map at SparkPlan.scala:84)
15/08/14 09:33:32 INFO YarnClientClusterScheduler: Adding task set 6.0 with 2 tasks
15/08/14 09:33:32 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 12, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_12_piece0 in memory on sandbox.hortonworks.com:59036 (size: 45.0 KB, free: 265.3 MB)
15/08/14 09:33:32 INFO BlockManagerInfo: Added broadcast_11_piece0 in memory on sandbox.hortonworks.com:59036 (size: 41.9 KB, free: 265.3 MB)
15/08/14 09:33:32 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 13, sandbox.hortonworks.com, NODE_LOCAL, 1366 bytes)
15/08/14 09:33:32 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 12) in 153 ms on sandbox.hortonworks.com (1/2)
15/08/14 09:33:32 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 13) in 106 ms on sandbox.hortonworks.com (2/2)
15/08/14 09:33:32 INFO YarnClientClusterScheduler: Removed TaskSet 6.0, whose tasks have all completed, from pool
15/08/14 09:33:32 INFO DAGScheduler: Stage 6 (collect at SparkPlan.scala:84) finished in 0.255 s
[Michael,29]
[Andy,30]
[Justin,19]
However, when I query the table from the Hive shell, no records are returned:
hive> select * from person_orc_table;
OK
Time taken: 0.097 seconds
hive>
I expected the data/records to be in the Hive table, but they are not there. What am I missing here?
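In case it helps, one way to check where the table's data actually lives (and compare it with the path that saveAsOrcFile wrote to) would be something like the following, run from the same Spark shell; this is only a sketch and I have not included its output:

// Print the table's metadata (Location, InputFormat, etc.) via the HiveContext
hiveContext.sql("describe formatted person_orc_table").collect.foreach(println)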