I am using Parquet files to persist data from Spark DataFrames in Python.
The Parquet data appears to save correctly, but when it is read back into a DataFrame, df.show() fails with a traceback saying that a file in the Parquet directory is missing.
Strangely, running an ls command immediately after the error shows that the file is there.
Any ideas about what is going on?
The relevant parts of the ipynb are shown below in plain text.
In [12]:
# Persist this DataFrame as a Parquet file
confirmedSignalsDF.saveAsParquetFile("correctedDoppler.parquet")
In [13]:
# ls shows that the parquet directory and all data files have been properly created
ls -l correctedDoppler.parquet
total 932
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 0 Oct 5 18:22 _SUCCESS
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 1118 Oct 5 18:22 _common_metadata
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 41992 Oct 5 18:22 _metadata
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 36268 Oct 5 18:20 part-r-00001.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 36631 Oct 5 18:20 part-r-00002.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 34087 Oct 5 18:20 part-r-00003.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 20103 Oct 5 18:20 part-r-00004.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 22344 Oct 5 18:20 part-r-00005.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 13438 Oct 5 18:20 part-r-00006.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 14898 Oct 5 18:20 part-r-00007.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 20501 Oct 5 18:20 part-r-00008.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 21550 Oct 5 18:20 part-r-00009.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 24827 Oct 5 18:20 part-r-00010.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 18216 Oct 5 18:20 part-r-00011.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 19561 Oct 5 18:20 part-r-00012.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 22652 Oct 5 18:20 part-r-00013.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 6134 Oct 5 18:20 part-r-00014.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 4275 Oct 5 18:18 part-r-00015.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 7383 Oct 5 18:19 part-r-00016.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 5188 Oct 5 18:19 part-r-00017.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 26420 Oct 5 18:20 part-r-00018.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 22254 Oct 5 18:20 part-r-00019.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 28356 Oct 5 18:20 part-r-00020.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 15422 Oct 5 18:20 part-r-00021.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 17046 Oct 5 18:20 part-r-00022.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 19774 Oct 5 18:20 part-r-00023.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 28917 Oct 5 18:20 part-r-00024.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 24184 Oct 5 18:20 part-r-00025.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 21649 Oct 5 18:20 part-r-00026.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 22093 Oct 5 18:20 part-r-00027.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 19092 Oct 5 18:20 part-r-00028.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 16031 Oct 5 18:20 part-r-00029.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 10181 Oct 5 18:20 part-r-00030.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 8465 Oct 5 18:20 part-r-00031.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 10999 Oct 5 18:20 part-r-00032.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 11059 Oct 5 18:20 part-r-00033.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 21826 Oct 5 18:20 part-r-00034.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 21474 Oct 5 18:20 part-r-00035.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 28181 Oct 5 18:20 part-r-00036.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 30956 Oct 5 18:20 part-r-00037.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 40088 Oct 5 18:20 part-r-00038.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 23564 Oct 5 18:20 part-r-00039.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 24405 Oct 5 18:20 part-r-00040.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 4740 Oct 5 18:20 part-r-01021.parquet
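As an extra sanity check on the listing above (a hypothetical helper I added for debugging, not part of the original notebook), the part files can also be enumerated from Python to confirm that none of them are zero-length:

```python
import glob
import os


def empty_parquet_parts(directory):
    """Return the part-r-*.parquet files in `directory` that are zero bytes."""
    parts = sorted(glob.glob(os.path.join(directory, "part-r-*.parquet")))
    return [p for p in parts if os.path.getsize(p) == 0]


# Example usage against the directory saved above:
# empty_parquet_parts("correctedDoppler.parquet")  # expect an empty list
```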
In [14]:
# Load parquet file just saved and then count/show the DF
confirmedSignalsDF = sqlContext.parquetFile("correctedDoppler.parquet")
confirmedSignalsDF.count
confirmedSignalsDF.show(10)
# This results in a traceback error saying that one of the data files does not exist...
# But the ls command in the cell immediately below the error shows that this file does exist
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-6594c02048b7> in <module>()
2 confirmedSignalsDF = sqlContext.parquetFile("correctedDoppler.parquet")
3 confirmedSignalsDF.count
----> 4 confirmedSignalsDF.show(10)
/usr/local/src/spark/python/pyspark/sql/dataframe.py in show(self, n)
271 5 Bob
272 """
--> 273 print self._jdf.showString(n).encode('utf8', 'ignore')
274
275 def __repr__(self):
/usr/local/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/usr/local/src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o140.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 10 times, most recent failure: Lost task 0.9 in stage 9.0 (TID 4590, yp-spark-dal09-env5-0037):
java.io.FileNotFoundException: File file:/home/s26e-5a5fbda111ac17-5edfd8a0d95d/notebook/notebooks/correctedDoppler.parquet/part-r-00015.parquet does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1157)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:627)
at java.lang.Thread.run(Thread.java:801)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
In [15]:
# ls of the file said to be missing shows it is there
ls -l /home/s26e-5a5fbda111ac17-5edfd8a0d95d/notebook/notebooks/correctedDoppler.parquet/part-r-00015.parquet
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 4275 Oct 5 18:18 /home/s26e-5a5fbda111ac17-5edfd8a0d95d/notebook/notebooks/correctedDoppler.parquet/part-r-00015.parquet
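To rule out a truncated or corrupted file rather than a genuinely missing one, here is a small sketch (again a hypothetical helper, not from the notebook) that checks the "missing" file from Python: it verifies the path exists and that the file begins with the standard Parquet magic bytes `PAR1`:

```python
import os


def looks_like_parquet(path):
    """Check that `path` exists and begins with the Parquet magic bytes b'PAR1'."""
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        return f.read(4) == b"PAR1"


# Example usage against the file the traceback complains about:
# looks_like_parquet("correctedDoppler.parquet/part-r-00015.parquet")
```

If this returns True on the driver node while Spark still raises FileNotFoundException, that would suggest the file is visible where the check runs but not where the failing task runs.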