I am trying to sort 25 million integers, but when I use collect() I get a java.lang.OutOfMemoryError: Java heap space. Here is my source code:
sc = SparkContext("local", "pyspark")
numbers = sc.textFile("path of text file")
counts = numbers.flatMap(lambda x: x.split()).map(lambda x: (int(x), 1)).sortByKey(lambda x:x)
num_list = []
for (num, count) in counts.collect():
    num_list.append(num)
What am I doing wrong? The text file is 147 MB, all configurations are at their defaults, and I am using Spark v0.9.0.
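To rule out a logic error, I reproduced the same transformation chain on a tiny sample in plain Python (no Spark). The logic produces the correct sorted result, so the failure appears to be purely a memory issue:

```python
# Plain-Python equivalent of the Spark pipeline, run on a small
# in-memory sample to confirm the transformation logic is correct.
lines = ["3 1 2", "5 4"]

# flatMap(lambda x: x.split())
tokens = [tok for line in lines for tok in line.split()]

# map(lambda x: (int(x), 1))
pairs = [(int(tok), 1) for tok in tokens]

# sortByKey()
pairs.sort(key=lambda kv: kv[0])

num_list = [num for num, count in pairs]
print(num_list)  # → [1, 2, 3, 4, 5]
```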
EDIT: The file works with 2.5 million integers, but the problem starts at 5 million. I also tested with 10 million and hit the same OutOfMemoryError.
Here is the stack trace:
14/02/06 22:44:31 ERROR Executor: Exception in task ID 5
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2798)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:48)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
14/02/06 22:44:31 WARN TaskSetManager: Lost TID 5 (task 0.0:0)
14/02/06 22:44:31 WARN TaskSetManager: Loss was due to java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2798)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:28)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:48)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:223)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
14/02/06 22:44:31 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
14/02/06 22:44:31 INFO TaskSchedulerImpl: Remove TaskSet 0.0 from pool
14/02/06 22:44:31 INFO DAGScheduler: Failed to run collect at <ipython-input-7-cf9439751c70>:1