Not Serializable exception when integrating Spark SQL and Spark Streaming
My source code:
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
        JavaSparkContext jc = new JavaSparkContext(sparkConf);
        JavaStreamingContext jssc = new JavaStreamingContext(jc, new Duration(2000));
        jssc.addStreamingListener(new WorkCountMonitor());

        // args: [0] ZooKeeper quorum, [1] consumer group, [2] comma-separated topics, [3] threads per topic
        int numThreads = Integer.parseInt(args[3]);
        Map<String, Integer> topicMap = new HashMap<String, Integer>();
        String[] topics = args[2].split(",");
        for (String topic : topics) {
            topicMap.put(topic, numThreads);
        }

        JavaPairReceiverInputDStream<String, String> data =
                KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
        data.print();

        // Parse each "name,age" message into a Person bean
        JavaDStream<Person> streamData = data.map(new Function<Tuple2<String, String>, Person>() {
            public Person call(Tuple2<String, String> v1) throws Exception {
                String[] stringArray = v1._2.split(",");
                Person person = new Person();
                person.setName(stringArray[0]);
                person.setAge(stringArray[1]);
                return person;
            }
        });

        final JavaSQLContext sqlContext = new JavaSQLContext(jc);
        streamData.foreachRDD(new Function<JavaRDD<Person>, Void>() {
            public Void call(JavaRDD<Person> rdd) {
                JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, Person.class);
                subscriberSchema.registerAsTable("people");
                System.out.println("all data");
                JavaSchemaRDD names = sqlContext.sql("SELECT name FROM people");
                System.out.println("afterwards");
                List<String> males = names.map(new Function<Row, String>() {
                    public String call(Row row) {
                        return row.getString(0);
                    }
                }).collect();
                System.out.println("before for");
                for (String name : males) {
                    System.out.println(name);
                }
                return null;
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
The JavaSQLContext is declared outside the foreachRDD call, but I still get a NotSerializableException:
    14/12/23 23:49:38 ERROR JobScheduler: Error running job streaming job 1419378578000 ms.1
    org.apache.spark.SparkException: Task not serializable
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
        at org.apache.spark.rdd.RDD.map(RDD.scala:271)
        at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:78)
        at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42)
        at com.basic.spark.NumberCount$2.call(NumberCount.java:79)
        at com.basic.spark.NumberCount$2.call(NumberCount.java:67)
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
        at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:529)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:529)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
        at scala.util.Try$.apply(Try.scala:161)
        at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
        at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:171)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
    Caused by: java.io.NotSerializableException: org.apache.spark.sql.api.java.JavaSQLContext
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1181)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1541)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1506)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1429)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1175)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
        at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
        ... 20 more
Any suggestions would be appreciated.
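
One likely explanation, judging from the trace (the failure is raised from the names.map(...) call inside NumberCount$2.call, and the serializer walks several nested objects before reaching JavaSQLContext): the anonymous Function<Row,String> is created inside an instance method of the foreachRDD callback, so it keeps a hidden reference to that callback, and the callback in turn holds sqlContext. The sketch below reproduces that shape in plain Java without Spark. FakeContext, Fn and Identity are hypothetical stand-ins invented for illustration (for JavaSQLContext, Spark's Function interface, and a replacement for the inner anonymous class); this is a minimal sketch of the suspected mechanism, not a confirmed diagnosis.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureCaptureSketch {

    /** Hypothetical stand-in for a non-serializable driver-side object such as JavaSQLContext. */
    static class FakeContext { }

    /** Hypothetical stand-in for a serializable function interface like Spark's Function. */
    interface Fn<A, B> extends Serializable {
        B call(A a);
    }

    /** Static nested class: it has no hidden reference to an enclosing instance. */
    static class Identity implements Fn<String, String> {
        public String call(String s) { return s; }
    }

    /** Returns true if plain Java serialization of obj succeeds. */
    static boolean isSerializable(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Mirrors the foreachRDD callback: an anonymous class that uses ctx and builds an inner function. */
    static Fn<String, Fn<String, String>> outer(final FakeContext ctx, final boolean useStatic) {
        return new Fn<String, Fn<String, String>>() {
            public Fn<String, String> call(String ignored) {
                ctx.hashCode(); // referencing ctx makes the compiler copy it into a field of this object
                if (useStatic) {
                    return new Identity();
                }
                // Anonymous class created inside an instance method: it keeps a reference
                // to the enclosing object, which in turn holds ctx.
                return new Fn<String, String>() {
                    public String call(String s) { return s; }
                };
            }
        };
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        System.out.println("anonymous inner: " + isSerializable(outer(ctx, false).call("x")));
        System.out.println("static nested:   " + isSerializable(outer(ctx, true).call("x")));
    }
}
```

If this is indeed the cause, moving the Function<Row,String> into a static nested class (or a top-level class), so that it no longer references the foreachRDD callback, should let the map closure serialize without dragging the JavaSQLContext along. The foreachRDD callback itself can keep using sqlContext, since the trace does not show that callback being serialized.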