When I try to create an H2O context with Spark 1.6.3, my code throws the following exception:
17/11/06 12:01:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[H2O Launcher thread,5,main]
java.lang.NoSuchMethodError: org.joda.time.DateTime.now()Lorg/joda/time/DateTime;
at water.util.Timer.nowAsLogString(Timer.java:38)
at water.util.Log.header(Log.java:163)
at water.util.Log.write0(Log.java:131)
at water.util.Log.write0(Log.java:124)
at water.util.Log.write(Log.java:109)
at water.util.Log.log(Log.java:86)
at water.util.Log.info(Log.java:72)
at water.H2OSecurityManager.<init>(H2OSecurityManager.java:57)
at water.H2OSecurityManager.instance(H2OSecurityManager.java:79)
at water.H2ONode.<init>(H2ONode.java:127)
Edit: I have attached my POM file. It is long, but it shows the dependencies. I believe something must be wrong with the dependencies.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>au.com.vroc.mdm</groupId>
<artifactId>mdm</artifactId>
<version>0.0.1-SNAPSHOT</version>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<java.version>1.8</java.version>
<gson.version>2.8.0</gson.version>
<java.home>${env.JAVA_HOME}</java.home>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.1.1</version>
<!-- <scope>provided</scope> -->
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/ai.h2o/h2o-core -->
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-core</artifactId>
<version>3.14.0.7</version>
<!-- <scope>runtime</scope> -->
</dependency>
<!-- https://mvnrepository.com/artifact/ai.h2o/h2o-algos -->
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-algos</artifactId>
<version>3.14.0.7</version>
<!-- <scope>runtime</scope> -->
</dependency>
<!-- https://mvnrepository.com/artifact/ai.h2o/h2o-genmodel -->
<dependency>
<groupId>ai.h2o</groupId>
<artifactId>h2o-genmodel</artifactId>
<version>3.14.0.7</version>
<!-- <scope>runtime</scope> -->
</dependency>
<!-- https://mvnrepository.com/artifact/ai.h2o/sparkling-water-core_2.10 -->
<dependency>
<!-- <groupId>ai.h2o</groupId> <artifactId>sparkling-water-core_2.10</artifactId>
<version>1.6.11</version> -->
<groupId>ai.h2o</groupId>
<artifactId>sparkling-water-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>${gson.version}</version>
</dependency>
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-client-http</artifactId>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>com.cloudera.livy</groupId>
<artifactId>livy-api</artifactId>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>it.unimi.dsi</groupId>
<artifactId>fastutil</artifactId>
<version>7.1.0</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.5</version>
</dependency>
<!-- <dependency> <groupId>jdk.tools</groupId> <artifactId>jdk.tools</artifactId>
<scope>system</scope> <version>1.8</version> <systemPath>${java.home}/lib/tools.jar</systemPath>
</dependency> -->
<!-- https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark -->
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-spark</artifactId>
<version>4.7.0-HBase-1.1</version>
<!-- <scope>provided</scope> -->
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.1.0-cdh5.4.0</version>
</dependency>
</dependencies>
<repositories>
<repository>
<id>cloudera.repo</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
<name>Cloudera Repositories</name>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<repository>
<id>Local repository</id>
<url>file://${basedir}/lib</url>
</repository>
</repositories>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.5.2</version>
<!-- <version>3.0.0</version> -->
<configuration>
<!-- get all project dependencies -->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!--<id>assemble-all</id> -->
<!-- bind to the packaging phase -->
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
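Since `DateTime.now()` only exists from joda-time 2.0 onwards, I suspect that one of the transitive dependencies is shadowing the newer joda-time that h2o-core 3.14 compiles against (`hive-exec` and `phoenix-spark` look like likely candidates, since both are older Hadoop-ecosystem artifacts). Running `mvn dependency:tree -Dincludes=joda-time` should show which artifact pulls in which version. One thing I am considering is pinning joda-time explicitly in the POM (the 2.9.9 version below is just an illustrative choice, not something I have confirmed):

```xml
<!-- Sketch: force a joda-time new enough for h2o-core.
     DateTime.now() was only added in joda-time 2.0;
     version 2.9.9 here is illustrative. -->
<dependency>
  <groupId>joda-time</groupId>
  <artifactId>joda-time</artifactId>
  <version>2.9.9</version>
</dependency>
```

Because Maven resolves version conflicts by "nearest wins", a direct dependency like this should take precedence over any joda-time pulled in transitively.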
Creating the model is done straightforwardly through the LivyClient, as follows:
public RegressionMetric call(JobContext ctx) throws Exception {
if (!checkInputValid()) {
throw new IllegalArgumentException("Mandatory parameters are not set");
} else {
RegressionMetric metric = new RegressionMetric();
Dataset<Row> sensordataDF = this.InitializeH2OModel(ctx);
SQLContext hc = ctx.sqlctx();
// Save the H2OContext so that we can extract the H2oFrames later
H2OContext h2oContext = H2OContext.getOrCreate(ctx.sc().sc());
//...
}
}
InitializeH2OModel(ctx) above is a complex function that generates the Spark frames used to train the model. The program runs correctly up to the line that starts the H2O context: `H2OContext h2oContext = H2OContext.getOrCreate(ctx.sc().sc());`.
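To verify which joda-time the job actually loads at runtime, I could run a small reflective check just before the `H2OContext.getOrCreate` call (`JodaCheck` below is just a throwaway diagnostic class, not part of my actual code):

```java
// Throwaway diagnostic: report whether the joda-time visible to this
// classloader has DateTime.now(), which was only added in joda-time 2.0.
public class JodaCheck {
    public static String jodaStatus() {
        try {
            Class<?> dt = Class.forName("org.joda.time.DateTime");
            dt.getMethod("now");
            // Also report which jar the class came from, to spot the culprit.
            Object src = dt.getProtectionDomain().getCodeSource();
            return "OK (loaded from " + src + ")";
        } catch (ClassNotFoundException e) {
            return "MISSING: joda-time is not on the classpath";
        } catch (NoSuchMethodException e) {
            return "OLD: a pre-2.0 joda-time shadows the version h2o-core needs";
        }
    }

    public static void main(String[] args) {
        System.out.println(jodaStatus());
    }
}
```

Logged from inside the Livy job, this would tell me whether the executors see an old joda-time and, if so, which jar it came from.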
These are the configuration parameters I add to Livy:
LivyClient client = new LivyClientBuilder().setURI(new URI(livyUrl)).setConf("spark.executor.instances", "9")
.setConf("spark.driver.memory", "20g")
.setConf("spark.driver.cores", "5")
.setConf("spark.executor.memory", "16g") // memory per executor
.setConf("spark.executor.cores", "5")
.setConf("spark.yarn.executor.memoryOverhead", "7000")
.setConf("spark.rdd.compress", "true")
.setConf("spark.default.parallelism", "3000")
.setConf("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setConf("spark.driver.extraJavaOptions", "-XX:+UseG1GC -XX:MaxPermSize=10000m -Xss5000m")
.setConf("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxPermSize=10000m -Xss5000m")
.setConf("spark.shuffle.compress", "true")
.setConf("spark.shuffle.spill.compress", "true")
.setConf("spark.kryoserializer.buffer.max", "1g")
.setConf("spark.shuffle.io.maxRetries", "6")
.setConf("spark.sql.shuffle.partitions", "7000")
.setConf("spark.sql.files.maxPartitionBytes", "5000")
.setConf("spark.driver.extraClassPath",
"/usr/hdp/2.6.2.0-205/phoenix/phoenix-4.7.0.2.6.2.0-205-client.jar:/usr/hdp/2.6.2.0-205/phoenix/phoenix-4.7.0.2.6.2.0-205-server.jar:/usr/hdp/2.6.2.0-205/phoenix/lib/phoenix-spark-4.7.0.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-common-1.1.2.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-server-1.1.2.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-server-1.1.2.2.6.2.0-205")
.setConf("spark.executor.extraClassPath",
"/usr/hdp/2.6.2.0-205/phoenix/phoenix-4.7.0.2.6.2.0-205-client.jar:/usr/hdp/2.6.2.0-205/phoenix/phoenix-4.7.0.2.6.2.0-205-server.jar:/usr/hdp/2.6.2.0-205/phoenix/lib/phoenix-spark-4.7.0.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-common-1.1.2.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-server-1.1.2.2.6.2.0-205.jar:/usr/hdp/2.6.2.0-205/hbase/lib/hbase-server-1.1.2.2.6.2.0-205")
.setConf("spark.ext.h2o.cluster.size", "-1")
.setConf("spark.ext.h2o.cloud.timeout", "60000")
.setConf("spark.ext.h2o.spreadrdd.retries", "-1")
.setConf("spark.ext.h2o.nthreads", "-1")
.setConf("spark.ext.h2o.disable.ga", "true")
.setConf("spark.ext.h2o.dummy.rdd.mul.factor", "10")
.setConf("spark.ext.h2o.fail.on.unsupported.spark.param", "false")
.setConf("spark.cassandra.input.split.size_in_mb", "64")
.setConf("spark.driver.maxResultSize", "3g")
.setConf("spark.network.timeout", "1000s")
.setConf("spark.executor.heartbeatInterval", "600s")
.build();
I am running the above on HDP 2.6.2 with Spark 2.1.1 in cluster mode.