apache-spark - How to use RandomForest in a Spark Pipeline

Question

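I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model has to be placed in a pipeline. The official pipeline demo uses LogisticRegression as the base model, which can be instantiated with new as an object. However, a RandomForest model cannot be created with new in client code, so it seems RandomForest cannot be used with the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks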

score 5 · Accepted Answer

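However, a RandomForest model cannot be created with new in client code, so it seems RandomForest cannot be used with the pipeline API.

That is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.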

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer
// Spark 1.x vector type; from Spark 2.0 on, spark.ml expects org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

// Label is kept as a string so that StringIndexer can index it below
case class Record(category: String, features: Vector)

// Load the sample data as an RDD[LabeledPoint] and split it 70/30
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

// Convert each LabeledPoint to a Record and build DataFrames
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Index the string category into a "label" column carrying nominal metadata
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

// Chain the indexer and the classifier into a single pipeline
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)

// Returns a DataFrame with the predictions appended
model.transform(testDF)
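
Since the question asks about grid search with cross-validation, here is a minimal sketch of how the pipeline above could be wrapped in a CrossValidator. It assumes the same spark-shell session (so pipeline, rf, trainDF and testDF are already defined) and Spark 1.5 or later, where RandomForestClassifier emits a rawPrediction column that BinaryClassificationEvaluator can consume. The grid values and the fold count are arbitrary choices for illustration, not recommendations.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate hyperparameters: 2 values x 2 values = 4 combinations (illustrative only)
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10))
  .addGrid(rf.maxDepth, Array(2, 4))
  .build()

// The whole pipeline is the estimator, so the indexer is refit on every fold
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator()) // default metric: areaUnderROC
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Fits every combination on every fold, then refits the best one on all of trainDF
val cvModel = cv.fit(trainDF)

// Predictions from the best model found
val predictions = cvModel.transform(testDF)
```

The sample data is binary, which is why a binary evaluator is used here; for more than two classes a different evaluator would be needed.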

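There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use labels extracted directly from LabeledPoints, but for some reason that doesn't work and pipeline.fit raises an IllegalArgumentException:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without them.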
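
To see what StringIndexer actually attaches, the column metadata can be inspected directly. This is a small sketch that assumes the indexer and trainDF defined in the code above; the exact JSON printed may differ between Spark versions, so no output is shown.

```scala
import org.apache.spark.ml.attribute.Attribute

// Fit the indexer alone and apply it, without running the classifier
val indexed = indexer.fit(trainDF).transform(trainDF)

// Raw metadata stored on the "label" column of the schema
val labelField = indexed.schema("label")
println(labelField.metadata)

// Decode the same metadata into an ML attribute; a nominal attribute
// is what tells the classifier how many classes there are
val labelAttr = Attribute.fromStructField(labelField)
println(labelAttr.isNominal)
```

A label column built straight from LabeledPoint.label is a plain double column without this metadata, which is consistent with the exception above.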
