scala - How to get the probability corresponding to a class from a Spark ML random forest

Question

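I am using org.apache.spark.ml.Pipeline for a machine learning task. Knowing the actual probabilities, not just the predicted label, is particularly important to me, and I am having trouble getting them. Here I am running a binary classification task with a random forest. The class labels are "Yes" and "No", and I would like to output the probability of the label "Yes". The pipeline output stores the probabilities in a DenseVector such as [0.69, 0.31], but I don't know which entry corresponds to "Yes" (0.69 or 0.31?). I guess there should be a way to retrieve this from labelIndexer?

Here is my code for training the model: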

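import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

val sc = new SparkContext(new SparkConf().setAppName(" ML").setMaster("local"))
val sqlContext = new SQLContext(sc)
val data = .... // load data from file
val df = sqlContext.createDataFrame(data).toDF("label", "features")

// Index the string labels ("Yes"/"No") as doubles.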
val labelIndexer = new StringIndexer()
                      .setInputCol("label")
                      .setOutputCol("indexedLabel")
                      .fit(df)

val featureIndexer = new VectorIndexer()
                        .setInputCol("features")
                        .setOutputCol("indexedFeatures")
                        .setMaxCategories(2)
                        .fit(df)


// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))


// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

// Create pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model
val model = pipeline.fit(trainingData)

// Save model
sc.parallelize(Seq(model), 1).saveAsObjectFile("/my/path/pipeline")

Next, I load the pipeline and make predictions on new data. Here is that part of the code:

// Ignoring loading data part

// Create DF
val testdf = sqlContext.createDataFrame(testData).toDF("features", "line")
// Load pipeline
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("/my/path/pipeline").first

// My question: how do I extract the probability that corresponds to a given class label (here "Yes")?
// This is my attempt. I would like to output the probability for the label "Yes" together with
// the predicted label. The probabilities are stored in a DenseVector, but I don't know which
// entry corresponds to "Yes". Something like this:
// (each element of the selected DataFrame is a Row, so the vector is read from column 0)
val predictions = model.transform(testdf).select("probability").map(row => row.getAs[DenseVector](0))

Reference on RF probabilities and labels: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forests
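
The idea of going through labelIndexer can be sketched as follows. StringIndexer assigns index 0 to the most frequent label, and entry i of the probability vector belongs to the class with index i, so the position of "Yes" in labelIndexer.labels is also its position in the vector. The helper below is a minimal, Spark-free sketch of that lookup; the object name, the function name, and the label order Array("No", "Yes") are assumptions made for illustration, since the real order depends on the label frequencies in the training data.

```scala
object LabelProbability {
  /** Returns the probability of `label`, given the label order learned by
    * StringIndexer and the probability values of one row in the same order.
    * Hypothetical helper, not part of the Spark API. */
  def probabilityOf(labels: Array[String], probability: Array[Double], label: String): Double = {
    require(labels.length == probability.length, "labels and probability must have the same length")
    val idx = labels.indexOf(label)
    require(idx >= 0, s"unknown label: $label")
    probability(idx)
  }

  def main(args: Array[String]): Unit = {
    // Assumed label order: "No" was more frequent, so it received index 0.
    val labels = Array("No", "Yes")
    // The probability vector from the question, as plain doubles.
    val probability = Array(0.69, 0.31)
    println(probabilityOf(labels, probability, "Yes")) // prints 0.31 under the assumed order
  }
}
```

With the pipeline from the question, labels would come from labelIndexer.labels (or from the fitted StringIndexerModel stage of the loaded PipelineModel), and probability from the "probability" column of each row, converted with toArray.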
