scala - How to do binary classification in Spark ML without StringIndexer

Question

I am trying to use the Spark ML DecisionTreeClassifier in a Pipeline without StringIndexer, because my labels are already indexed as (0.0; 1.0). Since DecisionTreeClassifier requires double values for the label, the following code should work:

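// Imports assume Spark 1.x, where LabeledPoint lives in the mllib package.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  val trainingDF = training.toDF()
  // Format of this DataFrame: [label: double, features: vector]

  // Treat features with at most 4 distinct values as categorical.
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)

  // The label column is used as-is, without a StringIndexer stage.
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")

  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}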

But what I actually get is:

java.lang.IllegalArgumentException:
DecisionTreeClassifier was given input with invalid label column label,
without the number of classes specified. See StringIndexer.

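Of course, I could just add a StringIndexer and let it work on my double "label" field, but I want to work with the rawPrediction output column of DecisionTreeClassifier to get the probabilities of 0.0 and 1.0 for each row, like this: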

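// Vector is org.apache.spark.mllib.linalg.Vector on Spark 1.x (org.apache.spark.ml.linalg.Vector on 2.x+).
val predictions = model.transform(singletonDF)
// Take the single row and read its rawPrediction vector; a DataFrame cannot be cast to a Vector directly.
val rawPrediction = predictions.select("rawPrediction").head.getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)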

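If I put a StringIndexer in the Pipeline, I no longer know which positions in the rawPrediction vector correspond to my input labels "0.0" and "1.0". This is because StringIndexer assigns indices by value frequency, which can vary from one dataset to the next.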
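The frequency-based ordering described above can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark's implementation: `frequencyIndex` is a hypothetical helper, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic.

```scala
// Plain-Scala simulation of frequency-descending label indexing.
// This is NOT Spark code; frequencyIndex is a hypothetical helper.
def frequencyIndex(labels: Seq[String]): Map[String, Double] =
  labels
    .groupBy(identity)
    .toSeq
    // Most frequent label first; ties broken alphabetically (an assumption
    // made here only so that the sketch is deterministic).
    .sortBy { case (label, occurrences) => (-occurrences.size, label) }
    .zipWithIndex
    .map { case ((label, _), index) => label -> index.toDouble }
    .toMap

// When "0.0" is the majority label it receives index 0.0 ...
val mostlyZeros = frequencyIndex(Seq("0.0", "0.0", "0.0", "1.0"))

// ... but when "1.0" is the majority label, the mapping flips.
val mostlyOnes = frequencyIndex(Seq("1.0", "1.0", "1.0", "0.0"))
```

With the same two label values, the index assigned to "0.0" depends on the class balance of the training data, which is why the position of each label in the prediction vector cannot be assumed in advance.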

Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probabilities of my original labels (0.0; 1.0) for each row.

score 6 · Accepted Answer

You can always set the required metadata manually:

import sqlContext.implicits._
import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val dfWithMeta = df.withColumn("label", $"label".as("label", meta))
pipeline.fit(dfWithMeta)
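To check that the metadata actually reached the label column, it can be read back from the schema. The following is a minimal sketch, assuming the `dfWithMeta` DataFrame from the snippet above; `labelValues` is a hypothetical helper name and not part of Spark.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: returns the label values recorded in the column's
// ML metadata, in index order, or None if no nominal metadata is attached.
def labelValues(df: DataFrame, labelCol: String = "label"): Option[Array[String]] =
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute => nominal.values
    case _                         => None
  }

// Usage (assumes dfWithMeta from the answer above):
// labelValues(dfWithMeta).foreach(values => println(values.mkString(", ")))
```

Because the label values themselves (0.0 and 1.0) serve as the class indices here, position 0 of the prediction vector should correspond to label 0.0 and position 1 to label 1.0, regardless of how often each label occurs.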
