tree - Predicting class probabilities for gradient boosted trees in Spark using the tree output

Question

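As far as I know, GBTs in Spark currently return only predicted labels.

I was thinking of computing the predicted probability for a class myself (say, for all the instances that fall under a particular leaf).

The code to build the GBT: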

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

//Importing the data
val data = sc.textFile("data/mllib/credit_approval_2_attr.csv") //using the credit approval data set from UCI machine learning repository

//Parsing the data
val parsedData = data.map { line =>
    val parts = line.split(',').map(_.toDouble)
    LabeledPoint(parts(0), Vectors.dense(parts.tail))
}

//Splitting the data
val splits = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training = splits(0).cache() 
val test = splits(1)

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 2 // We can use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 2
boostingStrategy.treeStrategy.maxBins = 32
boostingStrategy.treeStrategy.subsamplingRate = 0.5
boostingStrategy.treeStrategy.maxMemoryInMB = 1024
boostingStrategy.learningRate = 0.1

// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(training, boostingStrategy)  

model.toDebugString

To keep things simple, this gives two trees of depth 2, as shown below:

 Tree 0:
    If (feature 3 <= 2.0)
     If (feature 2 <= 1.25)
      Predict: -0.5752212389380531
     Else (feature 2 > 1.25)
      Predict: 0.07462686567164178
    Else (feature 3 > 2.0)
     If (feature 0 <= 30.17)
      Predict: 0.7272727272727273
     Else (feature 0 > 30.17)
      Predict: 1.0
  Tree 1:
    If (feature 5 <= 67.0)
     If (feature 4 <= 100.0)
      Predict: 0.5739387416147804
     Else (feature 4 > 100.0)
      Predict: -0.550117566730937
    Else (feature 5 > 67.0)
     If (feature 2 <= 0.0)
      Predict: 3.0383669122382835
     Else (feature 2 > 0.0)
      Predict: 0.4332824083446489

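My question is: can I use the trees above to calculate predicted probabilities like this?

For every instance in the feature set used for prediction:

exp(leaf score from tree 0 + leaf score from tree 1) / (1 + exp(leaf score from tree 0 + leaf score from tree 1))

This gives me a kind of probability, but I am not sure it is the right way to do it. Also, if there is any documentation that explains how the leaf scores (predictions) are calculated, I would be really grateful if someone could share it.

Any suggestions would be great.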

score 2 · Accepted Answer

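Here is my approach using Spark's internal dependencies. You will need to import the linear algebra library for the matrix operation later on, which multiplies the tree predictions by the learning rate.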

import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}

Suppose you build a model with GBT:

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

To calculate the probabilities using the model object:

// Get the log odds predictions from each tree
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }

// Transform the arrays into matrices for multiplication
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)

// Calculate probability by ensembling the log odds
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
classProb.collect

// You may tweak your decision boundary for different class labels
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect
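
For a single row, the matrix product above reduces to a dot product between the per-tree predictions and `model.treeWeights`. The following is a minimal Spark-free sketch of that step, in case it helps to see it without `RowMatrix`. The object name `WeightedMargin` and the sample numbers are made up for illustration and do not come from a trained model.

```scala
object WeightedMargin {
  // Weighted sum of per-tree predictions for one row. This is the value
  // that one row of treePredictionsMatrix.multiply(learningRateMatrix) holds.
  def margin(treePredictions: Array[Double], treeWeights: Array[Double]): Double = {
    require(treePredictions.length == treeWeights.length, "one weight per tree")
    treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
  }

  // Logistic function, written the same way as in the snippet above.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  def main(args: Array[String]): Unit = {
    // Hypothetical values: three tree outputs and three tree weights.
    val predictions = Array(0.5, -0.2, 0.1)
    val weights = Array(1.0, 0.1, 0.1)
    val m = margin(predictions, weights)
    println(s"margin = $m, sigmoid(margin) = ${sigmoid(m)}")
  }
}
```

On an RDD the same idea would be a `zip` of `model.trees` and `model.treeWeights` inside the `map`, which avoids building a distributed matrix for what is a per-row sum.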

Here is a code snippet that you can copy and paste directly into spark-shell:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.linalg.distributed.{RowMatrix}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Load and parse the data file.
val csvData = sc.textFile("data/mllib/sample_tree_data.csv")
val data = csvData.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GBT model.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 50
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 6
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Get class label from raw predict function
val predictedLabels = model.predict(testData.map(_.features))
predictedLabels.collect

// Get class probability
val treePredictions = testData.map { point => model.trees.map(_.predict(point.features)) }
val treePredictionsVector = treePredictions.map(array => Vectors.dense(array))
val treePredictionsMatrix = new RowMatrix(treePredictionsVector)
val learningRate = model.treeWeights
val learningRateMatrix = Matrices.dense(learningRate.size, 1, learningRate)
val weightedTreePredictions = treePredictionsMatrix.multiply(learningRateMatrix)
val classProb = weightedTreePredictions.rows.flatMap(_.toArray).map(x => 1 / (1 + Math.exp(-1 * x)))
val classLabel = classProb.map(x => if (x > 0.5) 1.0 else 0.0)
classLabel.collect

score 0 · Accepted Answer

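Actually, I was able to predict the probabilities using the trees and the formulation given in the question. I checked the result against the predicted labels that the GBT outputs: with a threshold of 0.5, they match exactly.

So we do the same thing with a slight change.

For every instance in the feature set used for prediction:

exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1) / (1 + exp(leaf score from tree 0 + (learning_rate) * leaf score from tree 1))

This essentially gives the predicted probability.

I tested the same approach with three trees of depth 3, and it worked. I also tried it with different data sets.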
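
As a rough illustration of this formula, here is a sketch that transcribes the two trees from the question by hand into plain Scala functions and applies it. The object name `QuestionTrees` and the feature vectors passed to it are assumptions for the example, not rows from the credit approval data. Note that thresholding any logistic-style transform at 0.5 is the same as checking the sign of the summed score, so agreement with the predicted labels confirms the sign of the margin but says nothing about whether the probability values are calibrated.

```scala
object QuestionTrees {
  // Tree 0, copied from the toDebugString output in the question.
  def tree0(f: Array[Double]): Double =
    if (f(3) <= 2.0) {
      if (f(2) <= 1.25) -0.5752212389380531 else 0.07462686567164178
    } else {
      if (f(0) <= 30.17) 0.7272727272727273 else 1.0
    }

  // Tree 1, copied from the same output.
  def tree1(f: Array[Double]): Double =
    if (f(5) <= 67.0) {
      if (f(4) <= 100.0) 0.5739387416147804 else -0.550117566730937
    } else {
      if (f(2) <= 0.0) 3.0383669122382835 else 0.4332824083446489
    }

  // Learning rate as set in the question's boostingStrategy.
  val learningRate = 0.1

  // Summed score: tree 0 unweighted, tree 1 scaled by the learning rate.
  def margin(f: Array[Double]): Double = tree0(f) + learningRate * tree1(f)

  // The formula from this answer: exp(m) / (1 + exp(m)).
  def probability(f: Array[Double]): Double = {
    val e = math.exp(margin(f))
    e / (1.0 + e)
  }

  // Two ways to get a label; they agree for every input.
  def labelFromProbability(f: Array[Double]): Double =
    if (probability(f) > 0.5) 1.0 else 0.0
  def labelFromMargin(f: Array[Double]): Double =
    if (margin(f) > 0.0) 1.0 else 0.0
}
```

For a hypothetical instance such as `Array(25.0, 0.0, 2.0, 3.0, 50.0, 10.0)`, tree 0 lands in the 0.727 leaf and tree 1 in the 0.574 leaf, so the margin is about 0.785 and the formula gives roughly 0.69.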

It would be great to know whether anyone else has already tried this. If not, feel free to try it and comment.

score 0 · Accepted Answer

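Actually, @hbghhy is wrong and @Run2 is right: Spark uses twice the binomial negative log-likelihood as its loss, whereas Friedman uses the binomial negative log-likelihood itself as the loss on page 9 of "Greedy Function Approximation".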

/**
 * :: DeveloperApi ::
 * Class for log loss calculation (for classification).
 * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
 *
 * The log loss is defined as:
 *   2 log(1 + exp(-2 y F(x)))
 * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
 */
@Since("1.2.0")
@DeveloperApi
object LogLoss extends ClassificationLoss {

  /**
   * Method to calculate the loss gradients for the gradient boosting calculation for binary
   * classification
   * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
   * @param prediction Predicted label.
   * @param label True label.
   * @return Loss gradient
   */
  @Since("1.2.0")
  override def gradient(prediction: Double, label: Double): Double = {
    - 4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
  }

  override private[spark] def computeError(prediction: Double, label: Double): Double = {
    val margin = 2.0 * label * prediction
    // The following is equivalent to 2.0 * log(1 + exp(-margin)) but more numerically stable.
    2.0 * MLUtils.log1pExp(-margin)
  }
}
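
To see what this loss implies for probabilities, here is a short derivation sketched from the loss definition quoted above (my own working, not taken from the Spark documentation). Write p = P(y = 1 | x) and minimise the expected loss pointwise over F(x):

```latex
\begin{align*}
\mathbb{E}[L \mid x]
  &= 2p \log\bigl(1 + e^{-2F}\bigr) + 2(1-p) \log\bigl(1 + e^{2F}\bigr) \\
\frac{\partial}{\partial F}\,\mathbb{E}[L \mid x]
  &= -\frac{4p}{1 + e^{2F}} + \frac{4(1-p)}{1 + e^{-2F}} = 0 \\
\Rightarrow\quad p &= (1-p)\, e^{2F} \\
\Rightarrow\quad F(x) &= \tfrac{1}{2} \log\frac{p}{1-p},
\qquad
p = \frac{1}{1 + e^{-2F(x)}}
\end{align*}
```

So under this loss the summed tree output F(x) behaves like half the log-odds, which suggests using 1 / (1 + exp(-2 F(x))) rather than 1 / (1 + exp(-F(x))) when converting it to a probability. Both give the same labels at a 0.5 threshold, since only the sign of F(x) matters there, but the probability values differ.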
