scala - Why does Spark ML NaiveBayes output labels that differ from the training data?

Question

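I'm using the NaiveBayes classifier in Apache Spark ML (version 1.5.1) to predict some text categories. However, the classifier outputs labels that differ from the labels in my training set. Am I doing something wrong?

Here is a small example that can be pasted into, for instance, a Zeppelin notebook: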

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = sqlContext.createDataFrame(Seq(
  (0L, "X totally sucks :-(", 100.0),
  (1L, "Today was kind of meh", 200.0),
  (2L, "I'm so happy :-)", 300.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and nb.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nb = new NaiveBayes()

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = sqlContext.createDataFrame(Seq(
  (4L, "roller coasters are fun :-)"),
  (5L, "i burned my bacon :-("),
  (6L, "the movie is kind of meh")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prediction: Double) =>
    println(s"($id, $text) --> prediction=$prediction")
  }

Output from the small program:

(4, roller coasters are fun :-)) --> prediction=2.0
(5, i burned my bacon :-() --> prediction=0.0
(6, the movie is kind of meh) --> prediction=1.0

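The set of predicted labels {0.0, 1.0, 2.0} is disjoint from my training set labels {100.0, 200.0, 300.0}.

Question: How can I map these predicted labels back to my original training set labels?

Bonus question: why do the training set labels have to be doubles, when any other type would work just as well as a label? It seems unnecessary.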

score 4 · Accepted Answer

However, the classifier outputs labels that differ from the labels in my training set. Am I doing something wrong?

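Kind of. As far as I can tell, you're hitting the issue described in SPARK-9137. Generally speaking, all classifiers in ML expect 0-based labels (0.0, 1.0, 2.0, ...), but ml.NaiveBayes has no validation step. Under the hood the data is passed to mllib.NaiveBayes, which doesn't have this limitation, so the training process runs smoothly.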
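To make the "0-based labels" expectation concrete, here is a minimal sketch of the kind of check that ml.NaiveBayes does not perform. `LabelCheck` and `isZeroBased` are hypothetical names introduced only for illustration; they are not part of Spark.

```scala
// Hypothetical helper, not part of Spark: returns true only when the
// distinct labels are exactly 0.0, 1.0, ..., k-1 for some k >= 1.
object LabelCheck {
  def isZeroBased(labels: Seq[Double]): Boolean = {
    val distinct = labels.distinct.sorted
    // Each sorted distinct label must equal its own position.
    distinct.nonEmpty && distinct.zipWithIndex.forall {
      case (label, index) => label == index.toDouble
    }
  }
}
```

For the labels in the question, `LabelCheck.isZeroBased(Seq(100.0, 200.0, 300.0))` is false, so running a check like this on the label column before fitting would flag the problem early.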

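When the model is converted back to ml, the prediction function simply assumes that the labels were correct and returns the predicted label using Vector.argmax, hence the results you get. You can fix the labels using, for example, StringIndexer.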
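The StringIndexer fix could look roughly like the sketch below. It assumes the `sqlContext`, `training`, and `test` values from the question, and a Spark version that ships `IndexToString` (1.5.0 or later, as far as I know). The column names `indexedLabel` and `predictedLabel` are my own choices, not anything the API requires.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, IndexToString, StringIndexer, Tokenizer}

// Assumption: `training` and `test` are the DataFrames defined in the question.

// Map the raw labels (100.0, 200.0, 300.0) to indices 0.0, 1.0, 2.0.
// Fitted up front so that its `labels` array can drive the reverse mapping.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(training)
val indexedTraining = labelIndexer.transform(training)

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Train on the 0-based indices instead of the raw labels.
val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

// Translate predicted indices back to the original label values.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, nb, labelConverter))

val model = pipeline.fit(indexedTraining)

model.transform(test)
  .select("id", "text", "predictedLabel")
  .show()
```

Note that StringIndexer casts numeric input to strings, so `predictedLabel` holds values such as "100.0" rather than doubles; call `.toDouble` on them if you need numbers. The indexer is applied to the training data outside the pipeline here so that the unlabeled `test` DataFrame never has to pass through a stage that reads the `label` column.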

Why do the training set labels have to be doubles, when any other type would work just as well as a label?

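I guess it is mostly a matter of keeping the API simple and reusable. This way LabeledPoint can be used for both classification and regression problems. Moreover, it is an efficient representation in terms of memory usage and computational cost.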
