apache-spark - PySpark & MLLib: Class probabilities of random forest predictions

Question

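I'm trying to extract the class probabilities of a random forest object that I trained using PySpark. However, I can't find an example of this anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here is the sample code from the documentation, which only gives the final class (not the probability):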

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

I don't see any model.predict_proba() method, so what should I do?

score 11 · Accepted Answer

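As far as I can tell, this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only "predict" functions which, in turn, call their respective Scala counterparts (treeEnsembleModels.scala). The latter make their decision by taking a vote among the binary decisions of the individual trees. A much cleaner solution would have been to provide a probabilistic prediction, which can be thresholded arbitrarily or used for ROC computation as in sklearn. This feature should be added in a future release.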
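
To make the difference concrete, here is a minimal plain-Python sketch (no Spark involved) of a hard majority vote versus a soft score for a single test point. The function names and the vote list are hypothetical, and the way ties are broken here is an arbitrary choice of this sketch, not a statement about what Spark does. The vote fraction is only an estimate of the class probability, not a calibrated one.

```python
def majority_vote(tree_votes):
    """Hard decision: 1 if more than half of the trees vote 1, else 0."""
    return 1 if sum(tree_votes) * 2 > len(tree_votes) else 0


def vote_fraction(tree_votes):
    """Soft score: fraction of trees that vote for class 1."""
    return sum(tree_votes) / float(len(tree_votes))


# Hypothetical 0/1 votes from a 5-tree forest for one test point.
votes = [1, 0, 1, 1, 0]

hard = majority_vote(votes)   # the only thing a plain predict() exposes
soft = vote_fraction(votes)   # what a predict_proba-style function would return

# With the soft score the caller chooses the operating point.
strict = 1 if soft >= 0.8 else 0
```

With three of five trees voting for class 1, the hard vote says 1, while a caller who requires at least 80% agreement would label the same point 0.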

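As a workaround, I implemented predict_proba as a pure Python function (see the example below). It is neither elegant nor very efficient, since it runs a loop over the set of individual decision trees in the forest. The trick (or rather a dirty hack) is to access the array of Java decision tree models and cast them into their Python counterparts. After that, you can compute each individual model's predictions over the entire dataset and accumulate their sum in an RDD using "zip". Dividing by the number of trees gives the desired result. For large datasets, a loop over a small number of decision trees on the master node should be acceptable.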

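The code below is rather tricky because of the difficulties of integrating Python with Spark (which runs on the JVM). You have to be very careful not to send any complex data to the worker nodes, since that results in crashes due to serialization problems. No code that refers to the Spark context can run on a worker node, and no code that refers to Java objects can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below, but that would pull the Java array into the function shipped to the workers, and it fails. Writing such a wrapper in Java/Scala could be much more elegant, for example by running the loop over the decision trees on the worker nodes and thereby reducing communication costs.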
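
The serialization pitfall can be illustrated without Spark. The sketch below is only an analogy: FakeJavaHandle is a hypothetical stand-in for a py4j object, and can_pickle uses the standard pickle module as a rough proxy for what Spark must do to ship a captured value to a worker. It is not how PySpark serializes closures internally.

```python
import pickle
import threading


class FakeJavaHandle(object):
    """Hypothetical stand-in for a py4j object: it owns a resource
    (here a lock) that cannot be pickled."""

    def __init__(self, n):
        self.n = n
        self._lock = threading.Lock()

    def __len__(self):
        return self.n


def can_pickle(obj):
    """Return True if obj survives pickling, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


trees = FakeJavaHandle(50)  # plays the role of rf_model._java_model.trees()
ntrees = len(trees)         # a plain int, computed once on the driver

print(can_pickle(ntrees))   # the int can be shipped to workers
print(can_pickle(trees))    # the handle cannot
```

This is why the workaround computes ntrees once on the driver and lets the lambda capture only that integer.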

The test function below demonstrates that predict_proba gives the same test error as the predict call used in the original example.

from pyspark.mllib.tree import DecisionTreeModel, RandomForest


def predict_proba(rf_model, data):
    '''
    This wrapper overcomes the "binary" nature of predictions in the native
    RandomForestModel.
    '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    ntrees = rf_model.numTrees()
    scores = DecisionTreeModel(trees[0]).predict(data.map(lambda x: x.features))

    # For each decision tree, apply its prediction to the entire dataset and
    # accumulate the results using 'zip'.
    for i in range(1,ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(data.map(lambda x: x.features)))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores over the number of trees
    return scores.map(lambda x: x/ntrees)

def testError(lap):
    # lap is an RDD of (label, prediction) pairs.
    testErr = lap.filter(lambda vp: vp[0] != vp[1]).count() / float(lap.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model,testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)
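
Since ROC computation was one of the motivations for wanting scores, here is a minimal plain-Python sketch of how thresholding such scores yields points on an ROC curve. The function roc_points and the sample labels and scores are hypothetical; on a real dataset you would collect (or sample) the labels and the output of predict_proba first, or use a library routine instead.

```python
def roc_points(labels, scores, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per
    threshold. A point is predicted positive when score >= threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y != 1)
        tpr = tp / float(positives) if positives else 0.0
        fpr = fp / float(negatives) if negatives else 0.0
        points.append((fpr, tpr))
    return points


# Hypothetical labels and averaged tree votes for six test points.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

curve = roc_points(labels, scores, thresholds=[0.0, 0.5, 1.1])
```

The threshold convention matches the test function above, where a score below the threshold maps to class 0.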

All in all, this was a nice exercise for learning Spark.

score 6 · Accepted Answer

This is now available.

Spark ML provides:

a predictionCol that contains the predicted label,
a probabilityCol that contains a vector with the probabilities for each label, which is what you are looking for.
You can also access the raw counts.

For more details, see the Spark documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns-predictions
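
As a minimal sketch of what that looks like with the DataFrame-based API: the helper name class_probabilities and its num_trees parameter are made up for illustration, and the code assumes DataFrames with the default "label" and "features" columns (which is what the libsvm reader produces). The output column names used below are the defaults.

```python
def class_probabilities(train_df, test_df, num_trees=10):
    """Fit a random forest with Spark ML and return, for each test row,
    the predicted label, the class-probability vector, and the raw scores."""
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=num_trees)
    model = rf.fit(train_df)
    # transform() appends the output columns to the input DataFrame.
    return model.transform(test_df).select(
        "prediction", "probability", "rawPrediction")


# Possible usage (requires a Spark installation and the sample data file):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
#   train, test = data.randomSplit([0.7, 0.3])
#   class_probabilities(train, test).show(5, truncate=False)
```

Note that this uses pyspark.ml rather than the pyspark.mllib API from the question, so an RDD of LabeledPoint would first have to be converted to a DataFrame.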

score 1 · Accepted Answer

This will, however, be available with Spark 1.5.0 and the new Spark-ML API.

answered 2015-08-04T09:50:00.080
