python - How do I stratify data using Orange?

Question

I'm looking for some help from the Orange experts out there.

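I have a dataset of about 6 million rows. For simplicity, consider only two columns: one holds positive decimal numbers and is imported as a continuous value; the other holds discrete values (0 or 1), with a 30:1 ratio of 1s to 0s.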
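To make the imbalance concrete, here is a minimal sketch of what a trivial classifier that always predicts 1 would score on data with this 30:1 ratio. It is plain Python and does not use Orange; the function name `majority_baseline` is hypothetical, and treating 1 as the positive class is an assumption that may not match how Orange's scoring functions pick the target class.

```python
def majority_baseline(positives, negatives):
    """Scores of a classifier that always predicts the positive class (1).

    Assumption: class 1 is treated as the positive class.
    """
    total = positives + negatives
    accuracy = positives / float(total)
    precision = positives / float(total)  # every row is predicted positive
    recall = 1.0                          # no positive row is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# 30 rows of class 1 for every row of class 0, as in the dataset above.
accuracy, precision, recall, f1 = majority_baseline(30, 1)
```

Under these assumptions the accuracy is 30/31 (about 0.97) and F1 is 60/61 (about 0.98), so any learner that ends up predicting mostly 1 would report scores in this range regardless of how the folds are drawn.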

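I am using a classification tree (which I label "learner") to get a classifier. I then try to run cross-validation on the dataset while adjusting for the overwhelming 30:1 sample bias. I have tried several variations, but I keep getting the same results whether or not I stratify the data.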
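For reference, this is a rough sketch of what I understand stratification to mean: each fold keeps the overall class ratio, rather than the ratio being rebalanced. It uses only the standard library, not Orange, and the helper names (`stratified_fold_ids`, `random_fold_ids`, `class_counts_per_fold`) and the synthetic labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_fold_ids(labels, folds=5, seed=0):
    """Assign each row a fold id so every fold keeps the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fold_ids = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of one class round-robin across the folds.
        for position, index in enumerate(indices):
            fold_ids[index] = position % folds
    return fold_ids


def random_fold_ids(labels, folds=5, seed=0):
    """Assign fold ids at random, ignoring the class label."""
    rng = random.Random(seed)
    return [rng.randrange(folds) for _ in labels]


def class_counts_per_fold(labels, fold_ids, folds=5):
    """Count how many rows of each class land in each fold."""
    counts = [Counter() for _ in range(folds)]
    for label, fold in zip(labels, fold_ids):
        counts[fold][label] += 1
    return counts


# Synthetic labels with the same 30:1 ratio of 1s to 0s (hypothetical data).
labels = [1] * 30000 + [0] * 1000
stratified = class_counts_per_fold(labels, stratified_fold_ids(labels))
unstratified = class_counts_per_fold(labels, random_fold_ids(labels))
```

With this many rows, the purely random folds land close to the same 30:1 ratio as the stratified ones, which may be part of why toggling the flag makes no visible difference. Stratifying in this sense does not correct the imbalance itself.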

Below is my code, with the various lines I have tried commented out (I have used both True and False for the stratification argument):

import Orange
import os
import time
import operator

start = time.time()
print "Starting"
print ""

mydata = Orange.data.Table("testData.csv")

# This is used only for the test_with_indices method below
indicesCV = Orange.data.sample.SubsetIndicesCV(mydata)

# I only want the highest level classifier so max_depth=1
learner = Orange.classification.tree.TreeLearner(max_depth=1)

# These are the lines I've tried:
#res = Orange.evaluation.testing.cross_validation([learner], mydata, folds=5, stratified=True)
#res = Orange.evaluation.testing.proportion_test([learner], mydata, 0.8, 100, store_classifiers=1)
res = Orange.evaluation.testing.proportion_test([learner], mydata, learning_proportion=0.8, times=10, stratification=True, store_classifiers=1)
#res = Orange.evaluation.testing.test_with_indices([learner], mydata, indicesCV)

f = open('results.txt', 'a')
divString = "\n##### RESULTS (" + time.strftime("%Y-%m-%d %H:%M:%S") + ") #####"
f.write(divString)
f.write("\nAccuracy:     %.2f" %  Orange.evaluation.scoring.CA(res)[0])
f.write("\nPrecision:    %.2f" % Orange.evaluation.scoring.Precision(res)[0])
f.write("\nRecall:       %.2f" % Orange.evaluation.scoring.Recall(res)[0])
f.write("\nF1:           %.2f\n" % Orange.evaluation.scoring.F1(res)[0])

tree = learner(mydata)

f.write(tree.to_string(leaf_str="%V (%M out of %N)"))
print tree.to_string(leaf_str="%V (%M out of %N)")

end = time.time()
print "Ending"
timeStr = "Execution time: " + str((end - start) / 60) + " minutes"
f.write(timeStr)

f.close()

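Note: it may look as though there is a syntax error (stratified vs. stratification), but the program runs as is without raising any exceptions. Also, I know the documentation shows things like stratified=StratifiedIfPossible, but for some reason only boolean values work.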
