python - Gradient Boosting Classifier loss function with sklearn - operands could not be broadcast together

Question

I'm having trouble with the estimator.loss_ method of the sklearn Gradient Boosting Classifier. I'm trying to plot the test error against the training error over time. Here is part of my data preparation:

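import numpy as np
from sklearn import preprocessing, cross_validation
from sklearn.ensemble import GradientBoostingClassifier as GBC

# convert data to numpy array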
train = np.array(shuffled_ds)

#label encode neighborhoods
for i in range(train.shape[1]):
    if i in [1,2]:
        print(i,list(train[1:5,i]))
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[:,i]))
        train[:,i] = lbl.transform(train[:,i])
print('neighborhoods & crimes encoded')

#create target vector
y_crimes = train[::,1]
train=np.delete(train,1,1)
print(y_crimes)

#arrays to float
train = train.astype(float)
y_crimes = y_crimes.astype(float)

#data holdout for testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    train, y_crimes, test_size=0.4, random_state=0)
print('test data created')

#train model and check train vs test error
print('begin training...')
est=GBC(n_estimators = 3000,learning_rate=.1,max_depth=4,max_features=1,min_samples_leaf=3)
est.fit(X_train,y_train)
print('done training')

At this point, when I print the array shapes with

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

I get:

(18000, 9)
(12000, 9)
(18000,)
(12000,)

respectively.

So, according to the sklearn documentation, my shapes are compatible. But next I fill a test score vector so that I can plot it against the training error:

test_score=np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test,pred)

I get the following error:

    return np.sum(-1 * (Y * pred).sum(axis=1) +
ValueError: operands could not be broadcast together with shapes (12000,47) (12000,)

I don't know where that 47 is coming from. I have used the same procedure on a different dataset before without any problems. Any help is greatly appreciated.

score 0 · Accepted Answer

You get this error because loss_ needs to be given the output of the staged_decision_function method, not staged_predict.
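To see where the (12000,47) shape comes from, here is a minimal NumPy-only sketch. It is modeled on the line shown in the traceback, not copied from the library: the one-hot matrix Y, the small sizes (6 samples and 3 classes standing in for 12000 and 47), and the log-sum-exp second term are my assumptions. The point is that the labels are one-hot encoded into one column per class, so pred must have one column per class too.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_classes = 6, 3  # small stand-ins for 12000 samples and 47 classes

# Encoded class labels, one integer per sample
y = rng.integers(0, n_classes, size=n_samples)

# One-hot encoding of the labels: shape (n_samples, n_classes)
Y = np.eye(n_classes)[y]

# Like staged_predict: one predicted label per sample, shape (n_samples,)
labels = rng.integers(0, n_classes, size=n_samples).astype(float)

# Like staged_decision_function (multiclass): one raw score per class,
# shape (n_samples, n_classes)
scores = rng.normal(size=(n_samples, n_classes))


def multinomial_deviance(Y, pred):
    """Illustrative deviance: -(score of true class) + log-sum-exp of scores.

    The second term is an assumption; the traceback line is truncated.
    """
    return np.sum(-1 * (Y * pred).sum(axis=1)
                  + np.log(np.exp(pred).sum(axis=1)))


# Passing 1-D labels reproduces the broadcasting error
try:
    multinomial_deviance(Y, labels)
    broadcast_failed = False
except ValueError as exc:
    broadcast_failed = True
    print(exc)

# Passing 2-D per-class scores works
deviance = multinomial_deviance(Y, scores)
print(deviance)
```

So the 47 is most likely the number of distinct classes in y_crimes. A dataset with only two classes would not go through this one-hot path, which may be why the same procedure worked before.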

See the scikit-learn example "Gradient Boosting regularization":

clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

# compute test set deviance
test_deviance = np.zeros((params['n_estimators'],), dtype=np.float64)

for i, y_pred in enumerate(clf.staged_decision_function(X_test)):
    # clf.loss_ assumes that y_test[i] in {0, 1}
    test_deviance[i] = clf.loss_(y_test, y_pred)
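For a multiclass problem like the one in the question, the sketch below checks the two staged outputs side by side and then tracks test error per boosting stage. It rests on several assumptions of mine: a recent scikit-learn (where sklearn.model_selection replaces cross_validation), synthetic data from make_classification in place of the crime dataset, and sklearn.metrics.log_loss on staged_predict_proba swapped in for loss_, since loss_ may not be available in newer releases. Note that log_loss is a mean per sample, whereas the traceback line computes a sum, so the scale differs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the real dataset (assumption)
X, y = make_classification(n_samples=600, n_features=9, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

est = GradientBoostingClassifier(n_estimators=30, learning_rate=0.1,
                                 max_depth=4, random_state=0)
est.fit(X_train, y_train)

# staged_predict yields class labels: one value per sample
label_shape = next(est.staged_predict(X_test)).shape

# staged_decision_function yields raw scores: one column per class
score_shape = next(est.staged_decision_function(X_test)).shape

print(label_shape, score_shape)

# Test error per boosting stage, using mean log loss on the
# staged class probabilities instead of est.loss_
test_score = np.empty(len(est.estimators_))
for i, proba in enumerate(est.staged_predict_proba(X_test)):
    test_score[i] = log_loss(y_test, proba, labels=est.classes_)
```

The test_score array can then be plotted against the stage index to compare with the training error.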
