class - Trying to balance my dataset using sample_weight in scikit-learn

Question

I'm using RandomForest for classification, and I have an unbalanced dataset: 5830 "no" and 1006 "yes". I tried to balance the dataset with class_weight and sample_weight, but I can't.

My code is:

X_train, X_test, y_train, y_test = train_test_split(arrX, y, test_size=0.25)
cw = 'auto'
clf = RandomForestClassifier(class_weight=cw)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, fit_params={'sample_weight': sw})

But my TPR, FPR, and ROC don't improve when I use class_weight and sample_weight.

Why? Am I doing something wrong?

Nevertheless, if I use the function called balanced_subsample, my ratios improve a lot:

def balanced_subsample(x,y,subsample_size):

    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems*subsample_size)

    xs = []
    ys = []

    for ci,this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs,ys

My new code is:

X_train_subsampled, y_train_subsampled = balanced_subsample(arrX, y, 0.5)
X_train, X_test, y_train, y_test = train_test_split(X_train_subsampled, y_train_subsampled, test_size=0.25)
cw = 'auto'
clf = RandomForestClassifier(class_weight=cw)
param_grid = {'n_estimators': [10, 50, 100, 200, 300], 'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv=10, fit_params={'sample_weight': sw})

Thanks

score 2 · Accepted Answer

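This is not a full answer yet, but hopefully it will help you get there.

First, some general remarks:

To debug this kind of issue, deterministic behaviour often helps. You can pass the random_state parameter to RandomForestClassifier and to other scikit-learn objects with inherent randomness to get the same result on every run. You'll also need: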
```
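import random

import numpy as np

# Seed both generators with a fixed value (0 is arbitrary). Calling seed()
# with no argument reseeds from OS entropy and does not make runs repeatable.
np.random.seed(0)
random.seed(0)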
```

for your balanced_subsample function to behave the same way on every run.
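
As a minimal sketch of this advice, the snippet below seeds every random step and checks that two runs agree. It assumes a synthetic dataset from make_classification in place of the asker's arrX and y (which are not shown), and the seed value 0 and the helper name fit_and_predict are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 0  # arbitrary; any fixed integer works

# Synthetic stand-in for the asker's data, roughly 85% class 0 and 15% class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=SEED
)


def fit_and_predict(seed):
    """Split, fit and predict with every source of randomness seeded."""
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)


first = fit_and_predict(SEED)
second = fit_and_predict(SEED)
print(np.array_equal(first, second))  # True when both runs are identical
```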

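Don't grid search on n_estimators: in a random forest, adding trees essentially never hurts accuracy, it only makes training slower.
Note that sample_weight and class_weight have a similar objective: the actual sample weights will be sample_weight * the weights inferred from class_weight.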
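
To see how the two settings combine, here is a rough calculation for the class counts given in the question. It assumes the "balanced" heuristic documented in current scikit-learn, n_samples / (n_classes * count); the older 'auto' option used in the question may have normalized the weights differently.

```python
# Class counts from the question: 5830 "no" (class 0) and 1006 "yes" (class 1).
counts = {0: 5830, 1: 1006}
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" class weights: n_samples / (n_classes * count_of_class).
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Per-sample weights from the question: 1 for class 0, 8 for class 1.
sample_weight = {0: 1, 1: 8}

# Effective weight of one sample = sample_weight * class_weight.
effective = {c: sample_weight[c] * class_weight[c] for c in counts}

# Total weight each class contributes to training.
total = {c: effective[c] * counts[c] for c in counts}

for c in counts:
    print(c, round(class_weight[c], 3), round(effective[c], 3), round(total[c]))
```

Under that assumption the class weights alone already equalize the two classes, so the extra factor of 8 in sample_weight leaves class 1 carrying eight times the total weight of class 0 rather than an equal share.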

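Could you try:

using subsample_size=1 in your balanced_subsample function. Unless there's a particular reason not to, it's better to compare results on a similar number of samples.
using your subsampling strategy with class_weight and sample_weight both set to None.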
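
One way to set up that comparison is sketched below on synthetic data. The undersample and tpr_fpr helpers are hypothetical names, with undersample playing the role of balanced_subsample with subsample_size=1. The sketch also splits before subsampling so that both models are scored on the same untouched test set, which is an assumption about what makes the comparison fair and differs from the asker's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

SEED = 0
rng = np.random.RandomState(SEED)

# Synthetic stand-in for the asker's data (about 85% class 0, 15% class 1).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=SEED
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)


def undersample(X, y, rng):
    """Keep as many samples of each class as the rarest class has."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


def tpr_fpr(clf, X_eval, y_eval):
    """True-positive and false-positive rates, treating class 1 as positive."""
    cm = confusion_matrix(y_eval, clf.predict(X_eval), labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fn), fp / (fp + tn)


# Strategy 1: balanced subsample of the training set, no weights at all.
X_sub, y_sub = undersample(X_train, y_train, rng)
clf_sub = RandomForestClassifier(
    n_estimators=100, class_weight=None, random_state=SEED
)
clf_sub.fit(X_sub, y_sub)  # sample_weight left at its default of None

# Strategy 2: full training set with class weights only.
clf_cw = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=SEED
)
clf_cw.fit(X_train, y_train)

print("subsampled:", tpr_fpr(clf_sub, X_test, y_test))
print("weighted:  ", tpr_fpr(clf_cw, X_test, y_test))
```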

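EDIT: Reading your comment again, I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and so it produces more false positives (while also getting more true positives, of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.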
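
The trade-off follows from the definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below uses made-up labels and scores, and lowers the decision threshold as a stand-in for weighting class 1 more heavily. Both are illustrative assumptions, not the asker's data or what the forest does internally.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.45, 0.3, 0.25, 0.2, 0.1, 0.1, 0.05]

# A lower threshold means class 1 is predicted more readily.
results = []
for threshold in (0.5, 0.35, 0.15):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tpr, fpr = tpr_fpr(y_true, y_pred)
    results.append((tpr, fpr))
    print(f"threshold={threshold}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

In this toy example both rates rise together as class 1 is favoured more, which is the pattern described above.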
