python - scikit-learn balanced subsampling

Question

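I am trying to create N balanced random subsamples of my large unbalanced dataset. Is there a simple way to do this with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?

The subsamples should be random and may overlap, since each one is fed to a separate classifier in a very large ensemble of classifiers.

Weka has a tool called spreadsubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting, but that is not what I am looking for.)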

score 9 · Accepted Answer

A version for a pandas Series:

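import numpy as np

def balanced_subsample(y, size=None):
    """Return index labels of a class-balanced random subsample of Series y."""
    counts = y.value_counts()

    if size is None:
        # Default: take as many samples per class as the rarest class has.
        n_smp = counts.min()
    else:
        # Otherwise split the requested total evenly across the classes.
        n_smp = int(size / len(counts.index))

    subsample = []
    for label in counts.index:
        samples = y[y == label].index.values
        # Draw n_smp distinct positions within this class (no replacement).
        indexes = np.random.choice(samples.shape[0], size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample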
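
For the "N subsamples" part of the question, the same idea can be repeated in a loop. The following is only a sketch, assuming pandas 1.1 or later (where GroupBy.sample is available); the helper name balanced_subsamples and the seed handling are illustrative and not part of the answer above.

```python
import pandas as pd

def balanced_subsamples(y, n_subsamples, seed=0):
    """Return a list of index arrays, one per class-balanced subsample.

    Hypothetical helper: every class is undersampled to the size of the
    rarest class, and different subsamples may overlap.
    """
    n_per_class = int(y.value_counts().min())
    subsamples = []
    for i in range(n_subsamples):
        # Sample n_per_class rows from each class, without replacement.
        picked = y.groupby(y).sample(n=n_per_class, random_state=seed + i)
        subsamples.append(picked.index.to_numpy())
    return subsamples

# Toy data: 90 negatives and 10 positives.
y = pd.Series([0] * 90 + [1] * 10)
subs = balanced_subsamples(y, n_subsamples=3)
for idx in subs:
    print(y.loc[idx].value_counts().to_dict())
```

Each returned array holds index labels, so it can be used with .loc on a feature DataFrame that shares the same index.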

score 5 · Accepted Answer

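This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems closest to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same imbalance that is in your main dataset. This is not what you are looking for, but you may be able to reuse the code in it and change the imposed ratio so that it is always 50/50.

(If you feel up to it, this would probably be a very good contribution to scikit-learn.)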
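
A splitter with a fixed, equal class ratio could look roughly like the sketch below. It is not scikit-learn code and does not reuse StratifiedShuffleSplit; the generator name balanced_index_batches and its parameters are made up for illustration, and it uses only NumPy.

```python
import numpy as np

def balanced_index_batches(y, n_batches, seed=None):
    """Yield n_batches index arrays with the same number of samples per class.

    Hypothetical helper: every class is undersampled (without replacement
    within a batch) to the size of the rarest class.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    class_indices = [np.flatnonzero(y == c) for c in classes]
    for _ in range(n_batches):
        batch = np.concatenate(
            [rng.choice(idx, size=n_per_class, replace=False) for idx in class_indices]
        )
        rng.shuffle(batch)  # mix the classes within the batch
        yield batch

# Toy data: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)
batches = list(balanced_index_batches(y, n_batches=4, seed=0))
for batch in batches:
    print(len(batch), np.bincount(y[batch]))
```

Batches are drawn independently, so they can overlap, which matches the ensemble use case described in the question.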

score 3 · Accepted Answer

Here is a version of the above code that works for multiclass groups (in my test case, groups 0, 1, 2, 3, 4).

import numpy as np

def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """Return a balanced data set by sampling sample_size observations from every class.

    The current version was developed on the assumption that the positive
    class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for level in uniq_levels:
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # sample each class with replacement (oversamples classes smaller than sample_size)
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This also returns the indices, so they can be used on other datasets and to keep track of how often each data point was used (helpful for training).
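
One way to do that usage tracking is sketched below. It assumes index lists of the kind returned as the third element above; to stay self-contained, the sketch draws them directly with NumPy rather than calling balanced_sample_maker, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 8 negatives and 2 positives.
y = np.array([0] * 8 + [1] * 2)
sample_size = 4   # samples drawn per class in each round
n_rounds = 50

usage = np.zeros(len(y), dtype=int)
for _ in range(n_rounds):
    # Draw sample_size indices per class with replacement, as in the answer.
    idx = np.concatenate(
        [rng.choice(np.flatnonzero(y == c), size=sample_size, replace=True)
         for c in np.unique(y)]
    )
    # Count how many times each observation appeared in this round.
    usage += np.bincount(idx, minlength=len(y))

print(usage)
```

Because every class contributes the same number of draws, observations from the minority class accumulate much higher counts than those from the majority class.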

score 0 · Accepted Answer

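This has already been answered, but I came across your question while looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose: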

from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle = True)

batches = []
for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])

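Including the _ is important: skf.split() is meant for building stratified folds for K-fold cross-validation, so it returns two lists of indices, train ((n - 1) / n of the elements) and test (1 / n of the elements).

Note that this is as of sklearn 0.18. In sklearn 0.17 the same function can be found in the cross_validation module instead.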
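
To see what the test folds contain, here is a small sketch on toy data, assuming scikit-learn 0.18 or later; the arrays and the fold_class_counts variable are made up for illustration. Note that the folds are stratified, so each one keeps the class ratio of y instead of being rebalanced.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_class_counts = []
for _, batch in skf.split(X, y):
    # batch holds the test indices of this fold (1 / n of the data).
    fold_class_counts.append(np.bincount(y[batch]).tolist())

print(fold_class_counts)
```

The folds are disjoint, so unlike independently drawn subsamples they never overlap.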

score 0 · Accepted Answer

My version of a subsampler, hope this helps:

import random

def subsample_indices(y, size):
    # map each target value to the positions where it occurs
    indices = {}
    target_values = set(y)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # keep at most `size` positions per class, capped by the rarest class
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])
