machine-learning - Ensemble of different kinds of regressors using scikit-learn (or any other Python framework)

Question

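I am trying to solve a regression task. I found that three models work well on different subsets of the data: LassoLARS, SVR, and Gradient Tree Boosting. When I make predictions with all three models and build a table of the "true output" next to the three model outputs, I see that each time at least one of the models is very close to the true output, while the other two can be relatively far off.

When I compute the minimal possible error (taking, for each test example, the prediction from the "best" predictor), I get an error much smaller than the error of any single model. So I thought about combining the predictions of these three different models into some kind of ensemble. The question is how to do this properly. All three models are built and tuned with scikit-learn; does it provide a method for packing models into an ensemble? The catch is that I don't want to simply average the predictions of the three models. I want a weighted combination, where the weights are determined by the properties of the specific example.

Even if scikit-learn does not provide such functionality, it would be helpful to know how to properly address this task of figuring out the weight of each model for each example in the data. I think it could be done by a separate regressor built on top of the three models that tries to output optimal weights for each of them, but I am not sure this is the best way to do it.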
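The "minimal possible error" described above can be sketched as follows. This is only an illustration on made-up data: `y_true`, `preds`, and the noise levels are assumptions, not the asker's data. Note that this oracle value is a lower bound rather than something an ensemble can reach directly, because picking the best model per example requires knowing the true output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: true targets and predictions from three models.
y_true = rng.normal(size=200)
preds = np.column_stack(
    [y_true + rng.normal(scale=s, size=200) for s in (0.5, 0.7, 0.9)]
)  # shape (n_samples, n_models)

# Absolute error of every model on every example.
abs_err = np.abs(preds - y_true[:, None])

# Mean absolute error of each model on its own.
per_model_mae = abs_err.mean(axis=0)

# "Oracle" error: for every example, take whichever model was closest.
oracle_mae = abs_err.min(axis=1).mean()

print(per_model_mae, oracle_mae)
```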

score 20 · Accepted Answer

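This is a known, interesting (and often painful!) problem with hierarchical prediction. The trouble with training a number of predictors on the training data and then training a higher-level predictor on top of them, again using the training data, has to do with the bias-variance decomposition.

Suppose you have two predictors, one of which is essentially an overfitting version of the other. On the training set, the former will appear better than the latter. The combining predictor will then favor the former for no good reason, simply because it cannot distinguish overfitting from genuinely high-quality prediction.

The known way of dealing with this is to prepare, for each row of the training data and for each predictor, a prediction for that row from a model that was not fit on that row. For the overfitting version, for example, this will not produce a good result for the row on average. The combining predictor is then better able to assess a fair model for combining the lower-level predictors.

Shahar Azulay and I wrote a transformer stage for dealing with this: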
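As a minimal sketch of the out-of-fold idea above, and not the answer's own code, scikit-learn's `cross_val_predict` (in `sklearn.model_selection`, so this assumes a release that has that module) returns, for every row, a prediction from a model fitted without that row's fold. The data, the 5-fold split, and the variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
y = rng.normal(size=80)
X = (y + 0.1 * rng.normal(size=80)).reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each entry of oof_* is predicted by a model that never saw that row.
oof_gbr = cross_val_predict(GradientBoostingRegressor(random_state=0), X, y, cv=cv)
oof_lin = cross_val_predict(LinearRegression(), X, y, cv=cv)

# The combining (higher-level) model is trained on out-of-fold columns only.
meta_X = np.column_stack([oof_gbr, oof_lin])
meta = LinearRegression().fit(meta_X, y)
print(meta.coef_)
```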

import warnings

import numpy as np
import sklearn.exceptions
import sklearn.model_selection


class Stacker(object):
    """
    A transformer fitting a predictor `pred` to data in a way
        that will allow a higher-up predictor to build a model utilizing both this
        and other predictors correctly.

    The fit_transform(self, x, y) of this class will create a column matrix, each
        row of which contains the prediction of `pred` fitted on rows other than this one.
        This allows a higher-level predictor to correctly fit a model on this, and other
        column matrices obtained from other lower-level predictors.

    The fit(self, x, y) and transform(self, x_) methods will fit `pred` on all
        of `x`, and transform the output of `x_` (which is either `x` or not) using the fitted
        `pred`.

    Arguments:
        pred: A lower-level predictor to stack.

        cv_fn: Function taking `x`, and returning an iterable of (train, test) index
            pairs. In `fit_transform` these pairs will be iterated over. For each
            iteration, `pred` will be fitted to the `x` and `y` with rows corresponding
            to the train indices, and the test indices of the output will be obtained
            by predicting on the corresponding indices of `x`.
    """
    # Default: leave-one-out, using the sklearn.model_selection API
    # (the older sklearn.cross_validation module has been removed).
    def __init__(self, pred, cv_fn=lambda x: sklearn.model_selection.LeaveOneOut().split(x)):
        self._pred, self._cv_fn = pred, cv_fn

    def fit_transform(self, x, y):
        x_trans = self._train_transform(x, y)

        self.fit(x, y)

        return x_trans

    def fit(self, x, y):
        """
        Same signature as any sklearn transformer.
        """
        self._pred.fit(x, y)

        return self

    def transform(self, x):
        """
        Same signature as any sklearn transformer.
        """
        return self._test_transform(x)

    def _train_transform(self, x, y):
        x_trans = np.nan * np.ones((x.shape[0], 1))

        all_te = set()
        for tr, te in self._cv_fn(x):
            all_te = all_te | set(te)
            x_trans[te, 0] = self._pred.fit(x[tr, :], y[tr]).predict(x[te, :]) 
        if all_te != set(range(x.shape[0])):
            warnings.warn('Not all indices covered by Stacker', sklearn.exceptions.FitFailedWarning)

        return x_trans

    def _test_transform(self, x):
        return self._pred.predict(x)

Here is an example of the improvement over the setup described in @MaximHaytovich's answer.

First, some setup:

    import numpy as np
    from sklearn import linear_model
    from sklearn import ensemble
    from sklearn import metrics

    y = np.random.randn(100)
    x0 = (y + 0.1 * np.random.randn(100)).reshape((100, 1)) 
    x1 = (y + 0.1 * np.random.randn(100)).reshape((100, 1)) 
    x = np.zeros((100, 2))

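Note that `x0` and `x1` are just noisy versions of `y`. We'll use the first 80 rows for training and the last 20 for testing.

There are two predictors: a higher-variance gradient booster and a linear predictor.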

    g = ensemble.GradientBoostingRegressor()
    l = linear_model.LinearRegression()

Here is the methodology suggested in the answer:

    g.fit(x0[: 80, :], y[: 80])
    l.fit(x1[: 80, :], y[: 80])

    x[:, 0] = g.predict(x0)
    x[:, 1] = l.predict(x1)

    >>> metrics.r2_score(
        y[80: ],
        linear_model.LinearRegression().fit(x[: 80, :], y[: 80]).predict(x[80: , :]))
    0.940017788444

Now, using stacking:

    x[: 80, 0] = Stacker(g).fit_transform(x0[: 80, :], y[: 80])[:, 0]
    x[: 80, 1] = Stacker(l).fit_transform(x1[: 80, :], y[: 80])[:, 0]

    u = linear_model.LinearRegression().fit(x[: 80, :], y[: 80])

    x[80: , 0] = Stacker(g).fit(x0[: 80, :], y[: 80]).transform(x0[80:, :])
    x[80: , 1] = Stacker(l).fit(x1[: 80, :], y[: 80]).transform(x1[80:, :])

    >>> metrics.r2_score(
        y[80: ],
        u.predict(x[80:, :]))
    0.992196564279

The stacked prediction does better. It recognizes that the gradient booster is not that great.

score 10 · Accepted Answer

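OK, after spending some time googling "stacking" (as mentioned by @andreas earlier), I found out how to do the weighting in Python, even with scikit-learn. Consider the following.

I train a set of regression models (the SVR, LassoLars, and GradientBoostingRegressor mentioned above). Then I run all of them on the training data (the same data used to train each of the three regressors). I get predictions for the examples from each algorithm and save the three results in a pandas dataframe with columns 'predictedSVR', 'predictedLASSO', and 'predictedGBR'. Then I add a final column to this dataframe, which I call 'predicted', holding the real target value.

Then I train a linear regression on this new dataframe: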

#df - dataframe with results of 3 regressors and true output
from sklearn import linear_model
stacker= linear_model.LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])

So when I want to make a prediction for a new example, I run each of my three regressors separately and then call:

stacker.predict()

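on the outputs of my three regressors, and get a result.

The problem here is that I am finding the optimal weights for the regressors on average: the weights will be the same for every example I try to predict.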
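The whole flow described in this answer might look like the sketch below. The synthetic data, the hyperparameters, and the `base_predictions` helper are assumptions made for illustration; the column names follow the answer. It mirrors the answer's approach of fitting the stacker on in-sample predictions of the base models.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
X_new = rng.normal(size=(5, 3))

# Hypothetical base models; the hyperparameters are placeholders.
base_models = {
    "predictedSVR": SVR(),
    "predictedLASSO": LassoLars(alpha=0.01),
    "predictedGBR": GradientBoostingRegressor(random_state=0),
}
for model in base_models.values():
    model.fit(X_train, y_train)


def base_predictions(X):
    """Return one column per base model (hypothetical helper)."""
    return pd.DataFrame({name: m.predict(X) for name, m in base_models.items()})


df = base_predictions(X_train)
df["predicted"] = y_train  # the true target, named as in the answer

cols = list(base_models)
stacker = LinearRegression().fit(df[cols], df["predicted"])

# For new examples: run the three regressors, then the stacker on their outputs.
y_new = stacker.predict(base_predictions(X_new)[cols])
print(y_new)
```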

score 5 · Accepted Answer

What you describe is called "stacking", which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be in pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161
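For readers on a more recent scikit-learn: later releases (0.22 onward, to the best of my knowledge) do ship a `StackingRegressor` in `sklearn.ensemble`. A minimal sketch, with synthetic data and placeholder hyperparameters as assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

stack = StackingRegressor(
    estimators=[
        ("svr", SVR()),
        ("lasso", LassoLars(alpha=0.01)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,               # the meta-model is trained on out-of-fold predictions
    passthrough=False,  # True would also feed the original features to the meta-model
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))  # R^2 on the held-out rows
```

With a linear `final_estimator` the learned weights are still the same for every example. Setting `passthrough=True` together with a non-linear final estimator is one way to let the combination depend on the example's features, though whether that helps depends on the data.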
