python - How to adjust scaled scikit-learn logistic regression coefficients to score an unscaled dataset?

Question

I am currently building a model with Scikit-Learn's LogisticRegression. I used

from sklearn import preprocessing
scaler=preprocessing.StandardScaler().fit(build)
build_scaled = scaler.transform(build)

to scale all of my input variables before training the model. Everything works fine and produces a decent model, but my understanding is that the coefficients produced by LogisticRegression.coef_ are based on the scaled variables. Is there a transformation I can apply to those coefficients so that they can be used on unscaled data?

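I am looking ahead to implementing the model in a production system, and I am trying to determine whether all of the variables need to be pre-processed in some way in production before the model can score them.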

Note: the model will most likely have to be re-coded within the production environment, and that environment does not use Python.

score 6 · Accepted Answer

You need to divide by the scaling you applied to normalise the feature, but also multiply by the scaling you applied to the target.

Suppose

each feature variable x_i was scaled (divided) by scale_x_i
the target variable was scaled (divided) by scale_y

then

orig_coef_i = coef_i_found_on_scaled_data / scale_x_i * scale_y
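
As a quick sanity check of the formula above, here is a minimal sketch on synthetic data. The data, the choice of the standard deviation as the scale factor, and the variable names are assumptions made for illustration. It only divides by a scale (no centering), which is the case the formula describes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not from the original answer)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 0.1])
y = X @ np.array([2.0, -0.5, 30.0]) + rng.normal(scale=0.1, size=200)

# Hypothetical scale factors: divide each feature and the target by its std
scale_x = X.std(axis=0)
scale_y = y.std()

# Fit once on the raw data and once on the scaled data
orig_coef = LinearRegression().fit(X, y).coef_
scaled_coef = LinearRegression().fit(X / scale_x, y / scale_y).coef_

# orig_coef_i = coef_i_found_on_scaled_data / scale_x_i * scale_y
recovered_coef = scaled_coef / scale_x * scale_y
print(np.allclose(recovered_coef, orig_coef))
```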

Here's an example using pandas and sklearn LinearRegression.

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

import numpy as np
import pandas as pd

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
boston = load_boston()
# Looking at the description of the data tells us the target variable name
# print(boston.DESCR)
data = pd.DataFrame(
    data = np.c_[boston.data, boston.target],
    columns = list(boston.feature_names) + ['MVAL'],
)
data.head()

X = boston.data
y = boston.target

lr = LinearRegression()
lr.fit(X,y)

orig_coefs = lr.coef_

coefs1 = pd.DataFrame(
    data={
        'feature': boston.feature_names, 
        'orig_coef' : orig_coefs, 
    }
)
coefs1

This gives us the coefficients of a linear regression with no scaling applied.

#  | feature| orig_coef
# 0| CRIM   | -0.107171
# 1| ZN     |  0.046395
# 2| INDUS  |  0.020860
# etc

We now normalise all of our variables

# Now we normalise the data
scalerX = StandardScaler().fit(X)
scalery = StandardScaler().fit(y.reshape(-1,1)) # Have to reshape to avoid warnings

normed_X = scalerX.transform(X)
normed_y = scalery.transform(y.reshape(-1,1)) # Have to reshape to avoid warnings

normed_y = normed_y.ravel() # Turn y back into a vector again

# Check it's worked (inspect the normalised arrays, not the raw ones)
# print(np.mean(normed_X, axis=0), np.mean(normed_y, axis=0)) # Should be 0s
# print(np.std(normed_X, axis=0), np.std(normed_y, axis=0))   # Should be 1s

We can do the regression again on this normalised data...

# Now we redo our regression
lr = LinearRegression()
lr.fit(normed_X, normed_y)

coefs2 = pd.DataFrame(
    data={
        'feature' : boston.feature_names,
        'orig_coef' : orig_coefs,
        'norm_coef' : lr.coef_,
        'scaleX' : scalerX.scale_,
        'scaley' : scalery.scale_[0],
    },
    columns=['feature', 'orig_coef', 'norm_coef', 'scaleX', 'scaley']
)
coefs2

...and then apply our scaling to get back our original coefficients

# We can recreate our original coefficients by dividing by the
# scale of the feature (scaleX) and multiplying by the scale
# of the target (scaleY)
coefs2['rescaled_coef'] = coefs2.norm_coef / coefs2.scaleX * coefs2.scaley
coefs2

When we do this we see that we have recreated our original coefficients.

#  | feature| orig_coef| norm_coef|    scaleX|   scaley| rescaled_coef
# 0| CRIM   | -0.107171| -0.100175|  8.588284| 9.188012| -0.107171
# 1| ZN     |  0.046395|  0.117651| 23.299396| 9.188012|  0.046395
# 2| INDUS  |  0.020860|  0.015560|  6.853571| 9.188012|  0.020860
# 3| CHAS   |  2.688561|  0.074249|  0.253743| 9.188012|  2.688561

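For some machine learning methods, the target variable y must be normalised as well as the feature variables x. If you've done that, you need to include this "multiply by the scale of y" step as well as the "divide by the scale of X_i" step to get back the original regression coefficients.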
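
The example above only recovers the coefficients. If the intercept is needed as well, it can be recovered by undoing the centering that StandardScaler applies to both X and y. The sketch below is one way to do that on synthetic data; the data and the names rescaled_intercept, lr_norm and lr_orig are assumptions for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with non-zero means (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(200, 3))
y = X @ np.array([2.0, -0.5, 30.0]) + 7.0 + rng.normal(scale=0.1, size=200)

scalerX = StandardScaler().fit(X)
scalery = StandardScaler().fit(y.reshape(-1, 1))
normed_X = scalerX.transform(X)
normed_y = scalery.transform(y.reshape(-1, 1)).ravel()

lr_norm = LinearRegression().fit(normed_X, normed_y)

# Coefficients: divide by the feature scale, multiply by the target scale
rescaled_coef = lr_norm.coef_ / scalerX.scale_ * scalery.scale_[0]

# Intercept: undo the target centering/scaling, then remove the feature-mean term
rescaled_intercept = (
    scalery.mean_[0]
    + scalery.scale_[0] * lr_norm.intercept_
    - np.dot(rescaled_coef, scalerX.mean_)
)

# Compare against a model fitted directly on the unscaled data
lr_orig = LinearRegression().fit(X, y)
print(np.allclose(rescaled_coef, lr_orig.coef_))
print(np.allclose(rescaled_intercept, lr_orig.intercept_))
```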

Hope that helps

score 3 · Accepted Answer

Short answer: to get the LogisticRegression coefficients and intercept for unscaled data (assuming binary classification, and that lr is a trained LogisticRegression object):

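you must divide your coefficient array element-wise by the scaler.scale_ array (available since v0.17): coefficients = np.true_divide(lr.coef_, scaler.scale_)
you must subtract from your intercept the inner product of the resulting coefficients (the division result) array with the scaler.mean_ array: intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)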
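
The two steps above can be checked with a short sketch like the following. The synthetic build and target arrays and the helper score_unscaled are assumptions for illustration. The helper uses only a dot product and a sigmoid, which is the arithmetic that would have to be re-coded in a non-Python production environment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data (an assumption for illustration)
rng = np.random.default_rng(0)
build = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 10.0, 0.1], size=(500, 3))
standardised = (build - build.mean(axis=0)) / build.std(axis=0)
true_logits = standardised @ np.array([1.0, -2.0, 0.5])
target = (rng.random(500) < 1.0 / (1.0 + np.exp(-true_logits))).astype(int)

# Train on scaled data, as in the question
scaler = StandardScaler().fit(build)
lr = LogisticRegression().fit(scaler.transform(build), target)

# Step 1: divide the coefficients by the feature scales
coefficients = np.true_divide(lr.coef_, scaler.scale_)           # shape (1, 3)
# Step 2: subtract the coefficient/mean inner product from the intercept
intercept = lr.intercept_ - np.dot(coefficients, scaler.mean_)   # shape (1,)

def score_unscaled(X):
    """Hypothetical helper: probability of class 1 from raw, unscaled features."""
    z = X @ coefficients.ravel() + intercept[0]
    return 1.0 / (1.0 + np.exp(-z))

# The adjusted model on raw data should match the original model on scaled data
expected = lr.predict_proba(scaler.transform(build))[:, 1]
print(np.allclose(score_unscaled(build), expected))
```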

You can see why the above needs to be done if you consider that every feature is normalised by subtracting its mean (stored in the scaler.mean_ array) and then dividing by its standard deviation (stored in the scaler.scale_ array).
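
Written out, the reasoning looks roughly like this. The symbols are introduced here for illustration: w_i and b stand for lr.coef_ and lr.intercept_, and mu_i and sigma_i for scaler.mean_ and scaler.scale_. The linear score computed on scaled features can be regrouped into a linear score on the raw features:

```latex
z = b + \sum_i w_i \, \frac{x_i - \mu_i}{\sigma_i}
  = \underbrace{\left( b - \sum_i \frac{w_i}{\sigma_i} \, \mu_i \right)}_{\text{adjusted intercept}}
  + \sum_i \underbrace{\frac{w_i}{\sigma_i}}_{\text{adjusted coefficient}} x_i
```

Since the predicted probability depends on the features only through z, a model using the adjusted coefficients and intercept on unscaled data should give the same scores as the original model on scaled data.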
