machine-learning - How are feature importances and forest structure related in scikit-learn's RandomForestClassifier?

Question

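Here is a simple example of my problem, using the Iris dataset. I am puzzled when trying to understand how feature importances are computed, and how this shows up when visualizing the forest of estimators with export_graphviz. Here is my code: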

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

data = load_iris()
X = pd.DataFrame(data=data.data,columns=['sepallength', 'sepalwidth', 'petallength','petalwidth'])
y = pd.DataFrame(data=data.target)

# sklearn.cross_validation was removed in newer releases; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=2,max_depth=1)
rf.fit(X_train,y_train.iloc[:,0])

Since the forest contains only 2 trees of depth 1, the classifier performs poorly (the score is 0.68). That does not matter here, though.

The feature importances are retrieved as follows:

importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_],axis=0)
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))

The output is:

Feature ranking:
1. feature sepallength (1.000000)
2. feature sepalwidth (0.000000)
3. feature petallength (0.000000)
4. feature petalwidth (0.000000)

Now, when I display the structure of the trees that were built, using the following code:

from sklearn.tree import export_graphviz
export_graphviz(rf.estimators_[0],
                out_file='tree.dot',  # written explicitly so the dot command below finds it
                feature_names=X.columns,
                filled=True,
                rounded=True)
!dot -Tpng tree.dot -o tree0.png
from IPython.display import Image
Image('tree0.png')

I get these two figures:

Export of tree #0:

Export of tree #1:

I cannot understand how sepallength can have importance = 1 when, as the figures show, it is not used for node splitting in either tree (only petallength is used).

score 3 · Accepted Answer

There is a bug in this loop:

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))

If you want to sort by indices = np.argsort(importances)[::-1], then you need to sort everything. Do not keep the labels in one order (the original column order) and the importances in another (sorted order).
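To make the mismatch concrete, here is a minimal pure-Python sketch. The importance values are illustrative placeholders (only the feature at index 2 is nonzero), and sorted() stands in for np.argsort, so the order of tied entries may differ from NumPy's.

```python
# Illustrative importances, aligned with the column order (placeholder values).
names = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
importances = [0.0, 0.0, 1.0, 0.0]

# Indices sorted by decreasing importance
# (a plain-Python stand-in for np.argsort(importances)[::-1]).
indices = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)

# Buggy pairing: the label uses column order, the value uses sorted order.
buggy = [(names[f], importances[indices[f]]) for f in range(len(names))]

# Consistent pairing: label and value are both looked up through indices.
ranked = [(names[i], importances[i]) for i in indices]

print(buggy[0])   # ('sepallength', 1.0)
print(ranked[0])  # ('petallength', 1.0)
```

The buggy pairing attaches the largest importance to whatever column happens to come first, which is how sepallength ends up reported with 1.0.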

If you replace the buggy print loop with the following, which keeps both labels and importances in column order:

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[f]))

then the forest and all of its trees agree that the only feature of any importance is the one at index 2 (petallength).
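If you want to check this against the trees themselves, something like the sketch below may help. It assumes a reasonably recent scikit-learn, fits on the full Iris data rather than the train split, and fixes random_state=0 for repeatability (none of which is in the question's code). As far as I know, the forest's feature_importances_ is an average of the per-tree values, which is what the comparison at the end checks.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
feature_names = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']

# Same tiny forest as in the question; random_state is an added assumption.
rf = RandomForestClassifier(n_estimators=2, max_depth=1, random_state=0)
rf.fit(data.data, data.target)

# Feature used at the root (node 0) of each depth-1 tree.
# Assumes every tree actually splits, which holds for Iris here.
root_features = [feature_names[tree.tree_.feature[0]] for tree in rf.estimators_]

# Per-tree importances, compared with the forest-level value.
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
matches_mean = np.allclose(rf.feature_importances_, per_tree.mean(axis=0))

# Rank labels and importances through the same index array.
indices = np.argsort(rf.feature_importances_)[::-1]
ranking = [(feature_names[i], float(rf.feature_importances_[i])) for i in indices]

for rank, (name, importance) in enumerate(ranking, start=1):
    print("%d. feature %s (%f)" % (rank, name, importance))
print("root split features:", root_features)
print("forest importances match mean of tree importances:", matches_mean)
```

With depth-1 trees, each tree puts all of its importance on its single split feature, so any feature that never appears at a root should come out with importance 0.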
