scikit-learn - Feature selection for multilabel classification (scikit-learn)

Question

I'm trying to do feature selection with the chi-square method in scikit-learn (sklearn.feature_selection.SelectKBest). When I apply it to a multilabel problem, I get this warning:

UserWarning: Duplicate scores. Result may depend on feature ordering.There are probably duplicate features, or you used a classification score for a regression task.
  warn("Duplicate scores. Result may depend on feature ordering."

Why is it appearing, and how do I properly apply feature selection in this case?

score 5 · Accepted Answer

The code is warning you that some features have exactly the same score, so it may have to break ties arbitrarily.
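As a rough illustration of where tied scores come from, the sketch below scores a tiny made-up count matrix with chi2. Two of its columns are identical, so they receive the same score. The data and variable names are assumptions for illustration and are not taken from the question.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy non-negative count data (chi2 requires non-negative features).
# Columns 0 and 1 are deliberately identical.
X = np.array([[1, 1, 0],
              [2, 2, 1],
              [0, 0, 3],
              [1, 1, 2]])
y = np.array([0, 0, 1, 1])

scores, p_values = chi2(X, y)

# The duplicated columns get the same chi-square score, so a selector
# that keeps only one of them has to pick between them arbitrarily.
print(scores)
tied = bool(np.isclose(scores[0], scores[1]))
print("columns 0 and 1 tie:", tied)
```

If the cut-off k falls between two tied features, which one is kept depends on their order in X, which is what the warning text refers to.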

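That said, feature selection does not really work out of the box for multilabel problems. The best you can currently do is combine feature selection and a classifier in a pipeline, then pass that pipeline to a multilabel meta-estimator. Example (untested):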

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('svm', LinearSVC())])
multi_clf = OneVsRestClassifier(clf)

(I believe this warning is issued even when the tied features are not actually the k-th and (k+1)-th ones. It can usually be safely ignored.)
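A minimal runnable sketch of the pipeline from this answer, with the imports filled in. The synthetic data from make_multilabel_classification and the smaller k=10 are assumptions chosen so the example fits a toy problem. They are not part of the original answer.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic multilabel data (assumption); the features are non-negative
# counts, which chi2 requires.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=50, n_classes=4, random_state=0
)

# k=10 is an arbitrary choice for this toy data (the answer uses k=1000).
clf = Pipeline([("chi2", SelectKBest(chi2, k=10)),
                ("svm", LinearSVC())])
multi_clf = OneVsRestClassifier(clf)
multi_clf.fit(X, Y)

# One pipeline is fitted per label, each with its own selected features.
selected_per_label = [
    est.named_steps["chi2"].get_support(indices=True)
    for est in multi_clf.estimators_
]
for label, selected in enumerate(selected_per_label):
    print(label, selected)

pred = multi_clf.predict(X)  # label indicator matrix, one column per label
```

Because the selector sits inside the one-vs-rest wrapper, chi2 only ever sees a single binary label at a time, and each label can end up with a different feature subset.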

score 1 · Accepted Answer

I know the topic is a bit old, but the following works for me:

clf = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                ('lasso', OneVsRestClassifier(LogisticRegression()))])
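A hedged sketch of this variant on synthetic data. Here the selector runs once, outside the one-vs-rest wrapper, so every label shares the same feature subset. The sketch assumes chi2 accepts a label indicator matrix as y, as this answer implies. The data, k=10 and max_iter=1000 are assumptions for illustration. The step name 'lasso' is only a label: LogisticRegression uses an L2 penalty by default.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Synthetic multilabel data (assumption); non-negative counts for chi2.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=50, n_classes=4, random_state=0
)

# Feature selection is fitted once against the whole indicator matrix Y,
# then a one-vs-rest classifier is trained on the reduced features.
clf = Pipeline([
    ("chi2", SelectKBest(chi2, k=10)),
    ("lasso", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(X, Y)

# A single feature subset shared by all labels.
shared = clf.named_steps["chi2"].get_support(indices=True)
print(shared)

pred = clf.predict(X)  # label indicator matrix, one column per label
```

Compared with wrapping the whole pipeline in OneVsRestClassifier, this trades per-label feature subsets for a single selection step.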
