python - How do I resample text (imbalanced groups) in a pipeline?

Question

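I am trying to do text classification with MultinomialNB, but I am running into problems because my data is imbalanced. (The sample data below is simplified; my real data set is much larger.) I am trying to resample the data with oversampling, and ideally I would like to build that step into this pipeline.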

The pipeline below works fine without oversampling, but my real data needs it because it is heavily imbalanced.

With the current code, I keep getting the error "TypeError: All intermediate steps should be transformers and implement fit and transform".

How can I incorporate RandomOverSampler into this pipeline?

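import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

data = [['round red fruit that is sweet','apple'],['long yellow fruit with a peel','banana'],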
    ['round green fruit that is soft and sweet','pear'], ['red fruit that is common', 'apple'],
    ['tiny fruits that grow in bunches','grapes'],['purple fruits', 'grapes'], ['yellow and long', 'banana'],
    ['round, small, green', 'grapes'], ['can be red, green, or purple', 'grapes'], ['tiny fruits', 'grapes'], 
    ['small fruits', 'grapes']]

df = pd.DataFrame(data,columns=['Description','Type'])  

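# Features are the descriptions; labels are the fruit types.
X = df['Description']
y = df['Type']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# sklearn's Pipeline rejects RandomOverSampler because it has no transform method,
# which is what raises the TypeError quoted in the question.
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('RUS', RandomOverSampler()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(X_train, y_train)
y_pred = text_clf.predict(X_test)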

print('Score:',text_clf.score(X_test, y_test))
