python - Predicting a classification with words that are not in the training set (Naive Bayes)

Question

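I built a Naive Bayes model that predicts whether a result is "negative" or "positive". My problem comes when I run the model on a new data set that contains some words the model has not seen. The error I get when predicting on the new data set is:

ValueError: Expected input with 6 features, got 4 instead

I have read that I need to add a Laplace smoother to the model, but the default alpha for BernoulliNB() is already 1. What else can I do to fix the error? Thank you.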

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

# training data
comments = ['happy', 'sad', 'this is negative', 'this is positive', 'i like this', 'why do i hate this']
classes = ['positive', 'negative', 'negative', 'positive', 'positive', 'negative']

# build the term-count matrix for the comments, then weight it with TF-IDF
stop = stopwords.words('english')
count_vectorizer = CountVectorizer(analyzer='word', stop_words=stop, ngram_range=(1, 3))
comments = count_vectorizer.fit_transform(comments)
tfidf_comments = TfidfTransformer(use_idf=True).fit_transform(comments)

# split for validation: 80% training, 20% test
data_train, data_test, target_train, target_test = train_test_split(tfidf_comments, classes, test_size=0.2, random_state=43)
classifier = BernoulliNB().fit(data_train, target_train)

# new data
comments_new = ['positive', 'zebra', 'george', 'nothing']
comments_new = count_vectorizer.fit_transform(comments_new)
tfidf_comments_new = TfidfTransformer(use_idf=True).fit_transform(comments_new)

# this call raises the ValueError shown above
classifier.predict(tfidf_comments_new)
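
A likely cause, judging from the code above, is that the new comments go through `fit_transform`, which builds a fresh vocabulary from the new data, so the feature count no longer matches what the classifier was trained on. A minimal sketch of one way around this is to fit the vectorizer and the TF-IDF transformer once on the training comments and call only `transform` on new data. Two assumptions here are not in the original: scikit-learn's built-in `stop_words='english'` list replaces the NLTK list so that no corpus download is needed, and the train/test split is left out for brevity.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import BernoulliNB

train_comments = ['happy', 'sad', 'this is negative', 'this is positive',
                  'i like this', 'why do i hate this']
train_classes = ['positive', 'negative', 'negative', 'positive',
                 'positive', 'negative']

# Assumption: scikit-learn's built-in English stop-word list stands in for
# the NLTK list used in the question.
count_vectorizer = CountVectorizer(analyzer='word', stop_words='english',
                                   ngram_range=(1, 3))
tfidf_transformer = TfidfTransformer(use_idf=True)

# Fit the vocabulary and the IDF weights on the training comments only.
train_counts = count_vectorizer.fit_transform(train_comments)
train_tfidf = tfidf_transformer.fit_transform(train_counts)
classifier = BernoulliNB().fit(train_tfidf, train_classes)

# Reuse the fitted objects: transform() maps new text onto the training
# vocabulary, and words outside that vocabulary are simply ignored.
new_comments = ['positive', 'zebra', 'george', 'nothing']
new_counts = count_vectorizer.transform(new_comments)
new_tfidf = tfidf_transformer.transform(new_counts)

predictions = classifier.predict(new_tfidf)
print(new_tfidf.shape, predictions)
```

With this arrangement the new matrix has the same number of columns as the training matrix, so `predict` accepts it. Comments made up entirely of unseen words become all-zero rows, and the classifier then decides from the class priors and the smoothed absence probabilities of the known features.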
