python-2.7 - Reading documents from files when using sklearn.feature_extraction.text CountVectorizer

Question

I can use the code as in the documentation example, where the input to the fit_transform() function is a list of sentences, i.e.:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
   'this is the first document',
   'this is the second second document',
   'and the third one',
   'is this the first document?'
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

and I get the expected data. However, I run into trouble when I try to replace the corpus with a list of files or file objects, as the documentation suggests:

" fit(raw_documents, y=None)

Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters :    
raw_documents : iterable
    An iterable which yields either str, unicode or file objects.
Returns :   
self :

"

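So I think I am missing something in my understanding of the pipeline. Given a directory of files that I want to CountVectorize, how should I do it? When I try to feed in a list of file objects such as [open(file,'r')], I get an error message saying that the file object has no lower function.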

score 5 · Accepted Answer

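Set the vectorizer's input constructor parameter to either 'filename' or 'file'. The default value is 'content', which assumes you have already read the files into memory. That is why a list of file objects fails with the default: each item is treated as a string and lower() is called on it.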
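As a rough sketch of what that looks like for a directory of files (the temporary directory, the file names doc1.txt to doc3.txt and their contents are made up here so the snippet is self-contained; point paths at your own directory instead):

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus directory, created only so this sketch runs on its own.
corpus_dir = tempfile.mkdtemp()
documents = {
    "doc1.txt": "this is the first document",
    "doc2.txt": "this is the second second document",
    "doc3.txt": "and the third one",
}
for name, text in documents.items():
    with open(os.path.join(corpus_dir, name), "w") as handle:
        handle.write(text)

# Sort the paths so the row order of the matrix is predictable.
paths = sorted(os.path.join(corpus_dir, name) for name in os.listdir(corpus_dir))

# input='filename': the vectorizer opens and reads each path itself.
vectorizer = CountVectorizer(input="filename")
X = vectorizer.fit_transform(paths)

print(X.shape)                                 # (3, 9)
print(X[1, vectorizer.vocabulary_["second"]])  # 2

# input='file': pass already opened file objects; the vectorizer calls read() on them.
handles = [open(path) for path in paths]
try:
    X_file = CountVectorizer(input="file").fit_transform(handles)
finally:
    for handle in handles:
        handle.close()
```

With input='filename' you never open the files yourself, which is probably the simplest option for a whole directory. input='file' is useful when you already hold open file objects, but then closing them is up to you.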
