python - Text feature extraction using scikit-learn

Question

I am using the scikit-learn package to extract features from a corpus. My code is as follows:

#! /usr/bin/python -tt

from __future__ import division
import re
import random
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from nltk.cluster.util import cosine_distance
from operator import itemgetter

def preprocess(fnin, fnout):
  fin = open(fnin, 'rb')
  fout = open(fnout, 'wb')
  buf = []
  id = ""
  category = ""
  for line in fin:
    line = line.strip()

    if line.find("-- Document Separator --") > -1:
      if len(buf) > 0:
        # write out body,
        body = re.sub(r"\s+", " ", " ".join(buf))
        fout.write("%s\t%s\t%s\n" % (id, category, body))
      # process next header and init buf
      id, category, rest = map(lambda x: x.strip(), line.split(": "))
      buf = []
    else:
      # process body
      buf.append(line)
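  # flush the last document; no separator line follows it
  if len(buf) > 0:
    body = re.sub(r"\s+", " ", " ".join(buf))
    fout.write("%s\t%s\t%s\n" % (id, category, body))
  fin.close()
  fout.close()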

def train(fnin):
  docs = []
  cats = []
  fin = open(fnin, 'rb')
  for line in fin:
    id, category, body = line.strip().split("\t")
    docs.append(body)
    cats.append(category)
  fin.close()
  v=CountVectorizer(min_df=1,stop_words="english")
  pipeline = Pipeline([
    ("vect", v),
    ("tfidf", TfidfTransformer(use_idf=False))])
  tdMatrix = pipeline.fit_transform(docs, cats)
  return tdMatrix, cats


def main():
  preprocess("corpus.txt", "sccpp.txt")
  tdMatrix, cats = train("sccpp.txt")

if __name__ == "__main__":
  main()

My corpus (in abbreviated form), corpus.txt:

0: sugar: -- Document Separator -- reut2-021.sgm
British Sugar Plc was forced to shut its
Ipswich sugar factory on Sunday afternoon due to an acute
shortage of beet supplies, a spokesman said, responding to a
Reuter inquiry
    Beet supplies have dried up at Ipswich due to a combination
of very wet weather, which has prevented most farmers in the
factory's catchment area from harvesting, and last week's
hurricane which blocked roads.
    The Ipswich factory will remain closed until roads are
cleared and supplies of beet build up again.
    This is the first time in many years that a factory has
been closed in mid-campaign, the spokesman added.
    Other factories are continuing to process beet normally,
but harvesting remains very difficult in most areas.
    Ipswich is one of 13 sugar factories operated by British
Sugar. It processes in excess of 500,000 tonnes of beet a year
out of an annual beet crop of around eight mln tonnes.
    Despite the closure of Ipswich and the severe harvesting
problems in other factory areas, British Sugar is maintaining
its estimate of sugar production this campaign at around

The error message is:

v=CountVectorizer(min_df=1,stop_words="english")
TypeError: __init__() got an unexpected keyword argument 'min_df'

I am using Python 2.7.4 on Linux Mint. Can anyone advise me on how to solve this problem? Thanks in advance.

score 4 · Accepted Answer

The installed scikit-learn does not recognize the min_df keyword, so you need a newer scikit-learn version. Get rid of the one from Mint:

sudo apt-get remove python-sklearn

Install the packages needed to build a newer version:

sudo apt-get install python-numpy-dev python-scipy-dev python-pip

Then fetch the latest release and build it with pip:

sudo pip install scikit-learn
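
After reinstalling, it can help to confirm that the interpreter you run actually picks up the new version. The following is only a minimal sketch, not part of the original answer: the helper names accepts_keyword and report are hypothetical, and it assumes Python 3 (for inspect.signature), whereas the question uses Python 2.7. It checks whether a constructor declares a given keyword and, if scikit-learn is importable, reports the installed version and whether CountVectorizer accepts min_df.

```python
import inspect


def accepts_keyword(cls, name):
    """Return True if the constructor of cls accepts `name` as a keyword.

    Hypothetical helper for illustration: it inspects the constructor
    signature instead of calling it, so nothing is instantiated.
    """
    params = inspect.signature(cls).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return name in params or takes_var_kwargs


def report():
    """Describe the scikit-learn seen by this interpreter, if any."""
    try:
        import sklearn
        from sklearn.feature_extraction.text import CountVectorizer
    except ImportError:
        return "scikit-learn is not importable from this interpreter"
    supported = accepts_keyword(CountVectorizer, "min_df")
    return "scikit-learn %s, min_df supported: %s" % (
        sklearn.__version__,
        supported,
    )


if __name__ == "__main__":
    print(report())
```

If the report still says min_df is unsupported, one possible cause is that pip installed into a different Python than the one running the script; comparing the version printed here with the one pip reports is a quick way to check.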
