python - Using sklearn and Python for a large application classification/scraping exercise

Question

I am working on a relatively large text-based web classification problem and I am planning to use the multinomial Naive Bayes classifier from sklearn in Python, with the scrapy framework for the crawling. However, I am a little concerned that sklearn/Python might be too slow for a problem that could involve classifying millions of websites. I have already trained the classifier on several thousand websites from DMOZ. The research framework is as follows:

1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). In a sample run of the crawler, the number of tokenized words varied from a few thousand up to 150K.
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.

My question is whether a Python-based classifier would be up to the task for such a large-scale application, or should I try rewriting the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If so, what might that environment be? Or perhaps Python is enough if accompanied by some parallelization of the code? Thanks

score 5 · Accepted Answer

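Use the HashingVectorizer and one of the linear classification modules that supports the partial_fit API (for instance SGDClassifier, Perceptron, or PassiveAggressiveClassifier) to learn the model incrementally, without having to vectorize and load all the data in memory up front. This works for documents with hundreds of thousands of (hashed) features.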
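A minimal sketch of that incremental loop, assuming a current scikit-learn release. The `iter_batches` generator and its toy texts and labels are hypothetical stand-ins for batches of scraped site text; in practice each batch would be read from disk or from the crawler.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_batches():
    # Hypothetical toy batches; real batches would stream from the crawler.
    yield (["cheap flights and hotel deals", "football scores and league tables"],
           ["travel", "sports"])
    yield (["book a beach holiday resort", "tennis match results today"],
           ["travel", "sports"])


# HashingVectorizer is stateless: there is no vocabulary to fit, so every
# batch can be transformed on its own without seeing the rest of the data.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier(random_state=0)

# partial_fit needs the complete set of labels on its first call.
all_classes = ["sports", "travel"]

for texts, labels in iter_batches():
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["hotel deals for a beach holiday"]))
```

Only one batch is held in memory at a time, which is what lets the same loop run over far more documents than would fit in RAM.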

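You should, however, load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class from the master branch. You can also fine-tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV, either on that subsample or on a larger, pre-vectorized dataset that fits in memory (e.g. a few million documents).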
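A rough sketch of such a search on an in-memory subsample. It assumes a current scikit-learn release, where RandomizedSearchCV is imported from `sklearn.model_selection`. The toy texts, the step names `vec` and `clf`, and the candidate parameter values are illustrative assumptions, not recommendations.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical subsample; a real one would hold on the order of 100k documents.
texts = [
    "cheap flights and hotel deals", "book a beach holiday resort",
    "city break travel guide", "airline tickets and car hire",
    "football scores and league tables", "tennis match results today",
    "basketball playoffs schedule", "cricket world cup highlights",
]
labels = ["travel"] * 4 + ["sports"] * 4

pipeline = Pipeline([
    ("vec", HashingVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Candidate values for vectorizer settings and the regularization strength.
param_distributions = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__n_features": [2**16, 2**18],
    "clf__alpha": [1e-6, 1e-5, 1e-4, 1e-3],
}

search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=4, cv=2, random_state=0
)
search.fit(texts, labels)
best = search.best_params_
```

The parameters in `best` can then be reused when training incrementally on the full dataset.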

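Linear models can also be averaged (average the coef_ and intercept_ of two linear models), so you can partition the dataset, learn linear models independently, and then average them to get the final model.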
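The averaging step could look roughly like this. The two toy partitions and the helper `fit_partition` are hypothetical, and the sketch assumes that every partition contains the same set of classes, so the rows of `coef_` line up between models.

```python
import copy

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A shared, stateless vectorizer keeps feature indices identical across partitions.
vectorizer = HashingVectorizer(n_features=2**10)

# Hypothetical partitions of the training data.
part_a = (["cheap flights and hotel deals", "football scores and league tables"],
          ["travel", "sports"])
part_b = (["book a beach holiday resort", "tennis match results today"],
          ["travel", "sports"])


def fit_partition(texts, labels):
    # Train one linear model on a single partition.
    clf = SGDClassifier(random_state=0)
    clf.fit(vectorizer.transform(texts), labels)
    return clf


model_a = fit_partition(*part_a)
model_b = fit_partition(*part_b)

# Start from a copy of one fitted model and overwrite its learned parameters
# with the element-wise mean of both models.
averaged = copy.deepcopy(model_a)
averaged.coef_ = (model_a.coef_ + model_b.coef_) / 2.0
averaged.intercept_ = (model_a.intercept_ + model_b.intercept_) / 2.0

pred = averaged.predict(vectorizer.transform(["hotel deals for a beach holiday"]))
```

Because each partition is fitted independently, the `fit_partition` calls can run on separate processes or machines before the final averaging.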

score 3 · Accepted Answer

Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be the bottleneck, because the most critical parts of those libraries are implemented as C extensions.

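However, since you are scraping millions of sites, you will be limited by the capabilities of a single machine. I would consider using a service such as PiCloud [1] or Amazon Web Services (EC2) to distribute the workload across many servers.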

One example would be to funnel your scraping through Cloud Queues [2].
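The answer does not show the Cloud Queues API, so the sketch below illustrates only the general queue-and-workers pattern, using Python's standard library on a single machine as a stand-in. The `fetch_text` function is a hypothetical placeholder for the real scraping step, and the domains are examples.

```python
import queue
import threading


def fetch_text(domain):
    # Hypothetical placeholder for the real scraping step (e.g. a scrapy crawl).
    return "text scraped from " + domain


def worker(tasks, results):
    # Pull domains off the shared queue until a None sentinel arrives.
    while True:
        domain = tasks.get()
        if domain is None:
            break
        results.put((domain, fetch_text(domain)))


tasks = queue.Queue()
results = queue.Queue()
n_workers = 4

threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(n_workers)]
for t in threads:
    t.start()

domains = ["example.com", "example.org", "example.net"]
for d in domains:
    tasks.put(d)
for _ in threads:
    tasks.put(None)  # one sentinel per worker so each thread exits
for t in threads:
    t.join()

# Drain the results once all workers have finished.
scraped = {}
while not results.empty():
    domain, text = results.get()
    scraped[domain] = text
```

A hosted queue service plays the role of `tasks` and `results` here, with the workers running on separate servers instead of threads.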

[1] http://www.picloud.com

[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/
