apache - How can I make a Mahout recommender run faster?

Question

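Hello, Mahout community on SO!

I have a couple of questions about speeding up recommendation calculations. On my server I have Mahout installed without Hadoop, and JRuby is used for the recommendation script. The database has 3k users and 100k items (270k rows in the join table). When a user requests recommendations, a simple script starts working.

First, I establish a database connection using PGPoolingDataSource, like this: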

  connection = org.postgresql.ds.PGPoolingDataSource.new()
  connection.setDataSourceName("db_name")
  connection.setServerName("localhost")
  connection.setPortNumber(5432)
  connection.setDatabaseName("db_name")
  connection.setUser("mahout")
  connection.setPassword("password")
  connection.setMaxConnections(100)
  connection

I get this warning:

WARNING: You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.

Is there a way to fix it?

After that, I create the recommendations:

  model = PostgreSQLJDBCDataModel.new(
    connection,
    'stars',
    'user_id',
    'repo_id',
    'preference',
    'created_at'
  )

  similarity = TanimotoCoefficientSimilarity.new(model)
  neighborhood = NearestNUserNeighborhood.new(5, similarity, model)
  recommender = GenericBooleanPrefUserBasedRecommender.new(model, neighborhood, similarity)
  recommendations = recommender.recommend(user_id, 30)

Right now it takes about 5-10 seconds to generate recommendations for one user. The question is: how can I make recommendations faster (200 ms would be nice)?

score 7 · Accepted Answer

If you know you are using a pooling data source, you can ignore the warning. It only means that the implementation does not implement the usual interface for pooling implementations, ConnectionPoolDataSource.

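You will never make this run fast if you try to run it directly off the database; there is too much data access. Wrap the JDBCDataModel in a ReloadFromJDBCDataModel and the data will be cached in memory. That should work literally 100x faster.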
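A minimal JRuby sketch of that wrapping, assuming the Mahout and PostgreSQL JDBC jars are on the classpath and reusing the table and column names from the question. The helper names build_cached_recommender and cached_recommender and the data_source argument are illustrative, not part of Mahout.

```ruby
# Illustrative helper (not a Mahout API): builds the recommender on top of an
# in-memory copy of the preference table. Intended to run under JRuby.
def build_cached_recommender(data_source)
  require 'java'

  # JDBC-backed model, same arguments as in the question.
  jdbc_model = Java::OrgApacheMahoutCfTasteImplModelJdbc::PostgreSQLJDBCDataModel.new(
    data_source, 'stars', 'user_id', 'repo_id', 'preference', 'created_at'
  )

  # The wrapper loads the data into memory when it is constructed,
  # so the similarity and neighborhood lookups no longer query the database.
  model = Java::OrgApacheMahoutCfTasteImplModelJdbc::ReloadFromJDBCDataModel.new(jdbc_model)

  similarity = Java::OrgApacheMahoutCfTasteImplSimilarity::TanimotoCoefficientSimilarity.new(model)
  neighborhood = Java::OrgApacheMahoutCfTasteImplNeighborhood::NearestNUserNeighborhood.new(5, similarity, model)
  Java::OrgApacheMahoutCfTasteImplRecommender::GenericBooleanPrefUserBasedRecommender.new(
    model, neighborhood, similarity
  )
end

# Illustrative helper: build the recommender once per process and reuse it,
# otherwise every request pays for loading the whole table again.
def cached_recommender(data_source)
  @cached_recommender ||= build_cached_recommender(data_source)
end
```

With this in place, a request would call something like cached_recommender(connection).recommend(user_id, 30). The in-memory copy does not follow later changes in the database on its own; calling recommender.refresh(nil) from a periodic job is one way to reload it.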
