python - N*N行列を生成するための効率的なアルゴリズム

Question

私は単純な協調フィルタリング（CF）を行っています。アイテム間CFです。たとえば、N個のアイテムを含む巨大な辞書があります。ここで、キーは製品の名前であり、値はそれらを購入した顧客のリストです。

d={
item1:[customer1,customer3,customer7], 
item2:[customer3, customer5],
...
itemN:[customerX...customerY]
}

また、各アイテム間の顧客の類似度を計算する小さな関数もあります。たとえば、item1とitem2：

def littlefunction(...):

    #convert them to a set
    item1=set(d['item1'])
    item2=set(d['item2'])

    commonCustomer=item1.intersect(item2)
    totalCustomer=item1.union(item2)

    similarity=float(len(commonCustomer))/len(totalCustomer)

指定された各アイテムの上位の類似アイテムを取得するには、スキャンしてN回の類似度を計算してから、並べ替える必要があります。したがって、N個のアイテムの場合、複雑さはO(N*N)です。

各アイテムの実行時間は2分になりました（Nは約300万）。完全なN*N類似性テーブルを生成するには、100,000時間かかります。このブルートフォースアプローチよりも優れたアルゴリズムはありますか？各アイテムに必要な結果は上位数件のみです。

score 4 · Accepted Answer

次のような転置インデックスを作成します。

customer1: [item1, item3, item8, ...]
customer2: [item7, item8, item74, ...]

次に、次のことができます。

アイテムを検索して、それを購入した顧客のリストを取得します
各顧客を調べて、顧客が購入したアイテムのリストを取得します

アイテムごとの時間は2分から2秒未満になります。

その2番目のインデックスにはより多くのメモリが必要ですが、データを複製しているわけではありません。また、メモリに問題がある場合は、これを単純なデータベースに保存しても、現在使用しているN^2アルゴリズムよりもはるかに高速です。

より詳しく

任意の2つのアイテム間の類似性を示すN*Nマトリックスを作成するとします。私のテクニックを使用して、次のことを行います。

Create an N*N matrix, and initialize it to 0.
for each item
  Get the list of customers who bought the item (from your item-to-customer index).
  Create an empty dictionary of related items
  for each customer in that list
    for each item that the customer bought
      update the dictionary (add new item, or increase count)
    end for
  end for
  You now have a dictionary that contains the related items,
  and how many customers bought each one. You can update the matrix row
  for the current item from that dictionary.
end for

score 1 · Accepted Answer

これは基本的に@JimMischelの答えと同じ考えであり、明示的な逆インデックスを作成しますが、明示的なNxNマトリックスを作成せず、システム全体のスケッチではなく実装です。

多数のNUM_CUSTOMERS（たとえば、1,000,000）の場合、パフォーマンスはあまり良くありません（3つのsimiliarty_forチェックで約20秒）が、データはランダムです。おそらく、実際の顧客の購入データには、より多くの相関関係とより少ないランダム性があります。

まず、テスト用の合成データを作成します。

import random
import operator

NUM_CUSTOMERS = 10000
NUM_ITEMS = 1000
MIN_PER_CUST = 2
MAX_PER_CUST = 10
NUM_INTEREST = 10

customers = ["customer_{}".format(num) for num in xrange(1, NUM_CUSTOMERS+1)]
items = ["item_{}".format(num) for num in xrange(1, NUM_ITEMS+1)]

customer_items = {
    customer: random.sample(items, random.randint(MIN_PER_CUST, MAX_PER_CUST))
    for customer in customers}

item_customers = {}
for customer, this_items in customer_items.iteritems():
    for item in this_items:
        item_customers.setdefault(item, [])
        item_customers[item].append(customer)

次に、いくつかの関数を定義します。

def similarity(item1, item2):
    item1_set = set(item_customers[item1])
    item2_set = set(item_customers[item2])

    num_common = len(item1_set.intersection(item2_set))
    num_total = len(item1_set.union(item2_set))

    return float(num_common) / num_total

def similarity_for(item):
    to_check = {
        itm for customer in item_customers[item]
            for itm in customer_items[customer] }
    to_check.discard(item)
    rankings = {
        item2: similarity(item, item2)
        for item2 in to_check }
    return sorted(rankings.iteritems(), key=operator.itemgetter(1), reverse=True)

そして、それを実行します：

for index, item in enumerate(sorted(random.sample(items, 3))):
    rankings = similarity_for(item)
    print "check {}; {} (of {}):".format(index, item, len(rankings))
    for item_other, ranking in rankings[:NUM_INTEREST]:
        print "  {}: {}".format(item_other, ranking)

次のような結果を取得します。

% python /tmp/similarity.py
check 0; item_121 (of 309):
  item_520: 0.0283018867925
  item_361: 0.027027027027
  item_536: 0.0265486725664
  item_637: 0.0238095238095
  item_515: 0.020202020202
  item_750: 0.019801980198
  item_960: 0.0192307692308
  item_25: 0.0190476190476
  item_548: 0.018691588785
  item_841: 0.018691588785
check 1; item_714 (of 298):
  item_162: 0.0285714285714
  item_491: 0.0272727272727
  item_617: 0.0265486725664
  item_949: 0.0260869565217
  item_788: 0.0192307692308
  item_446: 0.0190476190476
  item_558: 0.018691588785
  item_606: 0.0181818181818
  item_177: 0.0181818181818
  item_577: 0.018018018018
check 2; item_991 (of 352):
  item_758: 0.0298507462687
  item_204: 0.025
  item_85: 0.0247933884298
  item_769: 0.024
  item_501: 0.0232558139535
  item_860: 0.0227272727273
  item_408: 0.0225563909774
  item_480: 0.0223880597015
  item_73: 0.0220588235294
  item_593: 0.021897810219

python - N*N行列を生成するための効率的なアルゴリズム

2 に答える 2

より詳しく

Related

Reference