python - Efficient GROUP BY query on a numpy recarray

Question

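I have a dataset of product purchase logs with 6 columns: purchase_date, user_address, user_id, product_id, brand_id, retailer_id. All of them contain integers, except user_address, which is a string.

I need to get the top 5 brands selling the most items over the whole dataset, i.e. the brands with the most entries in the data.

In SQL, I believe it would look like the following (correct me if I'm wrong):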

SELECT brand_id, COUNT(*)
FROM data
GROUP BY brand_id

I tried doing this in python with a numpy recarray as follows:

items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])

brands = np.unique(data['brand_id'])    # array of unique brands
for i, brand in enumerate(np.nditer(brands)):     # For any unique brand
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))    # get the number of rows with the current brand

top5 = np.sort(items_sold_per_brand, order='count')[-5:]    # sort the array over the count values
print(top5[::-1])    # print the last five entries

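It works, but it takes about 15 seconds to run on a dataset of about 100000 rows containing about 12000 different brands, which seems too long. The for loop is what takes the most time.

Is there a more elegant and efficient way to do this, perhaps using numpy's recarray querying methods?

Thanks for your help!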

score 0 · Accepted Answer

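The nominated duplicate, "numpy: most efficient frequency counts for unique values in an array", is relevant, but I think it ignores some important issues in this code. The accepted answer there, bincount, is probably not useful here. And applying the newer unique return_counts answer may take some help.

A test script: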

import numpy as np
# simple data array
data=np.zeros((20,),dtype=[('brand_id',int),('id',int)])
data['brand_id']=np.random.randint(0,10,20)
data['id']=np.arange(20)

items_sold_per_brand = np.empty(len(data), dtype=[('brand_id', 'int'), ('count', 'int')])
brands = np.unique(data['brand_id']) 
print('brands',brands)
for i, brand in enumerate(np.nditer(brands)):
    items_sold_per_brand[i] = (brand, len(data[data['brand_id'] == brand]))
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1]) 

# a bit of simplification
brandids = data['brand_id']
brands = np.unique(brandids)
# only need space for the unique ids 
items_sold_per_brand = np.zeros(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id'] = brands
for i, brand in enumerate(brands):  # dont need nditer
    items_sold_per_brand['count'][i] = (brandids == brand).sum()
top5 = np.sort(items_sold_per_brand, order='count')[-5:]    
print('top5',top5[::-1])    

brands,counts = np.unique(data['brand_id'],return_counts=True)
print('counts',counts)

items_sold_per_brand = np.empty(len(brands), dtype=[('brand_id', 'int'), ('count', 'int')])
items_sold_per_brand['brand_id']=brands
items_sold_per_brand['count']=counts
tops = np.sort(items_sold_per_brand, order='count')[::-1]
print('tops',tops)

ii = np.bincount(data['brand_id'])
print('bin',ii)

which produces

1030:~/mypy$ python3 stack38091849.py 
brands [0 2 3 4 5 6 7 9]
top5 [(99072, 1694566490) (681217, 1510016618) (1694566234, 1180958979)
 (147063168, 147007976) (-1225886932, 139383040)]
top5 [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2)]
counts [3 4 2 1 2 2 4 2]
tops [(7, 4) (2, 4) (0, 3) (9, 2) (6, 2) (5, 2) (3, 2) (4, 1)]
bin [3 0 4 2 1 2 2 4 0 2]

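Initializing items_sold_per_brand with empty, and with the size of data, can leave random counts that are never overwritten during the iteration (the first top5 line in the output above shows this). Using zeros with the smaller brands size takes care of that.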
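A minimal sketch of that point combined with return_counts, wrapped in a helper. The name top_brands and the sample array are hypothetical, introduced only for illustration. Because the result has one row per unique brand, every row gets filled and no uninitialized entries can reach the sort.

```python
import numpy as np

def top_brands(brand_ids, n=5):
    """Return the n most frequent brand ids with their counts, most frequent first.

    Hypothetical helper, not part of the original post.
    """
    brands, counts = np.unique(brand_ids, return_counts=True)
    # One row per unique brand, so every row is filled below.
    result = np.zeros(len(brands), dtype=[('brand_id', int), ('count', int)])
    result['brand_id'] = brands
    result['count'] = counts
    # Stable sort on the negated counts: descending by count,
    # with ties kept in ascending brand_id order.
    order = np.argsort(-counts, kind='stable')
    return result[order[:n]]

# Illustrative data only
sample = np.array([7, 2, 7, 0, 2, 7, 9, 2, 0, 5])
top = top_brands(sample, n=3)
print(top)
```

With the real data this would be called as top_brands(data['brand_id']). There is no Python-level loop over brands here; the counting is done by a single np.unique call.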

nditer is not needed for a simple iteration like this.

bincount is fast, but it creates a bin for every potential value in the range of data, so there may be bins of size 0.
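If bincount is still wanted, the empty bins can be dropped afterwards. The following is only a sketch on made-up ids, and it assumes the ids are non-negative integers, which bincount requires.

```python
import numpy as np

# Illustrative data only: ids 1, 4, 6, 7 and 8 never occur
brand_ids = np.array([0, 2, 2, 3, 5, 5, 5, 9])

bins = np.bincount(brand_ids)    # one bin per value from 0 to max(brand_ids)
present = np.nonzero(bins)[0]    # ids that actually occur
counts = bins[present]           # their counts, without the empty bins

print(present)
print(counts)
```

The present and counts arrays then match what np.unique(brand_ids, return_counts=True) returns. With sparse or very large ids the intermediate bins array can become much longer than the number of distinct brands, in which case np.unique may be the simpler choice.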
