python - Manipulating a grouped column in Pandas

Question

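I have a dataset with the columns Dist, Class, and Count.

I want to group the dataset by Dist and divide each group's Count column by the sum of the counts in that group, so that the counts in each group are normalized to sum to 1.

The following MWE shows my approach so far. Is there a more compact, more pandas-idiomatic way of writing this?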

import pandas as pd
import numpy as np

a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])

def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x/csum)
    return x

s.groupby('Dist').apply(manipcolumn)

score 2 · Accepted Answer

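Another way to get the normalized 'Count' column is to use groupby and transform to get the sum for each group, and then divide the 'Count' column by the returned Series. You can then assign the result back to the DataFrame.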

s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform(np.sum)

This avoids the need for a custom Python function and apply. Testing it on the small example DataFrame from the question, I found it to be around 8 times faster.
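As a minimal sketch of how this behaves, here is the same idea on a small fixed DataFrame. The sample values are made up for illustration and are not from the question, and the string alias "sum" is used in place of np.sum, which should be equivalent here.

```python
import pandas as pd

# Hypothetical sample data (not from the question); every Dist group has a non-zero total
s = pd.DataFrame({
    "Dist":  [0, 0, 1, 1, 1],
    "Class": [1, 2, 1, 2, 3],
    "Count": [1, 3, 2, 2, 4],
})

# transform returns one value per original row: the total Count of that row's Dist group
group_totals = s.groupby("Dist")["Count"].transform("sum")

# Divide each Count by its group's total; assign returns a new DataFrame
normalized = s.assign(Count=s["Count"] / group_totals)

# Sanity check: the normalized counts in each group should add up to 1
check = normalized.groupby("Dist")["Count"].sum()
print(normalized)
print(check)
```

Note that with the random data in the question, a group whose counts are all 0 has a total of 0, so its normalized values come out as NaN. Whether to drop or fill those rows depends on what you need.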
