ここでのゲームには少し遅れましたが、HYRY のソリューションでいくつかのパフォーマンスの問題が発生していたため、別のソリューションを考え出す必要がありました。
これは、各キー値の頻度を見つけてから、各キーについて、最も頻繁に表示される値のみを保持することによって機能します。
複数のモードをサポートする追加のソリューションもあります。
私が扱っているデータを代表するスケール テストでは、実行時間が 37.4 秒から 0.5 秒に短縮されました。
ソリューションのコード、使用例、スケール テストは次のとおりです。
import numpy as np
import pandas as pd
import random
import time
test_input = pd.DataFrame(columns=[ 'key', 'value'],
data= [[ 1, 'A' ],
[ 1, 'B' ],
[ 1, 'B' ],
[ 1, np.nan ],
[ 2, np.nan ],
[ 3, 'C' ],
[ 3, 'C' ],
[ 3, 'D' ],
[ 3, 'D' ]])
def mode(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the mode.
The output is a DataFrame with a record per group that has at least one mode
(null values are not counted). The `key_cols` are included as columns, `value_col`
contains a mode (ties are broken arbitrarily and deterministically) for each
group, and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size() \
.to_frame(count_col).reset_index() \
.sort_values(count_col, ascending=False) \
.drop_duplicates(subset=key_cols)
def modes(df, key_cols, value_col, count_col):
'''
Pandas does not provide a `mode` aggregation function
for its `GroupBy` objects. This function is meant to fill
that gap, though the semantics are not exactly the same.
The input is a DataFrame with the columns `key_cols`
that you would like to group on, and the column
`value_col` for which you would like to obtain the modes.
The output is a DataFrame with a record per group that has at least
one mode (null values are not counted). The `key_cols` are included as
columns, `value_col` contains lists indicating the modes for each group,
and `count_col` indicates how many times each mode appeared in its group.
'''
return df.groupby(key_cols + [value_col]).size() \
.to_frame(count_col).reset_index() \
.groupby(key_cols + [count_col])[value_col].unique() \
.to_frame().reset_index() \
.sort_values(count_col, ascending=False) \
.drop_duplicates(subset=key_cols)
print test_input
print mode(test_input, ['key'], 'value', 'count')
print modes(test_input, ['key'], 'value', 'count')
scale_test_data = [[random.randint(1, 100000),
str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
scale_test_input = pd.DataFrame(columns=['key', 'value'],
data=scale_test_data)
start = time.time()
mode(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
modes(scale_test_input, ['key'], 'value', 'count')
print time.time() - start
start = time.time()
scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
print time.time() - start
このコードを実行すると、次のように出力されます。
key value
0 1 A
1 1 B
2 1 B
3 1 NaN
4 2 NaN
5 3 C
6 3 C
7 3 D
8 3 D
key value count
1 1 B 2
2 3 C 2
key count value
1 1 2 [B]
2 3 2 [C, D]
0.489614009857
9.19386196136
37.4375009537
お役に立てれば!