python - Python Pandas, DF.groupby().agg(), column reference in agg()

Question

As a concrete problem, say I have a DataFrame DF:

     word  tag count
0    a     S    30
1    the   S    20
2    a     T    60
3    an    T    5
4    the   T    10

For every "word", I want to find the "tag" that has the highest "count". So the return would be something like:

     word  tag count
1    the   S    20
2    a     T    60
3    an    T    5

I don't care about the count column, or whether the order/index is original or messed up. Returning a dictionary {'the': 'S', ...} is just fine.

I was hoping I could do

DF.groupby(['word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )

but it doesn't work. I can't access the column information.

More abstractly, what does the function in agg(function) see as its argument?

By the way, is .agg() the same as .aggregate()?

Many thanks.

score 71 · Accepted Answer

agg is the same as aggregate. Its callable is passed the columns (Series objects) of the DataFrame, one at a time.
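
To see this for yourself, here is a minimal sketch that records what the callable receives. The helper name describe and the list seen are made up for illustration; the point is only that each call gets a single column of a single group, so x['tag'] has no column to look up inside agg.

```python
import pandas as pd

df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

seen = []  # hypothetical log of (type name, column name) for each call

def describe(col):
    # Record what agg handed us, then return one value per group.
    seen.append((type(col).__name__, col.name))
    return col.iloc[0]

result = df.groupby('word').agg(describe)

print(sorted(set(seen)))
print(result)
```

Every entry in seen should be a Series, never a DataFrame with both tag and count available.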

You could use idxmax to collect the index labels of the rows with the maximum count:

idx = df.groupby('word')['count'].idxmax()
print(idx)

yields

word
a       2
an      3
the     1
Name: count

Then use loc to select those rows, restricted to the word and tag columns:

print(df.loc[idx, ['word', 'tag']])

yields

  word tag
2    a   T
3   an   T
1  the   S

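Note that idxmax returns index labels. df.loc can be used to select rows by label. But if the index is not unique, that is, if there are rows with duplicate index labels, then df.loc will select all rows with the labels listed in idx. So make sure that df.index.is_unique is True if you use idxmax with df.loc.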
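
A small sketch of that pitfall, assuming the same data but with deliberately duplicated index labels (the index values here are invented for the demonstration). Resetting the index is one way to get unique labels back before calling idxmax; it is not the only one.

```python
import pandas as pd

# Same data as in the question, but with duplicated index labels
# (these index values are made up for illustration).
df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]},
                  index=[0, 1, 1, 2, 2])

print(df.index.is_unique)  # False

idx = df.groupby('word')['count'].idxmax()
ambiguous = df.loc[idx, ['word', 'tag']]  # also picks up rows that share a label
print(len(ambiguous))

# A fresh RangeIndex makes every label unique again.
safe = df.reset_index(drop=True)
idx = safe.groupby('word')['count'].idxmax()
picked = safe.loc[idx, ['word', 'tag']]
print(picked)
```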

Alternatively, you could use apply. apply's callable is passed a sub-DataFrame, which gives you access to all the columns:

import pandas as pd
df = pd.DataFrame({'word':'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

print(df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

word
a       T
an      T
the     S

Using idxmax and loc is typically faster than apply, especially for large DataFrames. Using IPython's %timeit:

N = 10000
df = pd.DataFrame({'word':'a the a an the'.split()*N,
                   'tag': list('SSTTT')*N,
                   'count': [30, 20, 60, 5, 10]*N})
def using_apply(df):
    return (df.groupby('word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

def using_idxmax_loc(df):
    idx = df.groupby('word')['count'].idxmax()
    return df.loc[idx, ['word', 'tag']]

In [22]: %timeit using_apply(df)
100 loops, best of 3: 7.68 ms per loop

In [23]: %timeit using_idxmax_loc(df)
100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, then you could use set_index and to_dict like this:

In [36]: df2 = df.loc[idx, ['word', 'tag']].set_index('word')

In [37]: df2
Out[37]: 
     tag
word    
a      T
an     T
the    S

In [38]: df2.to_dict()['tag']
Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}
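
For reference, the same steps can be chained into one self-contained script. This is just a condensed sketch of the approach above, using the small five-row frame; the variable name word_to_tag is my own.

```python
import pandas as pd

df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

# Label of the max-count row per word, then word -> tag as a plain dict.
idx = df.groupby('word')['count'].idxmax()
word_to_tag = df.loc[idx].set_index('word')['tag'].to_dict()

print(word_to_tag)  # {'a': 'T', 'an': 'T', 'the': 'S'}
```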

score 18 · Accepted Answer

Here's a simple way to see what is being passed (unutbu's solution then uses 'apply'):

In [33]: def f(x):
....:     print(type(x))
....:     print(x)
....:     

In [34]: df.groupby('word').apply(f)
<class 'pandas.core.frame.DataFrame'>
  word tag  count
0    a   S     30
2    a   T     60
<class 'pandas.core.frame.DataFrame'>
  word tag  count
0    a   S     30
2    a   T     60
<class 'pandas.core.frame.DataFrame'>
  word tag  count
3   an   T      5
<class 'pandas.core.frame.DataFrame'>
  word tag  count
1  the   S     20
4  the   T     10

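Your function just operates (in this case) on a sub-section of the frame in which the grouped variable (in this case 'word') has the same value throughout. If you pass your own function, you have to deal with aggregating columns that may not be numeric, such as strings. Standard functions like 'sum' do this for you.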

This automatically does NOT aggregate the string columns:

In [41]: df.groupby('word').sum()
Out[41]: 
      count
word       
a        90
an        5
the      30
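
One caveat: the output above comes from the pandas version I ran at the time. As far as I know, newer releases (2.0 and later) changed the default of numeric_only to False, so the string columns may get concatenated instead of dropped. A hedged sketch that states the intent explicitly:

```python
import pandas as pd

df = pd.DataFrame({'word': 'a the a an the'.split(),
                   'tag': list('SSTTT'),
                   'count': [30, 20, 60, 5, 10]})

# Ask for numeric columns only, rather than relying on the version's default.
totals = df.groupby('word').sum(numeric_only=True)

print(totals)
```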

Here you ARE aggregating over all columns:

In [42]: df.groupby('word').apply(lambda x: x.sum())
Out[42]: 
        word tag count
word                  
a         aa  ST    90
an        an   T     5
the   thethe  ST    30

You can do pretty much anything within the function:

In [43]: df.groupby('word').apply(lambda x: x['count'].sum())
Out[43]: 
word
a       90
an       5
the     30
