python - How to apply a function to two columns of a Pandas DataFrame

Question

Suppose I have a df with the columns 'ID', 'col_1', 'col_2', and I define a function:

f = lambda x, y : my_function_expression.

Now I want to apply f to df's two columns 'col_1', 'col_2' to compute a new column 'col_3' element-wise, somewhat like this:

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How can I do this?

**Adding a detailed sample below**

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

score 447 · Accepted Answer

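Here's an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference: instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.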

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas group object and then use apply on the group.
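Applied to the sample data from the question, the row-wise approach above might look like the sketch below. The name get_sublist_from_row is made up for illustration; it is the question's get_sublist rewritten to take a single row.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist_from_row(row):
    # row is a pandas Series holding one row of the selected columns;
    # index it by column label to recover the two values
    return mylist[row['col_1']:row['col_2'] + 1]

# axis=1 makes apply pass one row at a time instead of one column at a time
df['col_3'] = df[['col_1', 'col_2']].apply(get_sublist_from_row, axis=1)
print(df)
```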

score 142 · Accepted Answer

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
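As a quick check against the question's sample data, here is the same one-liner with get_sublist in place of f. Note that `*x` unpacks the row's values in column order, so this sketch assumes the selected columns are listed in the same order as the function's parameters.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# Each row arrives as a Series; *x spreads its values over (sta, end)
df['col_3'] = df[['col_1', 'col_2']].apply(lambda x: get_sublist(*x), axis=1)
print(df)
```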

score 27 · Accepted Answer

The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that it expects the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
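For comparison, here is a small sketch of Series.combine with a function that returns a plain scalar rather than a list. The helper span_length is hypothetical and not part of the question; it only illustrates the calling convention, where the function receives one value from each Series.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def span_length(sta, end):
    # hypothetical helper: number of elements the slice sta..end would cover
    return end - sta + 1

# combine calls span_length(sta, end) once per pair of aligned elements
df['span'] = df['col_1'].combine(df['col_2'], span_length)
print(df)
```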

score 14 · Accepted Answer

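I'd vote for np.vectorize. It lets you pass in any number of columns without dealing with the dataframe inside the function, so it's great for functions you don't control, or for doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').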

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
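The "2 columns and a constant" case mentioned above could be sketched as follows. The three-argument function tag_range is invented for illustration; the scalar 'foo' is broadcast against the two columns, so the function sees it on every call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def tag_range(sta, end, tag):
    # hypothetical function taking two column values plus a constant
    return f"{tag}:{sta}-{end}"

# otypes=[object] keeps the results as Python objects
vectorized = np.vectorize(tag_range, otypes=[object])
df['col_3'] = vectorized(df['col_1'], df['col_2'], 'foo')
print(df)
```

np.vectorize is essentially a loop in disguise, so this is a convenience rather than a performance optimization.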

score 12 · Accepted Answer

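The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.

You need to change f so that it takes a single input, keep the above dataframe as the input, and then break it up into x, y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f), which calls f with one argument. So f needs to take a single thing (a piece of the dataframe, passed as a Series), not the two things your current f expects.

Since you haven't provided the body of f, I can't help in any more detail, but this should give you a way out without fundamentally changing your code or switching from apply to some other method.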
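To make the point concrete, here is a rough sketch that records what apply actually hands to the function, followed by a one-argument f that splits its input into x and y. The record helper and the addition inside f are placeholders, since the real body of f is not given in the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

seen = []

def record(arg):
    # hypothetical probe: note the type and name of whatever apply passes in
    seen.append((type(arg).__name__, arg.name))
    return arg.sum()

# Without axis=1, apply calls the function with one whole column at a time
df[['col_1', 'col_2']].apply(record)
print(seen)

def f(row):
    # single input, split into x and y inside the body
    x, y = row['col_1'], row['col_2']
    return x + y  # placeholder for my_function_expression

df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)
print(df)
```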

score 11 · Accepted Answer

I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

This way we can pass as many arguments as we want into the function. The output is what we wanted:

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

score 5 · Accepted Answer

Another option is df.itertuples() (generally faster, and recommended over df.iterrows() by the docs and by user testing):

import pandas as pd

df = pd.DataFrame([range(4) for _ in range(4)], columns=list("abcd"))

df
    a   b   c   d
0   0   1   2   3
1   0   1   2   3
2   0   1   2   3
3   0   1   2   3


df["e"] = [sum(row) for row in df[["b", "d"]].itertuples(index=False)]

df
    a   b   c   d   e
0   0   1   2   3   4
1   0   1   2   3   4
2   0   1   2   3   4
3   0   1   2   3   4

Since itertuples returns an Iterable of namedtuples, you can access tuple elements both as attributes by column name (aka dot notation) and by index:

b, d = row
b = row.b
d = row[1]
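With the sample data from the question, the same idea might look like this sketch, which uses dot notation to pull the two values out of each namedtuple:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# Each row is a namedtuple with fields col_1 and col_2
df['col_3'] = [
    get_sublist(row.col_1, row.col_2)
    for row in df[['col_1', 'col_2']].itertuples(index=False)
]
print(df)
```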

score 1 · Accepted Answer

Here is a faster solution:

def func_1(a,b):
    return a + b

df["C"] = func_1(df["A"].to_numpy(),df["B"].to_numpy())

This is 380 times faster than df.apply(f, axis=1) from @Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from @ajrwhite.

I've added some benchmarks too.

Results:

FUNCTIONS         TIMINGS   GAIN
apply lambda      0.7       x 1
apply             0.56      x 1.25
map               0.3       x 2.3
np.vectorize      0.01      x 70
f3 on Series      0.0026    x 270
f3 on np arrays   0.0018    x 380
f3 numba          0.0018    x 380

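In short:

Using apply is slow. We can speed things up very simply, just by using a function that operates directly on Pandas Series (or better, on numpy arrays). Because the function then works on whole Series or arrays at once, the operations can be vectorized. The function returns a Pandas Series or numpy array, which we assign as a new column.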
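A minimal sketch of the difference, using a small made-up dataframe: the same function is called once per row through apply, and once in total on whole numpy arrays. This only works when the function body consists of operations that numpy applies element-wise (such as `+`); a function like get_sublist from the question, which slices a Python list, cannot be called on arrays this way.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def add(a, b):
    # works on scalars and, unchanged, on whole arrays or Series
    return a + b

# Row by row: add is called once per row with two scalars
row_wise = df.apply(lambda row: add(row['A'], row['B']), axis=1)

# Vectorized: add is called once with two numpy arrays
df['C'] = add(df['A'].to_numpy(), df['B'].to_numpy())

print(row_wise.tolist())
print(df['C'].tolist())
```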

And here is the benchmark code:

import timeit

timeit_setup = """
import pandas as pd
import numpy as np
import numba

np.random.seed(0)

# Create a DataFrame of 10000 rows with 2 columns "A" and "B" 
# containing integers between 0 and 9 (the upper bound of randint is exclusive)
df = pd.DataFrame(np.random.randint(0,10,size=(10000, 2)), columns=["A", "B"])

def f1(a,b):
    # Here a and b are the values of column A and B for a specific row: integers
    return a + b

def f2(x):
    # Here, x is a pandas Series, and corresponds to a specific row of the DataFrame
    # 0 and 1 are the positions of columns A and B within the row;
    # .iloc makes the positional lookup explicit
    return x.iloc[0] + x.iloc[1]

def f3(a,b):
    # Same as f1 but we will pass parameters that will allow vectorization
    # Here, A and B will be Pandas Series or numpy arrays
    # with df["C"] = f3(df["A"],df["B"]): Pandas Series
    # with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
    return a + b

@numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a,b):
    # Here a and b are 2 numpy arrays with dtype int64
    # This function must return a numpy array with dtype int64
    return a + b

"""

test_functions = [
'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
'df["C"] = df.apply(f2, axis=1)',
'df["C"] = list(map(f3,df["A"],df["B"]))',
'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3(df["A"],df["B"])',
'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]


for test_function in test_functions:
    print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))

Output:

Final note: things could be optimized further with Cython and other numba tricks.
