python - cartesian product in pandas

Question

I have two pandas dataframes:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})

What is the best practice to get their cartesian product (without writing it out explicitly, as I do below)?

#df1, df2 cartesian product
df_cartesian = DataFrame({'col1':[1,2,1,2],'col2':[3,4,3,4],'col3':[5,5,6,6]})

score 139 · Accepted Answer

In recent versions of pandas (>= 1.2) this is built into merge, so you can do:

from pandas import DataFrame
df1 = DataFrame({'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'col3':[5,6]})    

df1.merge(df2, how='cross')

This is equivalent to the earlier answer for pandas < 1.2, but is easier to read.
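As a quick check, the sketch below compares the how='cross' result with the frame written out by hand in the question. The row order of the two differs, so both are sorted before comparing. The names result, expected and same are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col3': [5, 6]})

# how='cross' requires pandas >= 1.2
result = df1.merge(df2, how='cross')

# The frame written out explicitly in the question
expected = pd.DataFrame({'col1': [1, 2, 1, 2],
                         'col2': [3, 4, 3, 4],
                         'col3': [5, 5, 6, 6]})

# Same rows, different order: sort both sides before comparing
cols = ['col1', 'col2', 'col3']
same = (
    result.sort_values(cols).reset_index(drop=True)
    .equals(expected.sort_values(cols).reset_index(drop=True))
)
print(same)
```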

For pandas < 1.2:

If you have a key that is repeated for each row, you can produce a cartesian product using merge (as you would in SQL).

from pandas import DataFrame, merge
df1 = DataFrame({'key':[1,1], 'col1':[1,2],'col2':[3,4]})
df2 = DataFrame({'key':[1,1], 'col3':[5,6]})

merge(df1, df2,on='key')[['col1', 'col2', 'col3']]

Output:

   col1  col2  col3
0     1     3     5
1     1     3     6
2     2     4     5
3     2     4     6

See the documentation: http://pandas.pydata.org/pandas-docs/stable/merging.html

score 94 · Accepted Answer

Use pd.MultiIndex.from_product as the index of an otherwise empty dataframe, then reset its index, and you're done.

import pandas as pd

a = [1, 2, 3]
b = ["a", "b", "c"]

index = pd.MultiIndex.from_product([a, b], names = ["a", "b"])

pd.DataFrame(index = index).reset_index()

Out:

   a  b
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  2  c
6  3  a
7  3  b
8  3  c

score 38 · Accepted Answer

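This won't win a code golf competition, and it borrows from the previous answers, but it clearly shows how the key is added and how the join works. It creates two new data frames from lists, then adds the key on which the cartesian product is performed.

My use case was that I needed a list of all store IDs for each week in my list. So I created a list of all the weeks I wanted, then a list of all the store IDs I wanted to map them against.

The merge I chose is left, but it is semantically the same as inner in this setup. You can see this in the documentation on merging, which states that a cartesian product is produced when a key combination appears more than once in both tables, which is exactly what we set up.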

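import pandas as pd

# list_of_days and list_of_stores are existing Python lists
days = pd.DataFrame({'date': list_of_days})
stores = pd.DataFrame({'store_id': list_of_stores})
# Constant key shared by every row of both frames
stores['key'] = 0
days['key'] = 0
days_and_stores = days.merge(stores, how='left', on='key')
days_and_stores.drop('key', axis=1, inplace=True)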
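To make the left-versus-inner point concrete, here is a minimal sketch with made-up sample values standing in for list_of_days and list_of_stores (the dates and store IDs are assumptions, not data from the answer). Because every row carries the same key, no row on either side is left unmatched, so both merge types should pair every day with every store.

```python
import pandas as pd

# Hypothetical sample data standing in for the poster's lists
list_of_days = pd.date_range("2021-01-04", periods=3, freq="D")
list_of_stores = [101, 102]

days = pd.DataFrame({'date': list_of_days}).assign(key=0)
stores = pd.DataFrame({'store_id': list_of_stores}).assign(key=0)

# Every row shares key == 0, so each day matches each store
left = days.merge(stores, how='left', on='key').drop(columns='key')
inner = days.merge(stores, how='inner', on='key').drop(columns='key')

days_and_stores = left
print(len(days_and_stores))   # 3 days x 2 stores
print(left.equals(inner))     # compares the two merge types
```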

score 24 · Accepted Answer

With method chaining:

product = (
    df1.assign(key=1)
    .merge(df2.assign(key=1), on="key")
    .drop("key", axis=1)
)
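If the chained version is needed in several places, it could be wrapped in a small helper. The sketch below is one way to do that; the function name cross_join and the default key name "_cross_key" are assumptions made for illustration, and the guard only protects against the temporary key colliding with a real column.

```python
import pandas as pd

def cross_join(left, right, key="_cross_key"):
    """Cartesian product of two frames via a temporary constant key."""
    if key in left.columns or key in right.columns:
        raise ValueError(f"temporary key {key!r} clashes with an existing column")
    return (
        left.assign(**{key: 1})
        .merge(right.assign(**{key: 1}), on=key)
        .drop(columns=key)
    )

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col3': [5, 6]})
product = cross_join(df1, df2)
print(product)
```

Since assign returns a copy, neither input frame is modified.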

score 16 · Accepted Answer

Alternatively, you can rely on the cartesian product provided by itertools: itertools.product, which avoids creating a temporary key or modifying the index:

import numpy as np 
import pandas as pd 
import itertools

def cartesian(df1, df2):
    rows = itertools.product(df1.iterrows(), df2.iterrows())

    # Series.append was removed in pandas 2.0; pd.concat joins the two row Series
    df = pd.DataFrame(pd.concat([left, right]) for (_, left), (_, right) in rows)
    return df.reset_index(drop=True)

Quick test:

In [46]: a = pd.DataFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

In [47]: b = pd.DataFrame(np.random.rand(5, 3), columns=["d", "e", "f"])    

In [48]: cartesian(a,b)
Out[48]:
           a         b         c         d         e         f
0   0.436480  0.068491  0.260292  0.991311  0.064167  0.715142
1   0.436480  0.068491  0.260292  0.101777  0.840464  0.760616
2   0.436480  0.068491  0.260292  0.655391  0.289537  0.391893
3   0.436480  0.068491  0.260292  0.383729  0.061811  0.773627
4   0.436480  0.068491  0.260292  0.575711  0.995151  0.804567
5   0.469578  0.052932  0.633394  0.991311  0.064167  0.715142
6   0.469578  0.052932  0.633394  0.101777  0.840464  0.760616
7   0.469578  0.052932  0.633394  0.655391  0.289537  0.391893
8   0.469578  0.052932  0.633394  0.383729  0.061811  0.773627
9   0.469578  0.052932  0.633394  0.575711  0.995151  0.804567
10  0.466813  0.224062  0.218994  0.991311  0.064167  0.715142
11  0.466813  0.224062  0.218994  0.101777  0.840464  0.760616
12  0.466813  0.224062  0.218994  0.655391  0.289537  0.391893
13  0.466813  0.224062  0.218994  0.383729  0.061811  0.773627
14  0.466813  0.224062  0.218994  0.575711  0.995151  0.804567
15  0.831365  0.273890  0.130410  0.991311  0.064167  0.715142
16  0.831365  0.273890  0.130410  0.101777  0.840464  0.760616
17  0.831365  0.273890  0.130410  0.655391  0.289537  0.391893
18  0.831365  0.273890  0.130410  0.383729  0.061811  0.773627
19  0.831365  0.273890  0.130410  0.575711  0.995151  0.804567
20  0.447640  0.848283  0.627224  0.991311  0.064167  0.715142
21  0.447640  0.848283  0.627224  0.101777  0.840464  0.760616
22  0.447640  0.848283  0.627224  0.655391  0.289537  0.391893
23  0.447640  0.848283  0.627224  0.383729  0.061811  0.773627
24  0.447640  0.848283  0.627224  0.575711  0.995151  0.804567
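Because iterrows builds a Series for every row, the function above can get slow on larger frames. A vectorized variant, not part of the original answer, could repeat row positions with NumPy instead. The helper name cartesian_by_position is hypothetical, and the sketch assumes the two frames have no column names in common.

```python
import numpy as np
import pandas as pd

def cartesian_by_position(df1, df2):
    """Cartesian product built from repeated row positions (hypothetical helper)."""
    # Each left row position repeated once per right row: 0, 0, ..., 1, 1, ...
    left_pos = np.repeat(np.arange(len(df1)), len(df2))
    # The right row positions cycled once per left row: 0, 1, ..., 0, 1, ...
    right_pos = np.tile(np.arange(len(df2)), len(df1))
    left = df1.iloc[left_pos].reset_index(drop=True)
    right = df2.iloc[right_pos].reset_index(drop=True)
    # Both halves now share a 0..n-1 index, so they align row by row
    return pd.concat([left, right], axis=1)

a = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
b = pd.DataFrame({'col3': [5, 6]})
result = cartesian_by_position(a, b)
print(result)
```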

score 2 · Accepted Answer

If you have no overlapping columns, don't want to add one, and the indices of the data frames can be discarded, this may be easier:

# Index objects are immutable, so assign a new all-zero index to each frame
df1.index = [0] * len(df1)
df2.index = [0] * len(df2)
df_cartesian = df1.join(df2, how='outer')
df_cartesian.index = range(len(df_cartesian))

score 0 · Accepted Answer

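Yet another workaround for the current version of pandas (1.1.5): this one is particularly useful if you're starting out with a non-dataframe sequence. I haven't timed it. It doesn't require any artificial index manipulation, but it does require that you repeat the second sequence. It relies on a special property of explode, namely that the right-hand index is repeated.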

from pandas import DataFrame, Series

df1 = DataFrame({'col1': [1,2], 'col2': [3,4]})

series2 = Series(
    [[5, 6]]*len(df1),
    name='col3',
    index=df1.index,
)

df_cartesian = df1.join(series2.explode())

This outputs:

   col1  col2 col3
0     1     3    5
0     1     3    6
1     2     4    5
1     2     4    6
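The result keeps the repeated index (0, 0, 1, 1), and explode leaves col3 with object dtype. If a fresh index and an integer column are wanted, a small follow-up along these lines might help; the target dtype int64 is an assumption that fits this example's values.

```python
from pandas import DataFrame, Series

df1 = DataFrame({'col1': [1, 2], 'col2': [3, 4]})
series2 = Series([[5, 6]] * len(df1), name='col3', index=df1.index)

df_cartesian = (
    df1.join(series2.explode())
    .reset_index(drop=True)        # replace the repeated 0, 0, 1, 1 index
    .astype({'col3': 'int64'})     # explode produces an object column
)
print(df_cartesian.dtypes)
```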
