python - Factorizing multiple columns in pandas

Question

The pandas factorize function assigns each unique value in a Series to a sequential, 0-based index, and computes which index each Series entry belongs to.
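As a quick illustration of the single-column behaviour described above, here is a minimal sketch. The Series `s` and its values are made up for the example and are not part of the original question.

```python
import pandas as pd

# Hypothetical example data, not taken from the question
s = pd.Series(['b', 'a', 'b', 'c'])

# factorize returns (codes, uniques): codes[i] is the position of s[i] in uniques
codes, uniques = pd.factorize(s)

print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['b', 'a', 'c']
```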

I would like to achieve the equivalent of pandas.factorize across multiple columns:

import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y':[1, 2, 2, 2, 2, 1]})
pd.factorize(df)[0] # would like [0, 1, 2, 2, 1, 0]

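That is, I want to identify each unique tuple of values across several columns of a data frame, assign a sequential index to each, and compute which index each row of the data frame belongs to.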

factorize only works on a single column. Does pandas have a multi-column equivalent?

score 1 · Accepted Answer

I am not sure whether this is an efficient solution; there may well be a better one.

arr = []  # holds the unique rows of the dataframe, in order of first appearance
for i in range(len(df)):  # iterate by position, because iloc is positional
    row = df.iloc[i].tolist()
    if row not in arr:
        arr.append(row)

So if you print arr, you get

>>> print(arr)
[[1, 1], [1, 2], [2, 2]]

To hold the indices, declare a list ind:

ind = []  # for each row of df, the position of that row in arr
for i in range(len(df)):
    ind.append(arr.index(df.iloc[i].tolist()))

Printing ind gives

>>> print(ind)
[0, 1, 2, 2, 1, 0]
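The list-based lookup above rescans arr for every row. A possible variation, sketched here under the assumption that the rows are hashable once converted to tuples, keeps a dictionary from each unique tuple to its index instead. The name `seen` is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

seen = {}  # maps each unique row tuple to its sequential index
ind = []   # index assigned to each row of df
for row in df.itertuples(index=False, name=None):
    # setdefault stores the next free index the first time a tuple is seen
    ind.append(seen.setdefault(row, len(seen)))

print(ind)  # [0, 1, 2, 2, 1, 0]
```

Each row then costs one dictionary lookup rather than a scan of the list of unique rows.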

score 0 · Accepted Answer

You can use drop_duplicates to remove the duplicated rows:

In [23]: df.drop_duplicates()
Out[23]: 
      x  y
   0  1  1
   1  1  2
   2  2  2

Edit

To achieve your goal, you can join the original df to the result of drop_duplicates:

In [46]: df.join(df.drop_duplicates().reset_index().set_index(['x', 'y']), on=['x', 'y'])
Out[46]: 
   x  y  index
0  1  1      0
1  1  2      1
2  2  2      2
3  2  2      2
4  1  2      1
5  1  1      0
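Note that the index column produced by this join holds the label of the row where each (x, y) pair first appears. It matches 0, 1, 2 here only because the three unique pairs first occur in rows 0, 1 and 2. A minimal sketch of one way to get sequential codes in general is to renumber the unique rows before merging; the column name `code` is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# Unique rows in order of first appearance, renumbered 0..k-1
uniques = df.drop_duplicates().reset_index(drop=True)
uniques['code'] = uniques.index  # 'code' is a hypothetical column name

# A left merge keeps the row order of df
codes = df.merge(uniques, on=['x', 'y'], how='left')['code'].to_numpy()
print(codes.tolist())  # [0, 1, 2, 2, 1, 0]
```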

score 0 · Accepted Answer

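import pandas as pd
df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})
# Combine each row's (x, y) values into a tuple, then factorize the tuples
tuples = df[['x', 'y']].apply(tuple, axis=1)
df['newID'] = pd.factorize(tuples)[0]  # [0, 1, 2, 2, 1, 0]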
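Another option that avoids building a tuple per row is to group on the columns and number the groups. This is only a sketch of that alternative, not something taken from the answers above. It relies on `groupby(..., sort=False)` keeping groups in order of first appearance, so that `ngroup()` numbers them the way factorize would.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2, 2, 1, 1], 'y': [1, 2, 2, 2, 2, 1]})

# sort=False keeps groups in order of first appearance; ngroup numbers them from 0
df['newID'] = df.groupby(['x', 'y'], sort=False).ngroup()
print(df['newID'].tolist())  # [0, 1, 2, 2, 1, 0]
```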
