python - Complicated (for me) reshaping from wide to long in Pandas

Question

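Individuals (indexed from 0 to 5) choose between two locations, A and B. My data is in wide format and contains characteristics that vary by individual (ind_var) and characteristics that vary only by location (location_var).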

For example, I have:

In [281]:

df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

df_reshape_test

Out[281]:
    dist_to_A   dist_to_B   ind_var location location_var
0    0            50             3   A       10
1    0            50             8   A       10
2    0            50            10   A       10
3    50           0              1   B       14
4    50           0              3   B       14
5    50           0              4   B       14

The variable 'location' is the one chosen by the individual. dist_to_A is the distance from the location chosen by the individual to location A (and likewise for dist_to_B).

I'd like my data to have this form:

    choice  dist_S  ind_var location    location_var
0    1        0       3         A           10
0    0       50       3         B           14
1    1        0       8         A           10
1    0       50       8         B           14
2    1        0      10         A           10
2    0       50      10         B           14
3    0       50       1         A           10
3    1        0       1         B           14
4    0       50       3         A           10
4    1        0       3         B           14
5    0       50       4         A           10
5    1        0       4         B           14

where choice == 1 indicates that the individual chose that location and dist_S is the distance from the chosen location.

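I read about the .stack method but couldn't figure out how to apply it to this case. Thanks for your time!

Note: this is just a simple example. The datasets I'm working with have varying numbers of locations and of individuals per location, so I'm looking for a flexible solution if possible.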

score 6 · Accepted Answer

In fact, pandas has a wide_to_long command that conveniently does what you intend to do.

import numpy as np
import pandas as pd

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 
                'dist_to_A' : [0, 0, 0, 50, 50, 50], 
                'dist_to_B' : [50, 50, 50, 0, 0, 0], 
                'location_var': [10, 10, 10, 14, 14, 14], 
                'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

# The `location` and `location_var` columns correspond to the choices;
# record them as dictionaries and drop them
# (just realized you had a cleaner way, copied from yours).

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
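# suffix=r'\w+' is needed on recent pandas, where the default suffix only matches digits
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location', suffix = r'\w+')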

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print(df_long.sort_index())

This gives the table you asked for:

              ind_var  dist_to_ choice  location_var
ind location                                        
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course, you can generate a choice variable that takes the values 0 and 1, if that is what you need.
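A minimal sketch of one way to get that 0/1 indicator, assuming a recent pandas in which wide_to_long needs suffix=r'\w+' to match letter suffixes. The names chosen, loc_var, ind_level and loc_level are illustrative and do not come from the answer; the idea is simply to compare each row's location level with the location that individual picked.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dist_to_A': [0, 0, 0, 50, 50, 50],
    'dist_to_B': [50, 50, 50, 0, 0, 0],
    'location_var': [10, 10, 10, 14, 14, 14],
    'ind_var': [3, 8, 10, 1, 3, 4],
})
df['ind'] = df.index

# Lookups: chosen location per individual, location_var per location
chosen = df['location']
loc_var = df.groupby('location')['location_var'].first()

# Reshape only the dist_to_* columns; ind_var is carried along per individual
df_long = pd.wide_to_long(
    df.drop(columns=['location', 'location_var']),
    stubnames='dist_to_', i='ind', j='location', suffix=r'\w+',
).sort_index()

ind_level = df_long.index.get_level_values('ind')
loc_level = df_long.index.get_level_values('location')

# choice is 1 where the row's location is the one the individual picked
df_long['choice'] = (loc_level == chosen.loc[ind_level].to_numpy()).astype(int)
df_long['location_var'] = loc_var.loc[loc_level].to_numpy()
df_long = df_long.rename(columns={'dist_to_': 'dist_S'})

print(df_long)
```

Because nothing here hard-codes 'A' or 'B', the same steps should carry over to more locations as long as the distance columns follow the dist_to_<location> naming.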

score 3 · Accepted Answer

I'm a bit curious why you'd prefer it in that format. There's probably a much better way to store your data. But here goes.

In [137]: import numpy as np

In [138]: import pandas as pd

In [139]: df_reshape_test = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

In [140]: print(df_reshape_test)
   dist_to_A  dist_to_B  ind_var location  location_var
0          0         50        3        A            10
1          0         50        8        A            10
2          0         50       10        A            10
3         50          0        1        B            14
4         50          0        3        B            14
5         50          0        4        B            14

In [141]: # Get the new axis separately:

In [142]: idx = pd.Index(df_reshape_test.index.tolist() * 2)

In [143]: df2 = df_reshape_test[['ind_var', 'location', 'location_var']].reindex(idx)

In [144]: print(df2)
   ind_var location  location_var
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14
0        3        A            10
1        8        A            10
2       10        A            10
3        1        B            14
4        3        B            14
5        4        B            14

In [145]: # Swap the location for the second half

In [146]: # replace any 6 with len(df2) // 2 (the number of original rows) if you have more rows.

In [147]: df2['choice'] = [1] * 6 + [0] * 6  # may need to play with this.

In [148]: df2.iloc[6:, df2.columns.get_loc('location')] = df2['location'].iloc[6:].replace({'A': 'B', 'B': 'A'}).to_numpy()

In [149]: df2 = df2.sort_index(kind='mergesort')  # stable sort keeps the chosen row first

In [150]: df2['dist_S'] = np.abs((df2.choice - 1) * 50)

In [151]: print(df2)
   ind_var location  location_var  choice  dist_S
0        3        A            10       1       0
0        3        B            10       0      50
1        8        A            10       1       0
1        8        B            10       0      50
2       10        A            10       1       0
2       10        B            10       0      50
3        1        B            14       1       0
3        1        A            14       0      50
4        3        B            14       1       0
4        3        A            14       0      50
5        4        B            14       1       0
5        4        A            14       0      50

It won't generalize well, but there are probably alternative (better) ways to get around the uglier parts, like generating the choice col.
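One possible way around the hard-coded parts, sketched here with DataFrame.melt rather than the reindex trick above. It assumes the distance columns are always named dist_to_<location>; the names wide, dist_cols, alt and loc_var are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['A', 'A', 'A', 'B', 'B', 'B'],
    'dist_to_A': [0, 0, 0, 50, 50, 50],
    'dist_to_B': [50, 50, 50, 0, 0, 0],
    'location_var': [10, 10, 10, 14, 14, 14],
    'ind_var': [3, 8, 10, 1, 3, 4],
})

# Keep the individual id as a regular column so it survives the melt
wide = df.rename_axis('ind').reset_index()
dist_cols = [c for c in wide.columns if c.startswith('dist_to_')]

# One row per (individual, candidate location)
long = wide.melt(id_vars=['ind', 'ind_var', 'location'],
                 value_vars=dist_cols, var_name='alt', value_name='dist_S')
long['alt'] = long['alt'].str.replace('dist_to_', '', regex=False)

# choice is 1 when the candidate location is the one actually chosen
long['choice'] = (long['alt'] == long['location']).astype(int)

# location_var depends only on the candidate location
loc_var = df.drop_duplicates('location').set_index('location')['location_var']
long['location_var'] = long['alt'].map(loc_var)

long = (long.drop(columns='location')
            .rename(columns={'alt': 'location'})
            .sort_values(['ind', 'location'])
            .set_index('ind'))
cols = ['choice', 'dist_S', 'ind_var', 'location', 'location_var']
print(long[cols])
```

Unlike the session above, this looks up location_var by the candidate location, so it follows the layout requested in the question.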

score 2 · Accepted Answer

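OK, this took longer than I expected, but here is a more general answer that works with an arbitrary number of choices per individual. I'm sure there are simpler ways, so it would be great if someone could chime in with something better for some of the following code.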

import numpy as np
import pandas as pd

df = pd.DataFrame( {'location' : ['A', 'A', 'A', 'B', 'B', 'B'], 'dist_to_A' : [0, 0, 0, 50, 50, 50], 'dist_to_B' : [50, 50, 50, 0, 0, 0], 'location_var': [10, 10, 10, 14, 14, 14], 'ind_var': [3, 8, 10, 1, 3, 4]})

which gives

    dist_to_A   dist_to_B   ind_var location   location_var
0    0           50          3     A            10
1    0           50          8     A            10
2    0           50         10     A            10
3    50          0           1     B            14
4    50          0           3     B            14
5    50          0           4     B            14

Now we do the following:

df.index.names = ['ind']

# Add choice var

df['choice'] = 1

# Create dictionaries we'll use later

ind_to_loc = dict(df['location'])
# gives ind_to_loc equal to {0 : 'A', 1 : 'A', 2 : 'A', 3 : 'B', 4 : 'B', 5: 'B'}

ind_dict = dict(df['ind_var'])
#gives  { 0: 3, 1 : 8, 2 : 10, 3: 1, 4 : 3, 5: 4}

loc_dict = dict(  df.groupby('location').agg(lambda x : int(np.mean(x)) )['location_var']  )
# gives  {'A' : 10, 'B' : 14}

Now we create a MultiIndex and reindex to get the long shape:

df = df.set_index( [df.index, df['location']] )

df.index.names = ['ind', 'location']

# re-index to long shape

loc_list = ['A', 'B']
ind_list = [0, 1, 2, 3, 4, 5]
new_shape = [  (ind, loc) for ind in ind_list for loc in loc_list]
idx = pd.Index(new_shape)
df_long = df.reindex(idx, method = None)
df_long.index.names = ['ind', 'loc']

The long shape looks like this:

         dist_to_A  dist_to_B  ind_var location  location_var  choice
ind loc                                                              
0   A            0         50        3        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
1   A            0         50        8        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
2   A            0         50       10        A            10       1
    B          NaN        NaN      NaN      NaN           NaN     NaN
3   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        1        B            14       1
4   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        3        B            14       1
5   A          NaN        NaN      NaN      NaN           NaN     NaN
    B           50          0        4        B            14       1

Fill in the NaN values using the dictionaries:

df_long['ind_var'] = df_long.index.map(lambda x : ind_dict[x[0]] )
df_long['location']  = df_long.index.map(lambda x : ind_to_loc[x[0]] )
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]] )

# Fill in choice
df_long['choice'] = df_long['choice'].fillna(0)

Finally, all that's left is to create dist_S:

nested_loc = {'A' : {'A' : 0, 'B' : 50}, 'B' : {'A' : 50, 'B' : 0}}

(This reads as: if you are at location A, then location A is 0 km away and location B is 50 km away.)

def nested_f(x):
    # x is one row holding the candidate location ('loc') and the chosen one ('location')
    return nested_loc[x['loc']][x['location']]

df_long = df_long.reset_index()
df_long['dist_S'] = df_long[['loc', 'location']].apply(nested_f, axis=1)

df_long = df_long.drop(['dist_to_A', 'dist_to_B', 'location'], axis = 1 )

df_long

which gives the desired result:

    ind loc ind_var location_var    choice  dist_S
0    0   A   3         10            1      0
1    0   B   3         14            0      50
2    1   A   8         10            1      0
3    1   B   8         14            0      50
4    2   A   10        10            1      0
5    2   B   10        14            0      50
6    3   A   1         10            0      50
7    3   B   1         14            1      0
8    4   A   3         10            0      50
9    4   B   3         14            1      0
10   5   A   4         10            0      50
11   5   B   4         14            1      0
