python - How do I do one-hot encoding with multiple values in one cell?

Question

I have this table in Excel:

id  class
0   2 3
1   1 3 
2   3 5

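Now I want to do a "special" one-hot encoding in Python. Each id in the first table has two numbers, and each number corresponds to a class (class1, class2, etc.). The second table is built from the first: for each id, each number in its row appears in the corresponding class column, and all other columns are zero. For example, id 0 has the numbers 2 and 3, so 2 goes in class2 and 3 goes in class3, while classes 1, 4, and 5 get the default 0. The result should look like this: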

id  class1  class2  class3  class4  class5
 0   0       2        3       0       0
 1   1       0        3       0       0
 2   0       0        3       0       5

My previous attempt:

foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])

Result:

    class_1 class_2 class_3 class_3 class_5
  0  0.0     1.0    0.0      1.0    0.0
  1  1.0     0.0    0.0      1.0    0.0
  2  0.0     0.0    1.0      0.0    1.0

(class_3 appears twice.) How can I fix this? (After this step, I can convert the result to the final format I need.)

score 4 · Accepted Answer

You need to make the variable categorical; then you can apply one-hot encoding as follows:

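In [17]: import pandas as pd

# Fix the category set up front so every class 1-5 gets a column, even if it never occurs.
In [18]: df1 = pd.DataFrame({"class": pd.Series(['2','1','3']).astype(pd.CategoricalDtype(categories=['1','2','3','4','5']))})

In [19]: df2 = pd.DataFrame({"class": pd.Series(['3','3','5']).astype(pd.CategoricalDtype(categories=['1','2','3','4','5']))})

# dtype=int keeps the indicators numeric so they can be added and scaled.
In [20]: df_1 = pd.get_dummies(df1, dtype=int)

In [21]: df_2 = pd.get_dummies(df2, dtype=int)

# Add the two indicator frames, then multiply each column by its class number.
In [22]: df_1.add(df_2).apply(lambda x: x * list(range(1, len(df_1.columns) + 1)), axis=1).astype(int).rename_axis('id')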
Out[22]: 
    class_1  class_2  class_3  class_4  class_5
id                                             
0         0        2        3        0        0
1         1        0        3        0        0
2         0        0        3        0        5
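As an alternative sketch (not part of the answer above), `Series.str.get_dummies` can split the space-separated cell directly, without building the two frames by hand. The frame `df`, the column name `class`, and the fixed class range 1 to 5 are assumptions taken from the question's example.

```python
import pandas as pd

# Example data copied from the question; the column name "class" is assumed.
df = pd.DataFrame({"id": [0, 1, 2], "class": ["2 3", "1 3", "3 5"]}).set_index("id")

# One indicator column per distinct token in the space-separated cell.
indicators = df["class"].str.get_dummies(sep=" ")

# Assumed full set of classes (1..5); absent classes such as 4 become all-zero columns.
classes = [str(i) for i in range(1, 6)]
indicators = indicators.reindex(columns=classes, fill_value=0)

# Scale each indicator column by its class number, then label the columns.
result = indicators.mul([int(c) for c in classes], axis="columns").add_prefix("class")
print(result)
```

Reindexing against an explicit class list plays the same role as the fixed categories in the answer: it guarantees a column for class 4 even though no row contains it.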

score 3 · Accepted Answer

It may be easier to split the original data frame into three columns:

id  class_a class_b
0   2          3
1   1          3  
2   3          5

Then run ordinary one-hot encoding on that. Afterwards you may end up with duplicated columns such as:

id  ... class_a_3 class_b_3 ... class_b_5
0          0          1             0
1          0          1             0
2          1          0             1

But you can merge/sum them quite easily after the fact.
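A minimal sketch of this split, encode, and merge idea, assuming the question's data sits in a frame `df` with a string column `class`. The names `class_a` and `class_b` follow the table above; everything else is illustrative.

```python
import pandas as pd

# Example data copied from the question; the column name "class" is assumed.
df = pd.DataFrame({"id": [0, 1, 2], "class": ["2 3", "1 3", "3 5"]}).set_index("id")

# Split the cell into two columns named as in the table above.
parts = df["class"].str.split(expand=True)
parts.columns = ["class_a", "class_b"]

# Ordinary one-hot encoding yields columns such as class_a_3 and class_b_3.
dummies = pd.get_dummies(parts, dtype=int)

# Keep only the trailing class number as the label, then sum columns sharing a label.
dummies.columns = [name.rsplit("_", 1)[-1] for name in dummies.columns]
merged = dummies.T.groupby(level=0).sum().T.add_prefix("class_")
print(merged)
```

Note that only classes that actually occur get a column here, so class_4 is missing unless the columns are reindexed afterwards.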

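Similarly, you can pivot the same logic and convert the df into long form (one row per id and class value), then one-hot encode it and aggregate by summing on the key id.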
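The pivot-then-aggregate variant could look roughly like this. It is a sketch under the assumption that `id` is a regular column and `class` holds space-separated strings, as in the question.

```python
import pandas as pd

# Example data copied from the question; the column names are assumed.
df = pd.DataFrame({"id": [0, 1, 2], "class": ["2 3", "1 3", "3 5"]})

# Long format: one row per (id, class) pair.
long_df = df.assign(**{"class": df["class"].str.split()}).explode("class")

# One-hot encode the single class column, then sum the rows belonging to each id.
encoded = pd.get_dummies(long_df, columns=["class"], dtype=int).groupby("id").sum()
print(encoded)
```

Because each cell is exploded into as many rows as it has values, this version also handles cells with more than two numbers.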
