python - One-hot encoding

Question

I have a csv file like this:

text short_text category
...  ...        ...

I opened the file and loaded it into a Pandas DataFrame as follows:

import pandas as pd
filepath = 'path/data.csv'
train = pd.read_csv(filepath, header=0, delimiter=",")

The category field of each record contains a list of categories. It is stored as a string, with each category enclosed in single quotes, like this:

['Adult'   'Aged'   'Aged   80 and over'   'Benzhydryl Compounds/*therapeutic use'   'Cresols/*therapeutic use'   'Double-Blind Method'   'Female'   'Humans'   'Male'   'Middle Aged'   'Muscarinic Antagonists/*therapeutic use'   '*Phenylpropanolamine'   'Tolterodine Tartrate'   'Urinary Incontinence/*drug therapy']

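I want to one-hot encode this for use in machine learning. I understand that this can be implemented with scikit-learn's sklearn.preprocessing package, but I don't know how to do it.

Note: I do not have a list of all possible categories.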

score 0 · Accepted Answer

You can use pd.value_counts to help:

df = pd.DataFrame(dict(
        text=list('ABC'),
        short_text=list('XYZ'),
        category=[list('abc'), list('def'), list('abefxy')]
    ))

df.category.apply(pd.value_counts).fillna(0).astype(int)

Or all together:

# drop(columns=...) replaces drop('category', 1); pandas 2.0+ rejects a positional axis
pd.concat(
    [df.drop(columns='category'),
     df.category.apply(pd.value_counts).fillna(0).astype(int)],
    axis=1
)
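The examples above assume each category cell is already a Python list, whereas the question describes a string with each category wrapped in single quotes. A minimal sketch for turning such a string into a list first, assuming no category name itself contains a single quote (the helper name parse_categories is hypothetical, and the sample string is shortened from the question):

```python
import re

def parse_categories(cell):
    """Return the single-quoted items found in a string such as "['Adult' 'Aged']"."""
    # Assumption: no category name contains a single quote of its own.
    return re.findall(r"'([^']*)'", cell)

raw = "['Adult'   'Aged'   'Aged   80 and over'   'Urinary Incontinence/*drug therapy']"
print(parse_categories(raw))
```

With the question's DataFrame, this could be applied as train['category'] = train['category'].apply(parse_categories) before encoding.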

score 0 · Accepted Answer

As an alternative to piRSquared's answer, you can use sklearn.preprocessing.MultiLabelBinarizer.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
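# pandas 2.0+ requires keyword arguments here: drop(columns=...) and concat(..., axis=1)
pd.concat([
    df.drop(columns='category'),
    # index=df.index keeps rows aligned if df has a non-default index
    pd.DataFrame(mlb.fit_transform(df['category']), columns=mlb.classes_, index=df.index),
], axis=1)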

In my tests this was several orders of magnitude faster, especially on large datasets.
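Since the question notes that the full set of categories is not known in advance, it may help to see how MultiLabelBinarizer treats labels it did not see during fit. A minimal sketch with made-up toy labels (not taken from the question): the classes are learned from the training lists, and in the scikit-learn versions I am aware of, labels unseen at transform time are dropped with a warning rather than raising an error.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

train_labels = [['a', 'b'], ['b', 'c']]   # toy data, not from the question
new_labels = [['a', 'z']]                 # 'z' was never seen during fit

mlb = MultiLabelBinarizer()
train_encoded = mlb.fit_transform(train_labels)  # learns the classes a, b, c

with warnings.catch_warnings():
    warnings.simplefilter('ignore')              # silence the unknown-class warning
    new_encoded = mlb.transform(new_labels)      # 'z' has no column and is ignored

print(list(mlb.classes_))
print(train_encoded.tolist())
print(new_encoded.tolist())
```

If silently dropping unseen categories is not acceptable, one option is to refit on the combined data so that every category gets a column.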
