python - Change column type in pandas

Question

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a)

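What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 to float? Is there a way to specify the types while converting to a DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type of each one? Ideally I would like to do this dynamically, because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.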

score 497 · Accepted Answer

How about this?

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])
df
Out[16]: 
  one  two three
0   a  1.2   4.2
1   b   70  0.03
2   x    5     0

df.dtypes
Out[17]: 
one      object
two      object
three    object

df[['two', 'three']] = df[['two', 'three']].astype(float)

df.dtypes
Out[19]: 
one       object
two      float64
three    float64
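
The question also asks for a dynamic approach that does not name each column. One possible sketch, not part of the answer above, is to try pd.to_numeric on every column and keep the original column whenever parsing fails. The helper name convert_numeric_columns is made up for this illustration.

```python
import pandas as pd

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])


def convert_numeric_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every column that parses fully as numbers is numeric.

    Hypothetical helper for illustration; it is not a pandas API.
    """
    out = frame.copy()
    for col in out.columns:
        try:
            # pd.to_numeric raises if any value in the column cannot be parsed.
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Non-numeric text column: leave it unchanged.
            pass
    return out


converted = convert_numeric_columns(df)
print(converted.dtypes)
```

Note that pd.to_numeric picks the narrowest fitting type, so a column whose strings are all whole numbers would come out as an integer column rather than float.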

score 6 · Accepted Answer

df.info() shows the initial data type of temp, which is float64:

 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    132 non-null    object 
 1   temp    132 non-null    float64

Next, change the data type to int64 with the following code:

df['temp'] = df['temp'].astype('int64')

Running df.info() again shows:

  #   Column  Non-Null Count  Dtype 
 ---  ------  --------------  ----- 
  0   date    132 non-null    object
  1   temp    132 non-null    int64

This shows that the data type of column temp was changed successfully. Happy coding!
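
One thing to keep in mind with this conversion: astype('int64') drops the fractional part instead of rounding, and a plain int64 column cannot hold missing values. The sketch below uses made-up temperature values (not the data from this answer) to show the difference, plus the nullable 'Int64' dtype as one option when gaps are present.

```python
import pandas as pd

# Made-up sample values for illustration only.
temp = pd.Series([21.7, 19.2, 23.9], name='temp')

# astype('int64') truncates toward zero.
truncated = temp.astype('int64')

# Rounding first gives the nearest whole number.
rounded = temp.round().astype('int64')

# With a missing value, the nullable 'Int64' dtype (capital I) keeps the gap as <NA>.
with_gap = pd.Series([21.7, None], name='temp')
nullable = with_gap.round().astype('Int64')

print(truncated.tolist())
print(rounded.tolist())
print(nullable)
```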

score 0 · Accepted Answer

If you have many object columns, like this DataFrame with 74 object columns and 2 int columns, where each value carries letters representing its unit:

import pandas as pd 
import numpy as np
dataurl = 'https://raw.githubusercontent.com/RubenGavidia/Pandas_Portfolio.py/main/Wes_Mckinney.py/nutrition.csv'
nutrition = pd.read_csv(dataurl,index_col=[0])
nutrition.head(3)

    name    serving_size    calories    total_fat   saturated_fat   cholesterol sodium  choline folate  folic_acid  ... fat saturated_fatty_acids   monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine    theobromine water
0   Cornstarch  100 g   381 0.1g    NaN 0   9.00 mg 0.4 mg  0.00 mcg    0.00 mcg    ... 0.05 g  0.009 g 0.016 g 0.025 g 0.00 mg 0.0 g   0.09 g  0.00 mg 0.00 mg 8.32 g
1   Nuts, pecans    100 g   691 72g 6.2g    0   0.00 mg 40.5 mg 22.00 mcg   0.00 mcg    ... 71.97 g 6.180 g 40.801 g    21.614 g    0.00 mg 0.0 g   1.49 g  0.00 mg 0.00 mg 3.52 g
2   Eggplant, raw   100 g   25  0.2g    NaN 0   2.00 mg 6.9 mg  22.00 mcg   0.00 mcg    ... 0.18 g  0.034 g 0.016 g 0.076 g 0.00 mg 0.0 g   0.66 g  0.00 mg 0.00 mg 92.30 g
3 rows × 76 columns

nutrition.dtypes
name             object
serving_size     object
calories          int64
total_fat        object
saturated_fat    object
                  ...  
alcohol          object
ash              object
caffeine         object
theobromine      object
water            object
Length: 76, dtype: object

nutrition.dtypes.value_counts()
object    74
int64      2
dtype: int64

A good way to convert all columns to numeric is to use a regular expression to replace the units with nothing, and then use astype(float) to change the column data types to float:

nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
nutrition.head(3)

serving_size    calories    total_fat   saturated_fat   cholesterol sodium  choline folate  folic_acid  niacin  ... fat saturated_fatty_acids   monounsaturated_fatty_acids polyunsaturated_fatty_acids fatty_acids_total_trans alcohol ash caffeine    theobromine water
name                                                                                    
Cornstarch  100.0   381.0   0.1 NaN 0.0 9.0 0.4 0.0 0.0 0.000   ... 0.05    0.009   0.016   0.025   0.0 0.0 0.09    0.0 0.0 8.32
Nuts, pecans    100.0   691.0   72.0    6.2 0.0 0.0 40.5    22.0    0.0 1.167   ... 71.97   6.180   40.801  21.614  0.0 0.0 1.49    0.0 0.0 3.52
Eggplant, raw   100.0   25.0    0.2 NaN 0.0 2.0 6.9 22.0    0.0 0.649   ... 0.18    0.034   0.016   0.076   0.0 0.0 0.66    0.0 0.0 92.30
3 rows × 75 columns

nutrition.dtypes
serving_size     float64
calories         float64
total_fat        float64
saturated_fat    float64
cholesterol      float64
                  ...   
alcohol          float64
ash              float64
caffeine         float64
theobromine      float64
water            float64
Length: 75, dtype: object

nutrition.dtypes.value_counts()
float64    75
dtype: int64

Now the dataset is clean, and you can perform numeric operations on this DataFrame, using only a regex and astype().
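
For readers who want to try the idea without downloading the CSV, here is a minimal sketch on a tiny hand-made frame. The rows are a small made-up subset shaped like the data above, and the pattern is a slight variation that also removes the whitespace before the unit. It assumes the values contain no letters other than units; something like '1e-5' would be damaged by this regex.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample shaped like the nutrition data.
sample = pd.DataFrame(
    {
        'name': ['Cornstarch', 'Nuts, pecans'],
        'serving_size': ['100 g', '100 g'],
        'total_fat': ['0.1g', '72g'],
        'saturated_fat': [np.nan, '6.2g'],
        'sodium': ['9.00 mg', '0.00 mg'],
    }
).set_index('name')

# Remove the unit letters (and any whitespace before them), then cast to float.
cleaned = sample.replace(r'\s*[a-zA-Z]+', '', regex=True).astype(float)

print(cleaned)
print(cleaned.dtypes)
```

Missing values stay as NaN, so they survive the cast to float.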

If you want to collect the units and append them to the headers, as in cholesterol_mg, you can use the following code:

nutrition.index = pd.RangeIndex(start = 0, stop = 8789, step= 1)
nutrition.set_index('name',inplace = True)
# Keep only the letters of each value, i.e. the unit text
units = nutrition.astype(str).replace('[^a-zA-Z]','', regex= True)
units = units.mode()
units = units.replace('', np.nan).dropna(axis=1)
mapper = { k: k + "_" + units[k].at[0] for k in units}
nutrition.rename(columns=mapper, inplace=True)
nutrition.replace('[a-zA-Z]','', regex= True, inplace=True)
nutrition=nutrition.astype(float)
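
The same idea can be tried on a small made-up frame. This sketch differs from the code above in one respect: it takes the first row of mode() directly with .iloc[0] and filters out empty units, which is an assumption made here so that columns with tied modes are not dropped by accident. All sample values are invented for illustration.

```python
import pandas as pd

# Made-up sample: two columns with units and one plain integer column.
sample = pd.DataFrame(
    {
        'cholesterol': ['0 mg', '5 mg', '12 mg'],
        'water': ['8.32 g', '3.52 g', '92.30 g'],
        'calories': [381, 691, 25],
    }
)

# Keep only the letters of each value, then take the most frequent unit per column.
unit_text = sample.astype(str).replace('[^a-zA-Z]', '', regex=True)
most_common = unit_text.mode().iloc[0]

# Columns without any unit end up with an empty string; skip them.
most_common = most_common[most_common != '']

# Build the rename mapping, e.g. 'cholesterol' -> 'cholesterol_mg'.
mapper = {col: f'{col}_{unit}' for col, unit in most_common.items()}
renamed = sample.rename(columns=mapper)

# Strip the units from the values and convert everything to float.
numeric = renamed.replace(r'\s*[a-zA-Z]+', '', regex=True).astype(float)

print(numeric)
```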
