python - Prevent pandas from automatically inferring types in read_csv

Question

I have a file with 3 columns separated by #. The first column is an integer, the second looks like a float but isn't, and the third is a string. I am trying to load it directly into Python with pandas.read_csv:

In [149]: d = pandas.read_csv('resources/names/fos_names.csv',  sep='#', header=None, names=['int_field', 'floatlike_field', 'str_field'])

In [150]: d
Out[150]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1673 entries, 0 to 1672
Data columns:
int_field          1673  non-null values
floatlike_field    1673  non-null values
str_field          1673  non-null values
dtypes: float64(1), int64(1), object(1)

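pandas is being smart and automatically converts the fields to convenient types. The problem is that I don't actually want it to do that (if I did, I would have used the converters argument). How can I prevent pandas from converting types automatically?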

score 12 · Accepted Answer

I'm planning to add explicit column dtypes in the upcoming file parser engine overhaul in pandas 0.10. I can't commit to this 100%, but it should be fairly simple once the new infrastructure is in place (http://wesmckinney.com/blog/?p=543).
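For readers on a pandas release where read_csv accepts a dtype argument, a minimal sketch of how explicit column dtypes can be used follows. The sample data and the variable names are made up for illustration, and io.StringIO stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical sample in the same '#'-separated layout as the question.
sample = "1#2.310#one\n2#3.120#two\n3#1.320#three\n"
names = ["int_field", "floatlike_field", "str_field"]

# Option 1: read every column as text, so nothing is inferred.
df_all_str = pd.read_csv(
    io.StringIO(sample), sep="#", header=None, names=names, dtype=str
)

# Option 2: pin only the float-like column to text and let
# pandas infer the remaining columns as usual.
df_one_col = pd.read_csv(
    io.StringIO(sample),
    sep="#",
    header=None,
    names=names,
    dtype={"floatlike_field": str},
)

print(df_all_str["floatlike_field"].tolist())
print(df_one_col.dtypes)
```

Passing dtype=str keeps the text exactly as written in the file, including trailing zeros that a float conversion would drop.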

score 7 · Accepted Answer

I think your best bet is to read the data in as a record array using numpy first.

# what you described:
In [15]: import numpy as np
In [16]: import pandas
In [17]: x = pandas.read_csv('weird.csv')

In [19]: x.dtypes
Out[19]: 
int_field            int64
floatlike_field    float64  # what you don't want?
str_field           object

In [20]: datatypes = [('int_field','i4'),('floatlike','S10'),('strfield','S10')]

In [21]: y_np = np.loadtxt('weird.csv', dtype=datatypes, delimiter=',', skiprows=1)

In [22]: y_np
Out[22]: 
array([(1, '2.31', 'one'), (2, '3.12', 'two'), (3, '1.32', 'three ')], 
      dtype=[('int_field', '<i4'), ('floatlike', '|S10'), ('strfield', '|S10')])

In [23]: y_pandas = pandas.DataFrame.from_records(y_np)

In [25]: y_pandas.dtypes
Out[25]: 
int_field     int64
floatlike    object  # better?
strfield     object
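The session above can be sketched as a self-contained script. This variant assumes Python 3, where 'S10' fields come back as bytes, so it uses the 'U10' (unicode) type instead. The file contents are made up to match the output shown, and io.StringIO stands in for weird.csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical contents of weird.csv: a header row, then three records.
csv_text = (
    "int_field,floatlike_field,str_field\n"
    "1,2.31,one\n"
    "2,3.12,two\n"
    "3,1.32,three\n"
)

# One (name, type) pair per column; 'U10' is text of up to 10 characters.
datatypes = [
    ("int_field", "i4"),
    ("floatlike_field", "U10"),
    ("str_field", "U10"),
]

# Read into a structured (record) array, skipping the header row.
records = np.loadtxt(
    io.StringIO(csv_text), dtype=datatypes, delimiter=",", skiprows=1
)

# Build the DataFrame from the records; the float-like column stays text.
df = pd.DataFrame.from_records(records)
print(df.dtypes)
```

Note that a fixed-width type such as 'U10' truncates longer values, so the width has to be chosen to fit the data.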
