python - Delete a column from a very large CSV file using pandas or blaze

Question

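I have a very large csv file (5 GB), so I don't want to load the whole thing into memory, and I want to drop one or more of its columns. I tried the following code with blaze, but all it did was append the resulting columns to the existing csv file: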

from blaze import Data, odo
d = Data("myfile.csv")
d = d[columns_I_want_to_keep]
odo(d, "myfile.csv")

Is there a way, using either pandas or blaze, to keep only the columns I want and delete the others?

score 6 · Accepted Answer

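You can use dask.dataframe, which is syntactically similar to pandas but performs its operations out-of-core, so memory shouldn't be an issue. It also parallelizes the work automatically, so it should be faster.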

import dask.dataframe as dd

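df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
# note: depending on the dask version, this may write one file per partition;
# newer versions accept single_file=True to produce a single CSV
df.to_csv('output.csv', index=False)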

Timings

I timed each of the methods posted so far on a 1.4 GB csv file. I kept four columns, which left an output csv file of 250 MB.

Using Dask:

%%timeit
df = dd.read_csv(f_in, usecols=cols_to_keep)
df.to_csv(f_out, index=False)

1 loop, best of 3: 41.8 s per loop

Using Pandas:

%%timeit
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False)

1 loop, best of 3: 44.2 s per loop

Using Python/CSV:

%%timeit
inc_f = open(f_in, 'r')
csv_r = csv.reader(inc_f)
out_f = open(f_out, 'w')
csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
for row in csv_r:
    new_row = [row[1], row[5], row[6], row[8]]
    csv_w.writerow(new_row)
inc_f.close()
out_f.close()

1 loop, best of 3:  1min 1s per loop

score 2 · Accepted Answer

I would do it this way:

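import pandas as pd

cols2keep = ['col1','col3','col4','col6'] # columns you want to have in the resulting CSV file
chunksize = 10**5  # you may want to adjust it ...
for i, chunk in enumerate(pd.read_csv(filename, chunksize=chunksize, usecols=cols2keep)):
    # write the header only with the first chunk; otherwise it is repeated for every chunk
    chunk.to_csv('output.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)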

P.S. If it suits your use case, you might also consider moving from CSV to PyTables (HDF5)...
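
If it is easier to name the columns to delete than the ones to keep, the same chunked loop can take a callable for usecols, which pandas evaluates against each column name. The following is only a sketch of that variation: the function name drop_columns and its parameters are made up for illustration, and it assumes the file has a header row.

```python
import pandas as pd


def drop_columns(src, dst, cols_to_drop, chunksize=10**5):
    """Stream src to dst in chunks, omitting the columns named in cols_to_drop.

    drop_columns and its arguments are hypothetical names for this sketch.
    """
    drop = set(cols_to_drop)
    reader = pd.read_csv(
        src,
        chunksize=chunksize,
        # the callable is evaluated for each column name; True means "keep"
        usecols=lambda name: name not in drop,
    )
    for i, chunk in enumerate(reader):
        # the first chunk creates the file and writes the header;
        # later chunks are appended without a header
        chunk.to_csv(dst, mode="w" if i == 0 else "a", header=(i == 0), index=False)
```

For example, drop_columns('myfile.csv', 'output.csv', ['col2', 'col5']) would write every column except col2 and col5. Only one chunk is held in memory at a time, so chunksize is the main knob for trading memory against speed.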

score 1 · Accepted Answer

I often work with large csv files. Here is my solution:

import csv

fname_in = r'C:\mydir\myfile_in.csv'
fname_out = r'C:\mydir\myfile_out.csv'
# newline='' lets the csv module handle line endings itself (Python 3);
# the with-block closes both files even if an error occurs
with open(fname_in, 'r', newline='') as inc_f, open(fname_out, 'w', newline='') as out_f:
    csv_r = csv.reader(inc_f)  # attach the csv "lens" to the input stream - default is excel dialect
    csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')  # attach the csv "lens" to the stream headed to the output file
    for row in csv_r:  # loop through each row in the input file
        new_row = row[:]  # copy the row to initialize the output row
        new_row.pop(5)  # index of whatever column you wanted to delete
        csv_w.writerow(new_row)
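
Since the question asks to remove columns from the existing file, a variation on the loop above could select columns by header name and then swap the result in place of the original. This is a minimal sketch, not the answer's own method: drop_columns_by_name is a hypothetical helper, and it assumes the first row is a header and that every row has as many fields as the header.

```python
import csv
import os
import tempfile


def drop_columns_by_name(path, cols_to_drop):
    """Rewrite the CSV at `path` without the named columns, one row at a time.

    Hypothetical helper: assumes a header row and rows of consistent length.
    """
    drop = set(cols_to_drop)
    # create the temporary file next to the original so the final rename
    # stays on the same filesystem
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=directory)
    try:
        with open(path, "r", newline="") as src, os.fdopen(fd, "w", newline="") as dst:
            reader = csv.reader(src)
            writer = csv.writer(dst, lineterminator="\n")
            header = next(reader)
            # positions of the columns that survive
            keep = [i for i, name in enumerate(header) if name not in drop]
            writer.writerow([header[i] for i in keep])
            for row in reader:
                writer.writerow([row[i] for i in keep])
        # replace the original only after the new file is completely written
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
```

Writing to a temporary file first means the original stays intact if the process is interrupted halfway through. The cost is that the disk briefly needs room for both copies.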
