python - HDFStore: table.select and RAM usage

Question

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.

I am using pandas 0.11-dev, Python 2.7, linux64.

In this first case, the RAM usage fits the size of chunk:

with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass

In this second case, it seems that the whole table is loaded into RAM:

r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))

In this last case, the RAM usage fits the equivalent chunk size:

r=random.choice(400000,size=30,replace=False)    
train.select('train',pd.Term("index",r))

I am puzzled as to why going from 30 to 40 random rows causes such a dramatic increase in RAM usage.

Note that the table was indexed when it was created, such that index=range(nrows(table)), using the following code:

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Thanks for any insight.

Edit in response to the answer

Here is the file I used to write Train.csv to train.h5. I wrote it using elements of Zelazny7's code from "How to trouble-shoot HDFStore Exception: cannot find the correct atom type".

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer


def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000 ):
    max_len = pd.read_table(infile,header=header, sep=sep,nrows=5).apply( object_max_len).max()
    dtypes0 = pd.read_table(infile,header=header, sep=sep,nrows=5).dtypes

    for chunk in pd.read_table(infile,header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply( object_max_len)).max(),max_len))
        for i,k in enumerate(zip( dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    #as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0


def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000 ):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)

    with pd.get_store( storefile,'w') as store:
        for i, chunk in enumerate(pd.read_table(infile,header=header,sep=sep,chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index= range( chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name,chunk, min_itemsize={'values':max_len})

Applied as:

txtfile2hdfstore('Train.csv','train.h5','train',sep=',')

score 6 · Accepted Answer

This is a known issue; see the reference here: https://github.com/pydata/pandas/pull/2755

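Essentially, the query is turned into a numexpr expression for evaluation. The problem is that a large number of or conditions cannot be passed to numexpr (the limit depends on the total length of the generated expression).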

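So I simply limit the expression that is passed to numexpr. If it exceeds a certain number of or conditions, the query is run as a filter rather than as an in-kernel selection. Basically, this means the table is read and then reindexed.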
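To make the two paths concrete, here is a toy model of the decision described above. It is only a sketch and not pandas' actual implementation: `MAX_OR_TERMS`, `build_condition`, and `describe_plan` are hypothetical names, and the cutoff of 31 is a placeholder chosen to sit between the 30 and 40 ids reported in the question.

```python
# Hypothetical cutoff, for illustration only. The real limit lives inside
# pandas and depends on the length of the generated expression.
MAX_OR_TERMS = 31


def build_condition(ids):
    """Return an or-chained condition string, or None when there are too many ids."""
    if len(ids) > MAX_OR_TERMS:
        return None  # too long for the expression evaluator
    return " | ".join("(index == %d)" % i for i in ids)


def describe_plan(ids):
    """Say which of the two paths a query over `ids` would take in this toy model."""
    cond = build_condition(ids)
    if cond is None:
        # Fallback: the whole table is read and the wanted rows are kept afterwards,
        # so memory scales with the table, not with len(ids).
        return "filter: read the table, then keep %d rows" % len(ids)
    # In-kernel: only matching rows are materialised.
    return "in-kernel: evaluate a %d-character condition" % len(cond)


print(describe_plan(list(range(30))))
print(describe_plan(list(range(40))))
```

In this model the jump in memory between 30 and 40 ids is just the query crossing the cutoff and switching from the in-kernel path to the read-then-filter path.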

This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).

As a workaround, split your query into several smaller queries and concatenate the results. This should be much faster and use a constant amount of memory.
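A minimal sketch of that workaround follows. `select_in_batches` is a hypothetical helper, not part of pandas; it takes any callable that runs one query for a list of index values, so the batching logic stays independent of the store API. The default `batch_size=30` is an assumption taken from the question, where 30 ids stayed within the expected memory and 40 did not.

```python
import numpy as np
import pandas as pd


def select_in_batches(select_batch, row_ids, batch_size=30):
    """Run one small query per batch of ids and concatenate the results.

    select_batch: callable taking a list of index values and returning a DataFrame.
    row_ids:      the index values to fetch.
    batch_size:   ids per query (assumed small enough to stay on the in-kernel path).
    """
    # Sort so the concatenated result comes back in index order.
    row_ids = np.sort(np.asarray(row_ids))
    parts = []
    for start in range(0, len(row_ids), batch_size):
        batch = row_ids[start:start + batch_size].tolist()
        parts.append(select_batch(batch))
    if not parts:
        return pd.DataFrame()
    return pd.concat(parts)


# Usage with the store from the question (same pandas 0.11-era calls as above):
# with pd.get_store("train.h5", 'r') as train:
#     r = np.random.choice(400000, size=40, replace=False)
#     rows = select_in_batches(
#         lambda ids: train.select('train', pd.Term("index", ids)), r)
```

Each call then sends only a short or-chain to the query engine, so memory use is bounded by the batch size rather than by the size of the table.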
