python - Pandas - read_hdf or store.select returns incorrect results for a query

Question

I have a large dataset (4 million rows, 50 columns) stored via pandas store.append. When I use either store.select or read_hdf with a query on two columns being greater than certain values (i.e. "(a > 10) & (b > 1)"), I get about 15,000 rows back.

When I read the entire table into, say, df and run df[(df.a > 10) & (df.b > 1)], I get 30,000 rows. I have narrowed the problem down: when I read in the entire table and run df.query("(a > 10) & (b > 1)"), I get the same 15,000 rows, but when I set the engine to python, i.e. df.query("(a > 10) & (b > 1)", engine = 'python'), I get the 30,000 rows.
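The three in-memory selection paths described above can be put side by side on a small frame. This is only a sketch on synthetic data: the frame, its size, and the planted NaNs in column a are assumptions for illustration (NaNs are included because the accepted answer traces the problem to them). On a current pandas install the three counts are expected to agree; the mismatch in this question was observed with pandas 0.14.1 and numexpr 2.4.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real table (hypothetical data).
# Columns a and b are float64, with NaNs planted in every 10th row of a.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.uniform(0, 50, 1000),
    "b": rng.uniform(0, 3, 1000),
})
df.loc[::10, "a"] = np.nan

# Path 1: plain boolean indexing.
mask_rows = df[(df.a > 10) & (df.b > 1)]
# Path 2: DataFrame.query with the default engine (numexpr if installed).
query_default = df.query("(a > 10) & (b > 1)")
# Path 3: DataFrame.query forced onto the pure-Python engine.
query_python = df.query("(a > 10) & (b > 1)", engine="python")

counts = {
    "boolean mask": len(mask_rows),
    "query (default engine)": len(query_default),
    "query (engine='python')": len(query_python),
}
print(counts)
```

If the counts differ on a given install, comparing the index labels of the results narrows down which rows one path drops.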

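I suspect it has something to do with the eval/numexpr machinery that both the HDF and the query methods use to evaluate queries.

The dtypes of columns a and b are float64, and querying with a float (i.e. 1. instead of 1) does not fix the problem.

I would appreciate any feedback. If other users are hitting the same problem, this needs to be fixed.

Regards, Neil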

========================

Here is some info:

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
nose: 1.3.3
Cython: None
numpy: 1.8.0
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 1.2.1
sphinx: 1.2.2
patsy: 0.2.0
scikits.timeseries: 0.91.3
dateutil: 2.2
pytz: 2013.8
bottleneck: 0.7.0
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 2.0.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: None
html5lib: 0.95-dev
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

df.info() ---> on the 15,000 or so selected rows

Int64Index: 15533 entries, 67302 to 142465

Data columns (total 47 columns):

date 15533 non-null datetime64[ns]
text 15533 non-null object
date2 1090 non-null datetime64[ns]
x1 15533 non-null float64
x2 15533 non-null float64
x3 15533 non-null float64
x4 15533 non-null float64
x5 15533 non-null float64
x6 15533 non-null float64
x7 15533 non-null float64
x8 15533 non-null float64
x9 15533 non-null float64
x10 15533 non-null float64
x11 15533 non-null float64
x12 15533 non-null float64
x13 15533 non-null float64
x14 15533 non-null float64
x15 15533 non-null float64
x16 15533 non-null float64
x17 15533 non-null float64
x18 15533 non-null float64
a 15533 non-null float64
x19 15533 non-null float64
x20 15533 non-null float64
x21 15533 non-null float64
x22 15533 non-null float64
x23 15533 non-null float64
x24 15533 non-null float64
b 15533 non-null float64
x25 15533 non-null float64
x26 15533 non-null float64
x27 15533 non-null float64
x28 15533 non-null float64
x29 15533 non-null float64
x30 15533 non-null float64
x31 15497 non-null float64
x32 15497 non-null float64
x33 15497 non-null float64
x34 15497 non-null float64
x35 15533 non-null int64
x36 15533 non-null int64
x37 15533 non-null int64
x38 15533 non-null int64
x39 15533 non-null int64
x40 15533 non-null int64
x41 15533 non-null int64
x42 15533 non-null int64
dtypes: datetime64[ns](2), float64(36), int64(8), object(1)

ptdump -av file

/ (RootGroup) ''
/._v_attrs (AttributeSet), 4 attributes:
[CLASS := 'GROUP',
PYTABLES_FORMAT_VERSION := '2.1',
TITLE := '',
VERSION := '1.0']
/MKT (Group) ''
/MKT._v_attrs (AttributeSet), 14 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'],
encoding := None,
index_cols := [(0, 'index')],
info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
levels := 1,
nan_rep := 'nan',
non_index_axes := [(1, ['date', 'text', 'date2', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'a', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'b', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42'])],
pandas_type := 'frame_table',
pandas_version := '0.10.1',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1', 'date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']]
/MKT/table (Table(3637597,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
"values_block_1": Float64Col(shape=(29,), dflt=0.0, pos=2),
"date": Int64Col(shape=(), dflt=0, pos=3),
"text": StringCol(itemsize=30, shape=(), dflt='', pos=4),
"a": Float64Col(shape=(), dflt=0.0, pos=5),
"x20": Float64Col(shape=(), dflt=0.0, pos=6),
"x23": Float64Col(shape=(), dflt=0.0, pos=7),
"x24": Float64Col(shape=(), dflt=0.0, pos=8),
"b": Float64Col(shape=(), dflt=0.0, pos=9),
"x25": Float64Col(shape=(), dflt=0.0, pos=10),
"x26": Float64Col(shape=(), dflt=0.0, pos=11),
"x35": Int64Col(shape=(), dflt=0, pos=12),
"x36": Int64Col(shape=(), dflt=0, pos=13),
"x37": Int64Col(shape=(), dflt=0, pos=14),
"x38": Int64Col(shape=(), dflt=0, pos=15),
"x39": Int64Col(shape=(), dflt=0, pos=16),
"x40": Int64Col(shape=(), dflt=0, pos=17),
"x41": Int64Col(shape=(), dflt=0, pos=18),
"x42": Int64Col(shape=(), dflt=0, pos=19)}
byteorder := 'little'
chunkshape := (322,)
autoindex := True
colindexes := {
"x41": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x20": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x37": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x42": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x26": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x38": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x40": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x36": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"text": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x23": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x39": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x25": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x24": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"a": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"x35": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"b": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/MKT/table._v_attrs (AttributeSet), 83 attributes:
[CLASS := 'TABLE',
x23_dtype := 'float64',
x23_kind := ['x23'],
x20_dtype := 'float64',
x20_kind := ['x20'],
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_10_FILL := 0.0,
FIELD_10_NAME := 'x25',
FIELD_11_FILL := 0.0,
FIELD_11_NAME := 'x26',
FIELD_12_FILL := 0,
FIELD_12_NAME := 'x35',
FIELD_13_FILL := 0,
FIELD_13_NAME := 'x36',
FIELD_14_FILL := 0,
FIELD_14_NAME := 'x37',
FIELD_15_FILL := 0,
FIELD_15_NAME := 'x38',
FIELD_16_FILL := 0,
FIELD_16_NAME := 'x39',
FIELD_17_FILL := 0,
FIELD_17_NAME := 'x40',
FIELD_18_FILL := 0,
FIELD_18_NAME := 'x41',
FIELD_19_FILL := 0,
FIELD_19_NAME := 'x42',
FIELD_1_FILL := 0,
FIELD_1_NAME := 'values_block_0',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'values_block_1',
FIELD_3_FILL := 0,
FIELD_3_NAME := 'date',
FIELD_4_FILL := '',
FIELD_4_NAME := 'text',
FIELD_5_FILL := 0.0,
FIELD_5_NAME := 'a',
FIELD_6_FILL := 0.0,
FIELD_6_NAME := 'x20',
FIELD_7_FILL := 0.0,
FIELD_7_NAME := 'x23',
FIELD_8_FILL := 0.0,
FIELD_8_NAME := 'x24',
FIELD_9_FILL := 0.0,
FIELD_9_NAME := 'b',
a_dtype := 'float64',
a_kind := ['a'],
NROWS := 3637597,
TITLE := '',
VERSION := '2.7',
x24_dtype := 'float64',
x24_kind := ['x24'],
b_dtype := 'float64',
b_kind := ['b'],
x25_dtype := 'float64',
x25_kind := ['x25'],
x26_dtype := 'float64',
x26_kind := ['x26'],
date_dtype := 'datetime64',
date_kind := ['date'],
x39_dtype := 'int64',
x39_kind := ['x39'],
x37_dtype := 'int64',
x37_kind := ['x37'],
x41_dtype := 'int64',
x41_kind := ['x41'],
x35_dtype := 'int64',
x35_kind := ['x35'],
x40_dtype := 'int64',
x40_kind := ['x40'],
x38_dtype := 'int64',
x38_kind := ['x38'],
x42_dtype := 'int64',
x42_kind := ['x42'],
x36_dtype := 'int64',
x36_kind := ['x36'],
index_kind := 'integer',
text_dtype := 'string240',
text_kind := ['text'],
values_block_0_dtype := 'datetime64',
values_block_0_kind := ['date2'],
values_block_1_dtype := 'float64',
values_block_1_kind := ['x22', 'x18', 'x21', 'x16', 'x19', 'x17', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x29', 'x30', 'x28', 'x2', 'x1', 'x3', 'x10', 'x27', 'x11', 'x12', 'x13', 'x14', 'x15', 'x33', 'x32', 'x34', 'x31']]

This is how I read the table:

import pandas as pd

df = pd.DataFrame()
store = pd.HDFStore('/Users/neil/MKT.h5')
df = store.select('MKT', "(a > 10) & (b > 1)")
store.close()

This is how I write/fill the table:

store = pd.HDFStore('/Users/neil/MKT.h5')

listofsearchablevars = ['date', 'text', 'a', 'x20', 'x23', 'x24', 'b', 'x25', 'x26', 'x35', 'x36', 'x37', 'x38', 'x39', 'x40', 'x41', 'x42']

df = .....

store.append('MKT', df, data_columns = listofsearchablevars, nan_rep = 'nan', chunksize=500000, min_itemsize = {'values': 30})

store.close()
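One way to sanity-check a table like this is to write a frame, query it back with store.select, and compare the row count against the same condition evaluated in memory. The helper below is a rough sketch, not the author's code: the function name, the scratch file name, and the choice to index only a and b as data_columns are assumptions, and it needs PyTables installed. Whether the two counts agree when the queried columns contain NaN may depend on the pandas and PyTables versions in use.

```python
import os
import tempfile

import pandas as pd


def compare_hdf_and_memory(df, key="MKT", where="(a > 10) & (b > 1)"):
    """Write df to a temporary HDF5 table, then count matches both ways.

    Returns (rows_from_store_select, rows_from_boolean_mask).
    The key and where strings mirror the ones used in the question.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check.h5")  # hypothetical scratch file
        with pd.HDFStore(path, mode="w") as store:
            # Only columns listed in data_columns can be used in `where`.
            store.append(key, df, data_columns=["a", "b"])
            from_store = len(store.select(key, where=where))
    # Same condition evaluated on the in-memory frame.
    in_memory = len(df[(df.a > 10) & (df.b > 1)])
    return from_store, in_memory
```

Calling compare_hdf_and_memory(df) on a sample of the real data and checking that both numbers match is a cheap test to run before trusting query results from the full table.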

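EDIT: In response to the request to provide some sample data....

DATA

For clarity, let's call the 15,000-row result "INCORRECT" and the 30,000-row result "CORRECT". Let's call the items that are in CORRECT but not in INCORRECT "Only in CORRECT".

I have confirmed that all rows/items in INCORRECT are fully contained in CORRECT.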
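A containment check like this can be done on the index labels of the two results. The sketch below uses toy frames with made-up values, the function name is hypothetical, and it assumes the index labels are unique (a table built from several appends may not satisfy that, in which case a comparison on key columns would be needed instead).

```python
import pandas as pd


def compare_selections(correct, incorrect):
    """Return (is_subset, only_in_correct) for two selections of one table.

    Assumes both frames carry unique index labels from the same source.
    """
    # Every label of the smaller result must appear in the larger one.
    is_subset = bool(incorrect.index.isin(correct.index).all())
    # Labels present in CORRECT but missing from INCORRECT.
    missing = correct.index.difference(incorrect.index)
    return is_subset, correct.loc[missing]


# Toy data (hypothetical values, not rows from the real table).
correct = pd.DataFrame(
    {"a": [12.0, 25.0, 40.0], "b": [1.5, 2.0, 1.1]},
    index=[3, 1, 2],
)
incorrect = correct.loc[[1]]

is_subset, only_in_correct = compare_selections(correct, incorrect)
print(is_subset)
print(only_in_correct)
```

Inspecting the "Only in CORRECT" rows column by column (for example with isna()) is then a direct way to look for what they have in common.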

Here are rows of data from each (I took rows 10000 and 10001 of each).

Only in CORRECT:

                    9869                 9870
date   2001-08-10 00:00:00  2001-08-17 00:00:00
text                   DCR                  DCR
date2                  NaN                  NaN
x19                    1.9               1.8396
x18                   1.98                  1.9
x20                    1.8                  1.8
x9                    2.54                 2.54
x10                   5.25                5.125
x11                  9.625                9.625
x12                   1.61                  1.7
x13                   1.05                 1.05
x14                   1.05                 1.05
x21                  75700                64800
x23               140992.7             116948.9
x24           0.0008284454         0.0007097211
x25            0.002580505          0.002630241
x26            0.001540047          0.001440302
x27            0.001850877          0.001832468
x5                  17.915               17.915
x8                  17.915               17.915
x2                 34.0379              32.9563
a                  34.0385             32.95643
x6               -42.80079            -42.80079
x7               -8.762288            -9.844354
x4                       0                    0
x1           -0.0003349149        -0.0003349149
x3           -0.0003349149        -0.0003349149
x28              1.579e+07            1.579e+07
b                 1.261029             1.302433
x29               1.284075             1.326236
x30               1.488814             1.537697
x22             -0.2891579           -0.3205045
x17                   0.31                 0.31
x15                   0.84                 0.84
x16                 2.5937               2.5937
x34                  6.895                7.105
x32               -1.29055             -1.35055
x31                  -0.77                -0.63
x33                 -0.665                -0.49
x38                      1                    1
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

INCORRECT:

                    153641               153642
date   2008-08-22 00:00:00  2008-08-29 00:00:00
text                   PRL                  PRL
date2                  NaN                  NaN
x19                    1.9                 1.88
x18                   1.95                 1.94
x20                   1.85                 1.87
x9                    2.07                 2.07
x10                   2.23                 2.23
x11                   2.94                 2.94
x12                   1.75                 1.75
x13                   1.71                 1.71
x14                   1.69                 1.69
x21                 133549                73525
x23               254119.1             140764.5
x24            0.001485416         0.0008315729
x25            0.001227271          0.001204803
x26            0.001006876          0.001048327
x27           0.0009764919         0.0009638125
x5                  18.008               18.008
x8                  18.058               18.058
x2                 34.2152               33.855
a                  34.3102             33.94904
x6               -35.07229            -35.07229
x7              -0.7620911            -1.123251
x4                       0                    0
x1               0.0111308            0.0111308
x3               0.0111308            0.0111308
x28             1.5488e+08           1.5488e+08
b                 1.251983             1.265302
x29               1.272828             1.286369
x30               1.247996             1.261273
x22              0.1368421            0.1489362
x17                   0.16                 0.16
x15                    0.2                  0.2
x16                   0.47                 0.47
x34                   2.25                 2.34
x32                  1.395                1.365
x31                   1.25                 1.31
x33                  1.175                 1.25
x38                      1                    1
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

CORRECT:

                    99723                99725
date   2009-11-27 00:00:00  2009-12-11 00:00:00
text                   ACL                  ACL
date2                  NaN                  NaN
x19                   1.17                  1.2
x18                   1.22                 1.39
x20                   1.11                 1.14
x9                    1.76                 1.76
x10                   1.76                 1.76
x11                   1.76                 1.76
x12                   0.63                 0.74
x13                   0.36                 0.36
x14                   0.17                 0.17
x21                 285474               709374
x23               333678.1             868999.7
x24           0.0005489386          0.001393863
x25            0.002350057          0.002279827
x26            0.002160912          0.002111369
x27            0.002428953          0.002244943
x5                 103.908              103.908
x8                 103.908              103.908
x2                121.5721             124.6894
a                 121.5724             124.6896
x6                92.16074             92.16074
x7                213.7331             216.8503
x4                       0                    0
x1            -0.008266928         -0.008266928
x3            -0.008266928         -0.008266928
x28             0.02743141           0.02703708
b                 1.037747             1.011804
x29               1.421532             1.385994
x30                1.52714             1.488961
x22               1.213675                  1.7
x17                   0.47                 0.47
x15                   0.48                 0.48
x16                   0.48                 0.48
x34                   0.32                 0.32
x32                   1.04                 1.04
x31                   -0.6                 -0.6
x33                -0.5901               -0.479
x38                      0                    0
x42                      0                    0
x36                      0                    0
x40                      0                    0
x35                      0                    0
x39                      0                    0
x37                      0                    0
x41                      0                    0

score 0 · Accepted Answer

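Success!!!!! I filled in all the NaNs in the data, and read_hdf now returns the correct 30,000 rows. There were NaNs in column a (one of the data_columns used in the query, a > 10). Man, that was painful. FYI: out of paranoia, and to eliminate any situation where this could recur, I am going to fillna(0) the entire table, because I cannot risk drawing conclusions from this analysis based on incorrect or incomplete queries of the table. It was definitely a NaN issue.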
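A quick audit of the queryable columns before appending would have surfaced this. The following is a minimal sketch on a toy frame: the helper name and the sample values are made up, and only three of the real data_columns are shown.

```python
import numpy as np
import pandas as pd


def nan_report(df, columns):
    """Count NaNs in each of the columns that will be queried."""
    return df[columns].isna().sum()


# Toy frame with NaNs in the queried float columns (hypothetical values).
df = pd.DataFrame({
    "a": [34.0, np.nan, 121.5],
    "b": [1.26, 1.30, np.nan],
    "x35": [0, 0, 1],
})
data_columns = ["a", "b", "x35"]

report = nan_report(df, data_columns)
print(report)

# The fix used here: fill every NaN with 0 before store.append.
filled = df.fillna(0)
```

Filling with 0 is a blunt choice: a row whose a was missing will now match conditions such as a < 10, so it is worth deciding per column whether 0, another sentinel, or dropping the row is the right replacement.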