python - A better solution for handling missing data in DataFrame objects exported from a relational database

Question

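A few days ago I posted a question, "How to speed up the pandas HDFStore 'put' operation". Thanks to Jeff's answer, I found a more efficient way to extract data from the db and store it in an hdf5 file.

However, with this approach I have to fill the missing data in every column according to its type, and I have to do this for every table (most of the time the work is repetitive). Otherwise, the None objects in the DataFrame cause a performance problem when I put the DataFrame into the hdf5 file.

Is there a better way to do this?

I just read this issue: "ENH: conversion to NaN/NaT when provided from SQL".

- Does NaT work with other types (besides datetime64)?
- Can I use it to replace all the None objects in a DataFrame without worrying about performance problems when storing the DataFrame in an hdf5 file?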

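Update 1

pandas version: 0.10.1
I currently use np.nan to fill the missing data, but I ran into two problems.
- A column that holds both np.nan and datetime.datetime objects cannot be converted to the 'datetime64[ns]' type, and putting it into the HDFStore raises an Exception.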

    In [155]: len(df_bugs.lastdiffed[df_bugs.lastdiffed.isnull()])
    Out[155]: 150

    In [156]: len(df_bugs.lastdiffed)
    Out[156]: 1003387

    In [158]: df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
     in ()
    ----> 1 df_bugs.lastdiffed.astype(df_bugs.creation_ts.dtype)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/series.pyc in astype(self, dtype)
        777 See numpy.ndarray.astype
        778 """
    --> 779 casted = com._astype_nansafe(self.values, dtype)
        780 return self._constructor(casted, index=self.index, name=self.name)
        781

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/core/common.pyc in _astype_nansafe(arr, dtype)
       1047 elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
       1048 # work around NumPy brokenness, #1987
    -> 1049 return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
       1050
       1051 return arr.astype(dtype)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/lib.so in pandas.lib.astype_intsafe (pandas/lib.c:11886)()

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/lib.so in util.set_value_at (pandas/lib.c:44436)()

    ValueError: Must be a datetime.date or datetime.datetime object


    # df_bugs_sample1 = df_bugs.ix[:10000]
    In [147]: %prun store.put('df_bugs_sample1', df_bugs_sample1, table=True)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in put(self, key, value, table, append, **kwargs)
        456 table
        457 """
    --> 458 self._write_to_group(key, value, table=table, append=append, **kwargs)
        459
        460 def remove(self, key, where=None, start=None, stop=None):

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
        786 raise ValueError('Compression not supported on non-table')
        787
    --> 788 s.write(obj = value, append=append, complib=complib, **kwargs)
        789 if s.is_table and index:
        790 s.create_index(columns = index)

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
       2489 # create the axes
       2490 self.create_axes(axes=axes, obj=obj, validate=append,
    -> 2491 min_itemsize=min_itemsize, **kwargs)
       2492
       2493 if not self.is_exists:

    /usr/local/lib/python2.6/dist-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
       2252 raise
       2253 except (Exception), detail:
    -> 2254 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
       2255 j += 1
       2256

    Exception: cannot find the correct atom type -> [dtype->object,items->Index([bug_file_loc, bug_severity, bug_status, cf_branch, cf_bug_source, cf_eta, cf_public_severity, cf_public_summary, cf_regression, cf_reported_by, cf_type, guest_op_sys, host_op_sys, keywords, lastdiffed, priority, rep_platform, resolution, short_desc, status_whiteboard, target_milestone], dtype=object)] object of type 'datetime.datetime' has no len()
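
For readers on a current pandas release, the failing `astype` call above has a more forgiving counterpart in `pd.to_datetime`. The following is only a minimal sketch under that assumption (it was not available in this form in 0.10.1), and the sample values are made up for illustration; the column name is borrowed from the traceback.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical stand-in for a column fetched from a database: real datetime
# objects mixed with missing values (None and np.nan), stored as dtype=object.
lastdiffed = pd.Series(
    [
        datetime.datetime(2013, 1, 5, 12, 0),
        None,
        np.nan,
        datetime.datetime(2013, 2, 1, 8, 30),
    ],
    dtype=object,
)

# pd.to_datetime maps both None and np.nan to NaT and returns a datetime64
# column instead of an object column.
converted = pd.to_datetime(lastdiffed)

print(converted.dtype)
print(converted.isna().sum())
```

NaT is the missing-value marker for datetime-like columns only; numeric columns keep using np.nan.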

- Other DataFrames do not seem to be put completely. In the sample below the DataFrame has 13742515 entries, but after I put it into the HDFStore and get it back, the number of entries changes to 1041998.

    In [123]: df_bugs_activity
    Out[123]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 13742515 entries, 0 to 13742514
    Data columns:
    added      13111366  non-null values
    attach_id  1041998  non-null values
    bug_id     13742515  non-null values
    bug_when   13742515  non-null values
    fieldid    13742515  non-null values
    id         13742515  non-null values
    removed    13612258  non-null values
    who        13742515  non-null values
    dtypes: datetime64[ns](1), float64(1), int64(4), object(2)


    In [121]: %time store.put('df_bugs_activity2', df_bugs_activity, table=True)

    CPU times: user 35.31 s, sys: 4.23 s, total: 39.54 s
    Wall time: 39.65 s

    In [122]: %time store.get('df_bugs_activity2')

    CPU times: user 7.56 s, sys: 0.26 s, total: 7.82 s
    Wall time: 7.84 s
    Out[122]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1041998 entries, 2012 to 13354656
    Data columns:
    added      1041981  non-null values
    attach_id  1041998  non-null values
    bug_id     1041998  non-null values
    bug_when   1041998  non-null values
    fieldid    1041998  non-null values
    id         1041998  non-null values
    removed    1041991  non-null values
    who        1041998  non-null values
    dtypes: datetime64[ns](1), float64(1), int64(4), object(2)

Update 2

The code that creates the DataFrame:

    import numpy as np
    import pandas.io.sql as psql
    from pandas import DataFrame

    # `conn` and `cur` are an already opened DB-API connection and its cursor.

    def grab_data(table_name, size_of_page=20000):
        '''
        Grab data from a db table

        size_of_page: the second argument of sql's limit clause
        '''
        cur.execute('select count(*) from %s' % table_name)
        records_number = cur.fetchone()[0]
        loop_number = records_number / size_of_page + 1
        print '****\nStart Grab %s\n****\nrecords_number: %s\nloop_number: %s' % (table_name, records_number, loop_number)

        start_position = 0
        df = DataFrame()  # WARNING: this DataFrame object will hold all records of a table, so be careful about memory usage!

        for i in range(0, loop_number):
            sql_export = 'select * from %s limit %s, %s' % (table_name, start_position, size_of_page)
            df = df.append(psql.read_frame(sql_export, conn), verify_integrity=False, ignore_index=True)

            start_position += size_of_page
            print 'start_position: %s' % start_position

        return df

    df_bugs = grab_data('bugs')
    df_bugs = df_bugs.fillna(np.nan)
    df_bugs = df_bugs.convert_objects()
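
A side note on the paging loop: appending to a DataFrame inside the loop copies all rows accumulated so far on every iteration. A possible alternative on current pandas, sketched here under the assumption that `pd.read_sql_query` with its `chunksize` parameter is available, is to collect the pages and concatenate them once. The in-memory SQLite table and the function name `grab_data_chunked` are hypothetical and exist only to make the sketch runnable.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bugs (bug_id INTEGER, short_desc TEXT)")
conn.executemany(
    "INSERT INTO bugs VALUES (?, ?)",
    [(i, "bug %d" % i) for i in range(45)],
)
conn.commit()


def grab_data_chunked(conn, table_name, size_of_page=20):
    """Read a whole table page by page and concatenate the pages once."""
    # NOTE: table_name is interpolated into the SQL text, so it must come
    # from trusted code, never from user input.
    sql = "SELECT * FROM %s" % table_name
    # With chunksize set, read_sql_query yields DataFrames of at most
    # size_of_page rows instead of returning one big DataFrame.
    pages = pd.read_sql_query(sql, conn, chunksize=size_of_page)
    return pd.concat(list(pages), ignore_index=True)


df = grab_data_chunked(conn, "bugs")
print(len(df))
```

The result still holds the whole table in memory, so the warning in the original code applies here as well.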

Structure of df_bugs:

    Int64Index: 1003387 entries, 0 to 1003386
    Data columns:
    alias                0  non-null values
    assigned_to          1003387  non-null values
    bug_file_loc         498160  non-null values
    bug_id               1003387  non-null values
    bug_severity         1003387  non-null values
    bug_status           1003387  non-null values
    category_id          1003387  non-null values
    cclist_accessible    1003387  non-null values
    cf_attempted         102160  non-null values
    cf_branch            691834  non-null values
    cf_bug_source        1003387  non-null values
    cf_build             357920  non-null values
    cf_change            324933  non-null values
    cf_doc_impact        1003387  non-null values
    cf_eta               7223  non-null values
    cf_failed            102123  non-null values
    cf_i18n_impact       1003387  non-null values
    cf_on_hold           1003387  non-null values
    cf_public_severity   1003387  non-null values
    cf_public_summary    587944  non-null values
    cf_regression        1003387  non-null values
    cf_reported_by       1003387  non-null values
    cf_reviewer          1003387  non-null values
    cf_security          1003387  non-null values
    cf_test_id           13475  non-null values
    cf_type              1003387  non-null values
    cf_viss              1423  non-null values
    component_id         1003387  non-null values
    creation_ts          1003387  non-null values
    deadline             0  non-null values
    delta_ts             1003387  non-null values
    estimated_time       1003387  non-null values
    everconfirmed        1003387  non-null values
    found_in_phase_id    1003387  non-null values
    found_in_product_id  1003387  non-null values
    found_in_version_id  1003387  non-null values
    guest_op_sys         1003387  non-null values
    host_op_sys          1003387  non-null values
    keywords             1003387  non-null values
    lastdiffed           1003237  non-null values
    priority             1003387  non-null values
    product_id           1003387  non-null values
    qa_contact           1003387  non-null values
    remaining_time       1003387  non-null values
    rep_platform         1003387  non-null values
    reporter             1003387  non-null values
    report_accessible    1003387  non-null values
    resolution           1003387  non-null values
    short_desc           1003387  non-null values
    status_whiteboard    1003387  non-null values
    target_milestone     1003387  non-null values
    votes                1003387  non-null values
    dtypes: datetime64[ns](2), float64(10), int64(19), object(21)

Update 3

Writing to csv and reading back from csv:

    In [184]: df_bugs.to_csv('df_bugs.sv')
    In [185]: df_bugs_from_scv = pd.read_csv('df_bugs.sv')
    In [186]: df_bugs_from_scv
    Out[186]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1003387 entries, 0 to 1003386
    Data columns:
    Unnamed: 0           1003387  non-null values
    alias                0  non-null values
    assigned_to          1003387  non-null values
    bug_file_loc         0  non-null values
    bug_id               1003387  non-null values
    bug_severity         1003387  non-null values
    bug_status           1003387  non-null values
    category_id          1003387  non-null values
    cclist_accessible    1003387  non-null values
    cf_attempted         102160  non-null values
    cf_branch            345133  non-null values
    cf_bug_source        1003387  non-null values
    cf_build             357920  non-null values
    cf_change            324933  non-null values
    cf_doc_impact        1003387  non-null values
    cf_eta               7223  non-null values
    cf_failed            102123  non-null values
    cf_i18n_impact       1003387  non-null values
    cf_on_hold           1003387  non-null values
    cf_public_severity   1003387  non-null values
    cf_public_summary    588  non-null values
    cf_regression        1003387  non-null values
    cf_reported_by       1003387  non-null values
    cf_reviewer          1003387  non-null values
    cf_security          1003387  non-null values
    cf_test_id           13475  non-null values
    cf_type              1003387  non-null values
    cf_viss              1423  non-null values
    component_id         1003387  non-null values
    creation_ts          1003387  non-null values
    deadline             0  non-null values
    delta_ts             1003387  non-null values
    estimated_time       1003387  non-null values
    everconfirmed        1003387  non-null values
    found_in_phase_id    1003387  non-null values
    found_in_product_id  1003387  non-null values
    found_in_version_id  1003387  non-null values
    guest_op_sys         805088  non-null values
    host_op_sys          806344  non-null values
    keywords             532941  non-null values
    lastdiffed           1003237  non-null values
    priority             1003387  non-null values
    product_id           1003387  non-null values
    qa_contact           1003387  non-null values
    remaining_time       1003387  non-null values
    rep_platform         424213  non-null values
    reporter             1003387  non-null values
    report_accessible    1003387  non-null values
    resolution           922282  non-null values
    short_desc           1003287  non-null values
    status_whiteboard    0  non-null values
    target_milestone     423276  non-null values
    votes                1003387  non-null values
    dtypes: float64(12), int64(20), object(21)
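
Two of the differences in this round trip are expected from the CSV format itself: the index comes back as an extra column, and the two datetime64 columns are no longer listed in the dtypes. A minimal sketch of how current pandas can restore both, assuming the `index_col` and `parse_dates` parameters of `pd.read_csv` and a made-up miniature of `df_bugs`:

```python
import io

import pandas as pd

# Hypothetical miniature of df_bugs: one int, one datetime, one string column.
df = pd.DataFrame({
    "bug_id": [1, 2, 3],
    "creation_ts": pd.to_datetime(["2013-01-05", "2013-01-06", None]),
    "short_desc": ["crash", "hang", None],
})

buf = io.StringIO()
df.to_csv(buf)  # the index is written as an unnamed first column
buf.seek(0)

# index_col=0 restores the index instead of creating an "Unnamed: 0" column;
# parse_dates converts the listed columns back to datetime64.
restored = pd.read_csv(buf, index_col=0, parse_dates=["creation_ts"])

print(restored.dtypes)
```

This does not explain the string columns whose non-null counts dropped after the round trip; that part of the output needs a separate look.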

score 1 · Accepted Answer

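I'll answer my own question. Thanks to Jeff for the help.

First, the second problem in Update 1 ("Other DataFrames do not seem to be put completely") has been fixed.

The biggest problem I ran into was handling columns that contain both Python datetime objects and None objects. Fortunately, starting from 0.11-dev, pandas provides a more convenient way. I used the code below in my project and added comments to some lines. I hope it helps others :)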

    import numpy as np
    from pandas import DataFrame, Series

    # `cur` is an already opened DB-API cursor
    cur.execute('select * from table_name')
    result = cur.fetchall()

    # For details: http://www.python.org/dev/peps/pep-0249/#description
    db_description = cur.description
    columns = [col_desc[0] for col_desc in db_description]

    # As the pandas docs say, `coerce_float` attempts to convert values of
    # non-string, non-numeric objects (like decimal.Decimal) to floating point.
    # It is a parameter of `DataFrame.from_records`, not of the constructor.
    df = DataFrame.from_records(result, columns=columns, coerce_float=True)

    # Deal with the missing data
    for column_name in df.columns:
        # Currently, calling `fillna(np.nan)` on a `datetime64[ns]` column raises an exception
        if df[column_name].dtype.str != '<M8[ns]':
            # fillna returns a new Series, so the result must be assigned back
            df[column_name] = df[column_name].fillna(np.nan)

    # Convert columns that hold both np.nan and datetime objects from 'object'
    # to 'datetime64[ns]' ('<M8[ns]' for short).
    # Find the table columns whose type is Date or Datetime
    # (the type codes 10 and 12 are specific to the database driver).
    column_name_type_tuple = [column[:2] for column in db_description if column[1] in (10, 12)]
    # Keep the ones whose dtype is still 'object'
    columns_need_conv = [column_name for column_name, column_type in column_name_type_tuple if str(df[column_name].dtype) == 'object']

    # Do the type conversion
    for column_name in columns_need_conv:
        df[column_name] = Series(df[column_name].values, dtype='M8[ns]')

    df = df.convert_objects()

After this, df should be suitable for storing in an h5 file, and 'pickle' is no longer needed.
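
To make the effect of `coerce_float` concrete, here is a small sketch. It assumes a current pandas release, where `coerce_float` is a parameter of `DataFrame.from_records`, and uses made-up rows in place of a real `cur.fetchall()` result.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical rows as a DB-API cursor.fetchall() might return them;
# many drivers hand back DECIMAL columns as decimal.Decimal objects.
rows = [(1, Decimal("1.50")), (2, Decimal("2.25")), (3, Decimal("0.10"))]
columns = ["id", "estimated_time"]

# Without coerce_float the Decimal values stay Python objects.
as_objects = pd.DataFrame.from_records(rows, columns=columns)

# With coerce_float=True they are converted to a float64 column.
as_floats = pd.DataFrame.from_records(rows, columns=columns, coerce_float=True)

print(as_objects["estimated_time"].dtype)
print(as_floats["estimated_time"].dtype)
```

Keeping numeric columns out of the object dtype matters for the HDFStore step, since the second traceback in the question shows the object block being rejected.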

PS:

Some profiling results:

complib: 'lzo', complevel: 1

table1, 7,810,561 records with 2 int columns and 1 datetime column: the put operation costs 49 s

table2, 1,008,794 records with 4 datetime columns, 4 float64 columns, 19 int columns and 24 object (string) columns: the put operation costs 170 s
