performance - Pandas performance issue with DataFrame column "rename" and "drop"

Question

Below is the line_profiler record for a function:

Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s

File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1068                                           @profile
  1069                                           def _rpt_join(dfa, dfb, join_type='inner'):
  1070                                               ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
  1071                                                   'join_type' can be 'inner' or 'outer'
  1072                                               '''
  1073                                           
  1074         2           56     28.0      0.0      try:    # ('STK_ID','RPT_Date') are normal column
  1075         2      2936668 1468334.0     43.7          rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
  1076                                               except: # ('STK_ID','RPT_Date') are index
  1077                                                   rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
  1078                                                   
  1079                                           
  1080         2           81     40.5      0.0      try: # handle 'STK_Name
  1081         2       426472 213236.0      6.3          name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
  1082                                                   
  1083                                                   
  1084         2       900584 450292.0     13.4          nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
  1085                                                   
  1086         2      1138140 569070.0     16.9          rst.STK_Name_x = nameseries
  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)
  1089                                               except:
  1090                                                   pass
  1091                                           
  1092         2           94     47.0      0.0      return rst

What surprised me are these two lines:

  1087         2       596768 298384.0      8.9          rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293 361146.5     10.7          rst = rst.drop(['STK_Name_y'], axis=1)

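Why do the simple DataFrame column "rename" and "drop" operations cost so much time (8.9% + 10.7%)? After all, the "merge" operation itself accounts for only 43.7%, and "rename"/"drop" do not look like computationally intensive operations. How can this be improved?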
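
One possible direction, offered as a sketch rather than a measured fix: in many pandas versions, `rename` and `drop` return a new DataFrame by default instead of modifying the existing one, so each call may copy the merged data. The variant below sidesteps both calls by leaving `STK_Name` out of the merge, so the `STK_Name_x` / `STK_Name_y` columns are never created. The function name `rpt_join_sketch` is hypothetical, it assumes both frames are already indexed by ('STK_ID', 'RPT_Date') and both contain a `STK_Name` column, and it uses current pandas idioms (`~` and `Index.duplicated`) rather than the ones in the profiled code. Whether it is actually faster depends on the pandas version and the data, so it would need to be profiled the same way.

```python
import pandas as pd


def rpt_join_sketch(dfa, dfb, join_type="inner"):
    """Hypothetical variant of _rpt_join for frames indexed by (STK_ID, RPT_Date)."""
    # One name per index label, preferring dfa's value when both frames have it.
    names = pd.concat([dfa["STK_Name"], dfb["STK_Name"]])
    names = names[~names.index.duplicated(keep="first")]

    # Merge only the non-name columns, so no STK_Name_x / STK_Name_y appear
    # and no rename/drop is needed on the merged result.
    left = dfa[[c for c in dfa.columns if c != "STK_Name"]]
    right = dfb[[c for c in dfb.columns if c != "STK_Name"]]
    rst = pd.merge(left, right, how=join_type, left_index=True, right_index=True)

    # Attach the combined names, aligned to the merged index.
    rst["STK_Name"] = names.reindex(rst.index)
    return rst
```

Note that the column selection on `dfa` and `dfb` still creates new (smaller) frames, so this moves work rather than eliminating it; the point is only that the post-merge `rename` and `drop` on the wide result are no longer needed.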
