python - Cumulative OLS with Python Pandas

Question

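I am using Pandas 0.8.1, and at the moment I can't change the version. If a newer version would help with the problem below, please note it in a comment rather than an answer. Also, this is for a research replication project, so even though re-running a regression after appending only one new data point may be silly (if the data set is large), I still have to do it. Thanks!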

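In Pandas, there is a rolling option for the window_type argument to pandas.ols, but it seems implicit that this requires choosing a window size or using the whole data sample as the default. I'm looking instead to use all of the data in a cumulative fashion.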

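I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using the data available from the minimum date up through the date at index i. So the window effectively grows by one on every iteration, all data is used cumulatively from the earliest observation, and no data is ever dropped from the window.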

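I have written a function (below) that works with apply to do this, but it is unacceptably slow. Is there instead a way to use pandas.ols to perform this sort of cumulative regression directly?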

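Here are some more specifics about my data. I have a pandas.DataFrame containing a column of identifiers, a column of dates, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group on the identifier, and then, for every time period, perform a cumulative regression of the left-hand-side variable on the right-hand-side variable.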

Here is the function I can use with apply on the identifier-grouped object:

import numpy as np
import pandas

def cumulative_ols(
                   data_frame, 
                   lhs_column, 
                   rhs_column, 
                   date_column,
                   min_obs=60
                  ):

    beta_dict = {}
    for dt in data_frame[date_column].unique():
        cur_df = data_frame[data_frame[date_column] <= dt]
        obs_count = cur_df[lhs_column].notnull().sum()

        if min_obs <= obs_count:
            beta = pandas.ols(
                              y=cur_df[lhs_column],
                              x=cur_df[rhs_column],
                             ).beta.ix['x']
            ###
        else:
            beta = np.nan
        ###
        beta_dict[dt] = beta
    ###

    beta_df = pandas.DataFrame(pandas.Series(beta_dict, name="FactorBeta"))
    beta_df.index.name = date_column
    return beta_df

score 1 · Accepted Answer

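Following the advice in the comments, I wrote my own function that can be used with apply and that relies on cumsum to accumulate all of the individual terms needed to express the coefficient of a univariate OLS regression in vectorized form.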

import numpy as np
import pandas

def cumulative_ols(
                   data_frame,
                   lhs_column,
                   rhs_column,
                   date_column,
                   min_obs=60,
                  ):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work    
    appropriately given sorted dates. It is also assumed that the date 
    conventions of the left-hand-side and right-hand-side variables have been 
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

    Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
    Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
    beta     ~ Cov(x,y) / Var(x)

    and the code makes a further simplification by cancelling one factor
    of (1/n) from both the numerator and the denominator.

    Notes: one easy improvement is to change the date column to a generic sort
    column since there's no special reason the regressions need to be time-
    series specific.
    """
    data_frame["xy"]         = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"]         = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"]       = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"]       = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"]   = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"]   = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"]  = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"]  = data_frame["x2"].cumsum()
    data_frame["cumsum_x"]   = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"]   = data_frame[lhs_column].fillna(0.0).cumsum()
    data_frame["cum_cov"]    = data_frame["cumsum_xy"] - (1.0/data_frame["cum_yobs"])*data_frame["cumsum_x"]*data_frame["cumsum_y"]
    data_frame["cum_x_var"]  = data_frame["cumsum_x2"] - (1.0/data_frame["cum_xobs"])*(data_frame["cumsum_x"])**2
    data_frame["FactorBeta"] = data_frame["cum_cov"]/data_frame["cum_x_var"]
    data_frame["FactorBeta"][data_frame["cum_yobs"] < min_obs] = np.nan
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols
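
The cancellation mentioned in the docstring can be written out explicitly. This is only a sketch in notation introduced here, not taken from the original code: n_t is the number of observations up to row t, and S_x, S_y, S_xy, S_xx are the running sums that correspond to the cumsum_x, cumsum_y, cumsum_xy and cumsum_x2 columns. It assumes that cum_xobs and cum_yobs agree, i.e. that no row has a value missing in only one of the two columns.

```latex
\begin{aligned}
S_x(t) &= \sum_{i \le t} x_i, \quad
S_y(t) = \sum_{i \le t} y_i, \quad
S_{xy}(t) = \sum_{i \le t} x_i y_i, \quad
S_{xx}(t) = \sum_{i \le t} x_i^2 \\[4pt]
\hat\beta_t
  &= \frac{\frac{1}{n_t} S_{xy}(t) - \frac{1}{n_t} S_x(t)\,\frac{1}{n_t} S_y(t)}
          {\frac{1}{n_t} S_{xx}(t) - \left(\frac{1}{n_t} S_x(t)\right)^{2}}
   = \frac{S_{xy}(t) - \frac{1}{n_t} S_x(t)\, S_y(t)}
          {S_{xx}(t) - \frac{1}{n_t} S_x(t)^{2}}
\end{aligned}
```

Multiplying the numerator and the denominator by n_t leaves the ratio unchanged, so the cum_cov and cum_x_var columns are n_t times the covariance and variance rather than the quantities themselves, while their ratio is still the slope.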

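I have verified on numerous test cases that this function matches the output of my earlier function and the output of NumPy's linalg.lstsq function. I haven't done a full timing benchmark, but anecdotally it is around 50 times faster in the cases I've been working on.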
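
A check of this kind might look like the sketch below. It is not the code used for the tests described above: it assumes a recent pandas and NumPy rather than pandas 0.8.1, and cumulative_beta and lstsq_beta are hypothetical helper names introduced only for illustration. Unlike the function above, the sketch drops a row from every running sum when either value is missing, which is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd


def cumulative_beta(x, y, min_obs=3):
    """Expanding-window OLS slope of y on x (with intercept) via running sums.

    Hypothetical helper: rows where either value is missing contribute
    nothing to any of the sums, so all terms share the same sample.
    """
    x = pd.Series(x, dtype=float).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    valid = x.notna() & y.notna()
    xv = x.where(valid, 0.0)  # zero out rows that are not fully observed
    yv = y.where(valid, 0.0)

    n = valid.astype(float).cumsum()   # running observation count
    sum_x = xv.cumsum()
    sum_y = yv.cumsum()
    sum_xy = (xv * yv).cumsum()
    sum_x2 = (xv ** 2).cumsum()

    n_safe = n.where(n > 0)            # NaN instead of dividing by zero
    cov = sum_xy - sum_x * sum_y / n_safe
    var = sum_x2 - sum_x ** 2 / n_safe
    beta = cov / var.where(var != 0)   # NaN where the slope is undefined
    return beta.where(n >= min_obs)    # NaN until enough observations


def lstsq_beta(x, y):
    """Slope from np.linalg.lstsq on the design matrix [x, 1]."""
    design = np.column_stack([x, np.ones(len(x))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0]


# Synthetic data, made up purely for the comparison.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(scale=0.1, size=50)

betas = cumulative_beta(x, y, min_obs=5)
# Refit from scratch on each growing prefix, as the slow version does.
expected = [lstsq_beta(x[: t + 1], y[: t + 1]) for t in range(4, 50)]
print(np.allclose(betas.iloc[4:], expected))
```

Refitting on every prefix costs a full least-squares solve per row, whereas the running sums are computed in a single pass, which is consistent with the speed-up reported above.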
