python - What is the fastest way to compute the Pearson coefficient between all columns of a large, sparse matrix?

Question

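Background

I have a sparse dataset similar to the Amazon Review Data. I want to compute the PCC (Pearson correlation coefficient) between all columns and save the result for later reuse. However, it takes a very long time to get the result.

For example, the matrix has about 800k columns and 300k rows, but in each column only 2 or 3 rows have values; the other rows are 0 (missing values).

Is it possible to get the PCC matrix within a reasonable amount of time?

What I have tried

I am using Python for this job. Here is what I have tried: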

1.

import pandas as pd

# Build the sparse DataFrame
dfs = pd.DataFrame(...)

# dfs.shape is (300k, 800k)

# Pairwise Pearson correlation between all columns
pcc = dfs.corr()

# Save `pcc`

2.

import numpy as np
from itertools import combinations

# Convert `dfs` to a dense (long-format) DataFrame `dfd`
# with one row per stored entry: (column_id, row_id, value)

vals = dfd.values
col_ids = np.unique(vals[:, 0]).tolist()

# Iterate over every pair of column indices.
# However, this takes about 2 BILLION iterations.
results = {}
for i, j in combinations(col_ids, 2):
    # Select the entries whose column_id equals `i` and `j`
    i_val = vals[vals[:, 0] == i]
    j_val = vals[vals[:, 0] == j]

    # Compute the PCC of `i_val` and `j_val`
    # (`pcc` is a helper function defined elsewhere)
    results[(i, j)] = pcc(i_val, j_val)
# Save all of `results` into a matrix

In Python, using a single process and a single thread, I simulated the `for` loop as follows:

import progressbar
import time

total = 2000000000
for i in progressbar.progressbar(range(total)):
    time.sleep(0.005) # The actual time is much larger than 0.005s

It would take about 200 days...
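As a rough sanity check of that figure, here is the arithmetic for the sleep time alone, assuming exactly 0.005 s per iteration and no other overhead. This gives a lower bound; loop and progress-bar overhead come on top of it, and the real per-pair computation is slower still.

```python
# Lower bound on the simulated run time: sleep time only, no loop overhead.
total_iterations = 2_000_000_000
seconds_per_iteration = 0.005  # assumed cost per iteration, as in the simulation

seconds = total_iterations * seconds_per_iteration
days = seconds / 86_400  # 86,400 seconds per day

print(f"{days:.1f} days")  # prints "115.7 days"
```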

Is there a solution?

So, could you help me solve this problem, or suggest a different angle to think about it from?
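To make the "different angle" concrete, here is a minimal sketch of one possible direction, not a tested solution for the full dataset: compute all pairwise correlations at once from the sparse Gram matrix `X.T @ X` instead of looping over column pairs. It assumes the data fits in a `scipy.sparse` matrix with rows as observations, and that zeros count as real zeros rather than missing values (if they are truly missing, the result would differ). The function name `sparse_column_corr` is hypothetical. The output here is dense, so it is only feasible for a block of columns: a dense 800k x 800k float64 matrix would need roughly 5 TB.

```python
import numpy as np
import scipy.sparse as sp


def sparse_column_corr(X):
    """Pearson correlation between all columns of a sparse matrix.

    Assumptions (not from the original post): rows are observations,
    zeros are treated as real zeros, and the number of columns is small
    enough that a dense (n_cols, n_cols) result fits in memory.
    """
    X = sp.csr_matrix(X, dtype=np.float64)
    n = X.shape[0]

    # Column sums, as a flat array of length n_cols
    col_sum = np.asarray(X.sum(axis=0)).ravel()

    # Gram matrix: entry (i, j) is sum_k X[k, i] * X[k, j].
    # The product itself is sparse; it is densified only here.
    gram = (X.T @ X).toarray()

    # Unnormalized covariance: sum(x*y) - sum(x)*sum(y)/n.
    # The 1/(n-1) factor cancels in the correlation.
    cov = gram - np.outer(col_sum, col_sum) / n

    std = np.sqrt(np.diag(cov))
    # Constant columns have zero variance and yield NaN entries
    with np.errstate(divide="ignore", invalid="ignore"):
        return cov / np.outer(std, std)
```

A small usage example comparing against `np.corrcoef` on a toy matrix:

```python
rng = np.random.default_rng(0)
dense = rng.random((50, 8)) * (rng.random((50, 8)) < 0.3)
dense[0, :] = np.arange(1, 9)  # make sure no column is constant

corr = sparse_column_corr(sp.csr_matrix(dense))
print(np.allclose(corr, np.corrcoef(dense, rowvar=False)))
```

For the full matrix, the same idea would have to be applied block by block over columns, keeping only the nonzero entries of each block; that part is left out of this sketch.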

200 DAYS of thanks!
