python - Pandas: computing a matrix of values

Question

I have a DataFrame like this:

        apple aple  apply
apple     0     0      0
aple      0     0      0
apply     0     0      0

I want to compute the distance between each pair of strings, for example apple -> aple. The final result should look like this:

        apple aple  apply
apple     0     32     14
aple      32    0      30
apply     14    30     0

This is the code I am currently using (but it is very slow on large data):

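from simhash import Simhash  # assumed: the simhash package from PyPI

columns = df.columns
for r in columns:
    for c in columns:
        # m holds the result matrix (presumably the DataFrame above)
        m[r][c] = Simhash(r).distance(Simhash(c))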

Can anyone help me compute the distances efficiently?

score 1 · Accepted Answer

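One idea: the output is symmetric, so by iterating over every pair you compute each pair twice. You can also skip comparing an element with itself. So, to at least reduce the number of computations, you could do something like the following: use itertools to compute the distance only for each unordered pair, then use pandas to fill in the rest.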

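from itertools import combinations
from collections import defaultdict

import pandas as pd
from simhash import Simhash

data = df.index

output = defaultdict(dict)

# Compute each unordered pair only once
for a, b in combinations(data, 2):
    output[a][b] = Simhash(a).distance(Simhash(b))
# The distance from a string to itself is 0
for a in data:
    output[a][a] = 0

df = pd.DataFrame(output)

# Fill the missing half of the matrix from the transpose
df = df.fillna(df.T)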

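You would need to test this on a larger frame, but it should be faster than what you are doing (n(n-1)/2 distance computations instead of n^2 for n strings) and give the same answer.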

In [84]: df
Out[84]: 
       aple  apple  apply
aple      0     32     30
apple    32      0     14
apply    30     14      0
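
A further saving that may be worth trying is to hash each string once instead of once per pair, and to write both halves of the matrix directly. The following is a minimal sketch of that idea, not a benchmarked solution. The names pairwise_matrix, toy_prepare and toy_distance are hypothetical, and the toy distance (size of the symmetric difference of the character sets) is only a stand-in so the sketch runs without a hashing library.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def pairwise_matrix(labels, prepare, distance):
    """Build a symmetric distance matrix, preparing each label only once."""
    labels = list(labels)
    # Run the per-string work (e.g. hashing) n times instead of once per pair.
    prepared = [prepare(label) for label in labels]
    n = len(labels)
    values = np.zeros((n, n), dtype=int)  # the diagonal stays 0
    for i, j in combinations(range(n), 2):
        d = distance(prepared[i], prepared[j])
        values[i, j] = d
        values[j, i] = d  # mirror, because the distance is symmetric
    return pd.DataFrame(values, index=labels, columns=labels)


# Hypothetical stand-in functions, used only to make the sketch runnable.
def toy_prepare(s):
    return set(s)


def toy_distance(a, b):
    return len(a ^ b)  # number of characters found in only one of the sets


result = pairwise_matrix(["apple", "aple", "apply"], toy_prepare, toy_distance)
print(result)
```

With the Simhash class from the question, the call would presumably be pairwise_matrix(df.index, Simhash, lambda a, b: a.distance(b)), assuming the same distance method used above.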
