python - k-means の距離を計算する際のラップアラウンド

Question

sklearn を使用して、一部のデータセットの K-means クラスタリングを実行しようとしています。問題は、次元の 1 つが 1 日の時間 (0 ～ 23 の数値) であることです。したがって、距離アルゴリズムは、0 が 23 から非常に離れていると見なします。これは、絶対的にはそうであるためです。実際には、私の目的では、時間 0 は時間 23 に非常に近いです。距離アルゴリズムに何らかの形式のラップアラウンドを実行させて、より「リアルな」時間差を計算する方法はありますか。次のような簡単なことをしています。

from sklearn.cluster import KMeans

clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)

data[22, 418, 192]elementsは、最初の要素が時間のように見えます。

何か案は？

score 3 · Accepted Answer

Even though @elyase answer is accepted, I think it is not the correct approach.

Yes, to use such distance you have to refine your distance measure and so - use different library. But what is more important - concept of mean used in k-means won't suit the cyclic dimension. Lets consider following example:

#current cluster X,, based on centroid position Xc=24
x1=1
x2=24

#current cluster Y, based on centroid position Yc=10
y1=12
y2=13

computing simple arithmetic mean will place the centoids in Xc=12.5,Yc=12.5, which from the point of view of cyclic meausre is incorect, it should be Xc=0.5,Yc=12.5. As you can see, asignment based on the cyclic distance measure is not "compatible" with simple mean operation, and leads to bizzare results.

Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k--means + distance measure result in degenerated super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}

Solving this problem requires checking one if (whether it is better to measure "simple average" or by representing one of the points as x'=x-24). Unfortunately given n points it makes 2^n possibilities.

This seems as a use case of the kernelized k-means, where you are actually clustering in the abstract feature space (in your case - a "tube" rolled around the time dimension) induced by kernel ("similarity measure", being the inner product of some vector space).

Details of the kernel k-means are given here

score 1 · Accepted Answer

k-means が任意の距離で機能しない理由

K-means は距離ベースのアルゴリズムではありません。

K-means は、分散の一種であるクラスター内の平方和を最小化します (これは、すべてのクラスターの加重平均分散であり、各オブジェクトとディメンションに同じ重みが与えられます)。

ロイズアルゴリズムを収束させるには、両方のステップで同じ関数を最適化する必要があります。

再割り当てステップ
重心の更新ステップ

現在、「平均」関数は最小二乗推定器です。つまり、ステップ 2 で平均値を選択することが、WCSS の目的にとって最適です。ステップ 1 で最小二乗偏差 (= 二乗ユークリッド距離、単調からユークリッド距離) によってオブジェクトを割り当てることでも、収束が保証されます。平均はまさに、ラップアラウンドのアイデアが崩壊する場所です。

@elyase によって提案されたランダムな他の距離関数をプラグインすると、k-means が収束しなくなる可能性があります。

適切な解決策

これにはさまざまな解決策があります。

K-medoid (PAM) を使用します。平均の代わりに medoid を選択すると、任意の距離での収束が保証されます。ただし、medoid の計算にはかなりのコストがかかります。
二乗和の最小化に満足しているカーネル空間にデータを変換します。たとえば、sin(hour / 12 * pi), cos(hour / 12 * pi)SSQ に適した時間に変換できます。
他の距離ベースのクラスタリングアルゴリズムを使用します。K-means は古く、それ以来、クラスタリングに関する多くの研究が行われてきました。階層的クラスタリング (実際には k-means と同じくらい古い) から始めて、DBSCAN とその変形を試してみてください。

score 0 · Accepted Answer

私にとって最も簡単なアプローチは、次元の「循環平均」を計算することにより、K-means アルゴリズムのラップアラウンド次元を適応させることです。もちろん、それに応じて重心までの距離の計算も変更する必要があります。

#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)

sin = np.sin(angles)
cos = np.cos(angles)

a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi

#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5

python - k-means の距離を計算する際のラップアラウンド

3 に答える 3

k-means が任意の距離で機能しない理由

適切な解決策

Related

Reference