python - How to get the flat clustering corresponding to the color clusters in a dendrogram created by scipy

Question

Using the code posted here, I made a nice hierarchical clustering:

[Image: scipy dendrogram]

Let's say the dendrogram on the left was created by doing something like this:

Y = sch.linkage(D, method='average') # D is a distance matrix
cutoff = 0.5*max(Y[:,2])
Z = sch.dendrogram(Y, orientation='right', color_threshold=cutoff)

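Now, how can I get the indices of the members of each of the colored clusters? To simplify the situation, ignore the clustering on the top and focus only on the dendrogram on the left of the matrix.

This information should be stored in the variable Z returned by dendrogram. There is a function called fcluster that should do just what I want (see the documentation here). However, I cannot see where I can give fcluster the same cutoff that I specified when creating the dendrogram. It seems that the threshold argument of fcluster, t, has to be expressed in terms of various obscure criteria (inconsistent, distance, maxclust, monocrit). Any ideas?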

score 19 · Accepted Answer

I think you're on the right track. Let's try this:

import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist
X = np.random.randn(100, 2)   # 100 2-dimensional observations
d = pdist(X)                  # vector of (100 choose 2) pairwise distances
L = sch.linkage(d, method='complete')
ind = sch.fcluster(L, 0.5*d.max(), 'distance')

ind gives you the cluster index for each of the 100 input observations. The result depends on the method you used in linkage. Try method='single', 'complete', and 'average', then note how ind differs.

Example:

In [59]: L = sch.linkage(d, method='complete')

In [60]: sch.fcluster(L, 0.5*d.max(), 'distance')
Out[60]: 
array([5, 4, 2, 2, 5, 5, 1, 5, 5, 2, 5, 2, 5, 5, 1, 1, 5, 5, 4, 2, 5, 2, 5,
       2, 5, 3, 5, 3, 5, 5, 5, 5, 5, 5, 5, 2, 2, 5, 5, 4, 1, 4, 5, 2, 1, 4,
       2, 4, 2, 2, 5, 5, 5, 2, 5, 5, 3, 5, 5, 4, 5, 4, 5, 3, 5, 3, 5, 5, 5,
       2, 3, 5, 5, 4, 5, 5, 2, 2, 5, 2, 2, 4, 1, 2, 1, 5, 2, 5, 5, 5, 1, 5,
       4, 2, 4, 5, 2, 4, 4, 2])

In [61]: L = sch.linkage(d, method='single')

In [62]: sch.fcluster(L, 0.5*d.max(), 'distance')
Out[62]: 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])

scipy.cluster.hierarchy sure is confusing. In your link, I don't even recognize my own code!
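To tie this back to the question, here is a minimal sketch that reuses the linkage method and cutoff from the dendrogram call and then groups observation indices by flat cluster. The random data and the names rng, flat, and members are assumptions made for illustration; they are not from the original post. The groups should line up with the colored branches as long as no merge height falls exactly on the cutoff. Single-observation clusters have no colored link in the plot, so they may appear in the dictionary without a matching color.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)          # illustrative data, not from the question
X = rng.standard_normal((30, 2))
D = pdist(X)                            # condensed pairwise distance vector

Y = sch.linkage(D, method='average')    # same method as used for the dendrogram
cutoff = 0.5 * Y[:, 2].max()            # same value passed as color_threshold

# Cut the tree at the height that was used to color the dendrogram
flat = sch.fcluster(Y, t=cutoff, criterion='distance')

# Map each flat-cluster id to the indices of its member observations
members = {int(c): np.flatnonzero(flat == c).tolist() for c in np.unique(flat)}
```

Each key of members is a cluster id assigned by fcluster (1-based), and each value lists the row indices of X that fall into that cluster.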

score 5 · Accepted Answer

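I wrote some code to decondense the linkage matrix. It returns a dictionary containing the indices of labels that are grouped at each agglomeration step. I have only tried it on the results of complete linkage clustering. The keys of the dict start at len(labels) because initially each label is treated as its own cluster. This might answer your question.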

import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(123)
labels = ['ID_0','ID_1','ID_2','ID_3','ID_4']

X = np.corrcoef(np.random.random_sample([5,3])*10)
row_clusters = linkage(X, method='complete')

def extract_levels(row_clusters, labels):
    clusters = {}
    for row in range(row_clusters.shape[0]):
        cluster_n = row + len(labels)
        # which clusters / labels are present in this row
        glob1, glob2 = row_clusters[row, 0], row_clusters[row, 1]

        # if this is a cluster, pull the cluster
        this_clust = []
        for glob in [glob1, glob2]:
            if glob > (len(labels)-1):
                this_clust += clusters[glob]
            # if it isn't, add the label to this cluster
            else:
                this_clust.append(glob)

        clusters[cluster_n] = this_clust
    return clusters

Calling extract_levels(row_clusters, labels) returns:

{5: [0.0, 2.0],
 6: [3.0, 4.0],
 7: [1.0, 0.0, 2.0],
 8: [3.0, 4.0, 1.0, 0.0, 2.0]}
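A similar mapping can probably be obtained from scipy's own tree representation. The sketch below is an alternative to the function above, not the answer author's method. It uses to_tree with rd=True and ClusterNode.pre_order to list the leaf indices under each merged cluster. The random data and the names Z, nodes, and levels are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)       # illustrative data only
X = rng.random((5, 3))
n = X.shape[0]
Z = linkage(X, method='complete')

# With rd=True, to_tree also returns a list of all 2n-1 nodes, indexed by node id
root, nodes = to_tree(Z, rd=True)

# Entries n .. 2n-2 are the merged clusters; pre_order() yields their leaf ids
levels = {node.id: sorted(node.pre_order()) for node in nodes[n:]}
```

Here the leaf indices come back as plain integers, and the last key (2n-2) always contains every observation.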

score 1 · Accepted Answer

I know this is very late to the game, but I made a plotting object based on the code from the post here. It is registered on pip, so to install it you just need to run

pip install pydendroheatmap

Check out the project's github page here: https://github.com/themantalope/pydendroheatmap

score 1 · Accepted Answer

You can also try cut_tree. It has a height parameter that should give you what you need for ultrametric trees.
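As a rough sketch of that suggestion, assuming the same average linkage and half-of-maximum cutoff as in the question (the random data and the names labels and members are illustrative):

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)           # illustrative data only
X = rng.standard_normal((30, 2))
Y = sch.linkage(pdist(X), method='average')
cutoff = 0.5 * Y[:, 2].max()

# cut_tree returns one column per requested height; ravel to a 1-D label array
labels = sch.cut_tree(Y, height=cutoff).ravel()

# Labels are 0-based here, unlike the 1-based ids returned by fcluster
members = {int(c): np.flatnonzero(labels == c).tolist() for c in np.unique(labels)}
```

Note that cut_tree expects an ultrametric tree when height is given, so it suits methods such as single, complete, or average linkage rather than ones that can produce inversions.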
