python - scipy linkage format

Question

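I have written my own clustering routine and would like to produce a dendrogram. The easiest way to do this would be to use the scipy dendrogram function. However, this requires the input to be in the same format that the scipy linkage function produces. I cannot find an example of how that output is formatted. I was wondering whether someone could enlighten me.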

score 48 · Accepted Answer

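I agree with https://stackoverflow.com/users/1167475/mortonjt that the documentation does not fully explain the indexing of intermediate clusters, while I agree with https://stackoverflow.com/users/1354844/dkar that the format is otherwise precisely explained.

Using the example data from this question: Tutorial for scipy.cluster.hierarchy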

import numpy as np
import scipy.cluster.hierarchy as hac
from numpy.linalg import norm

A = np.array([[0.1,   2.5],
              [1.5,   .4 ],
              [0.3,   1  ],
              [1  ,   .8 ],
              [0.5,   0  ],
              [0  ,   0.5],
              [0.5,   0.5],
              [2.7,   2  ],
              [2.2,   3.1],
              [3  ,   2  ],
              [3.2,   1.3]])

A linkage matrix can be built using the single method (i.e., the closest matching points):

z = hac.linkage(A, method="single")

 array([[  7.        ,   9.        ,   0.3       ,   2.        ],
        [  4.        ,   6.        ,   0.5       ,   2.        ],
        [  5.        ,  12.        ,   0.5       ,   3.        ],
        [  2.        ,  13.        ,   0.53851648,   4.        ],
        [  3.        ,  14.        ,   0.58309519,   5.        ],
        [  1.        ,  15.        ,   0.64031242,   6.        ],
        [ 10.        ,  11.        ,   0.72801099,   3.        ],
        [  8.        ,  17.        ,   1.2083046 ,   4.        ],
        [  0.        ,  16.        ,   1.5132746 ,   7.        ],
        [ 18.        ,  19.        ,   1.92353841,  11.        ]])

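As the documentation explains, clusters with an index below n (here: 11) are simply the data points of the original matrix A. The intermediate clusters formed from then on are indexed successively.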
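To make the indexing rule concrete, here is a small sketch that labels each row of the matrix shown above. The pairs are copied from its first two columns; the `describe` helper is a hypothetical name used only for illustration.

```python
n = 11  # number of original observations in A

# First two columns of the linkage matrix shown above
pairs = [(7, 9), (4, 6), (5, 12), (2, 13), (3, 14),
         (1, 15), (10, 11), (8, 17), (0, 16), (18, 19)]


def describe(idx, n):
    """Indices below n are original points; index n + i is the cluster formed in row i."""
    if idx < n:
        return f"point A[{idx}]"
    return f"cluster {idx} (formed in row {idx - n})"


for i, (left, right) in enumerate(pairs):
    print(f"row {i}: {describe(left, n)} + {describe(right, n)} -> cluster {n + i}")
```

Every index of 11 or more on the right-hand side of a merge points back to an earlier row, which is what makes the matrix readable as a tree.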

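Thus, clusters 7 and 9 (the first merge) are merged into cluster 11, and clusters 4 and 6 into cluster 12. Then look at the third row, which merges cluster 5 (from A) and cluster 12 (the intermediate cluster, not shown as a row of its own), resulting in a within-cluster distance (WCD) of 0.5. The single method entails that the new WCD is 0.5, which is the distance between A[5] and the closest point in cluster 12, i.e. A[4] or A[6]. Let's check: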

 In [198]: norm(A[5] - A[4])
 Out[198]: 0.70710678118654757
 In [199]: norm(A[5] - A[6])
 Out[199]: 0.5

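This cluster then becomes intermediate cluster 13, which is subsequently merged with A[2]. Thus, the new distance should be the smallest distance between A[2] and the points A[4], A[5] and A[6].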

 In [200]: norm(A[2] - A[4])
 Out[200]: 1.019803902718557
 In [201]: norm(A[2] - A[5])
 Out[201]: 0.58309518948452999
 In [202]: norm(A[2] - A[6])
 Out[202]: 0.53851648071345048

As can be seen, this also checks out, and it explains the intermediate format of new clusters.
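The same check can be automated for every row. The following is only a sketch, assuming single linkage with Euclidean distance as above; the `members` dictionary is a helper introduced here, not something scipy returns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

A = np.array([[0.1, 2.5], [1.5, 0.4], [0.3, 1.0], [1.0, 0.8],
              [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [2.7, 2.0],
              [2.2, 3.1], [3.0, 2.0], [3.2, 1.3]])
Z = linkage(A, method="single")
n = len(A)

# cluster index -> indices of the original points it contains
members = {i: [i] for i in range(n)}

for i, (left, right, dist, count) in enumerate(Z):
    left, right = int(left), int(right)
    # single linkage: distance between the two closest points of the two clusters
    closest = cdist(A[members[left]], A[members[right]]).min()
    assert np.isclose(closest, dist)
    # row i creates cluster n + i, containing the points of both children
    members[n + i] = members[left] + members[right]
    assert len(members[n + i]) == int(count)
```

If both assertions hold for every row, the third column matches the closest-point distance and the fourth column matches the member count.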

score 41 · Accepted Answer

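This is from the documentation of the scipy.cluster.hierarchy.linkage() function, and I think it is a pretty clear description of the output format:

An (n-1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n + i. A cluster with an index less than n corresponds to one of the original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

Do you need something more?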
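For the use case in the question (feeding the result of a custom clustering routine to dendrogram), a hand-written matrix in this format might look like the sketch below. The four observations and the merge distances are made up for illustration; `is_valid_linkage` and `dendrogram(..., no_plot=True)` are existing scipy functions.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, is_valid_linkage

# Hypothetical output of a custom clustering of n = 4 observations (indices 0..3).
# Columns: child index, child index, merge distance, number of points in the new cluster.
Z = np.array([
    [0, 1, 0.5, 2],   # points 0 and 1     -> cluster 4 (= n + 0)
    [2, 3, 0.8, 2],   # points 2 and 3     -> cluster 5 (= n + 1)
    [4, 5, 2.0, 4],   # clusters 4 and 5   -> cluster 6 (= n + 2)
], dtype=float)

# Sanity check of shape, index ranges and counts before plotting
print(is_valid_linkage(Z))

# no_plot=True computes the layout without drawing; drop it to draw with matplotlib
tree = dendrogram(Z, no_plot=True)
print(tree["leaves"])
```

Running `is_valid_linkage` first is a cheap way to catch index or count mistakes in a hand-built matrix before passing it on.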

score 12 · Accepted Answer

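The scipy documentation is accurate, as dkar pointed out... but it is a little difficult to turn the returned data into something that can be used for further analysis.

In my opinion, it should include the ability to return the data in a tree-like data structure. The code below iterates through the matrix and builds a tree: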

from scipy.cluster.hierarchy import linkage
import numpy as np

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
centers = np.concatenate((a, b),)

def create_tree(centers):
    clusters = {}
    to_merge = linkage(centers, method='single')
    n = len(centers)  # indices 0..n-1 are the original observations
    for i, merge in enumerate(to_merge):
        if merge[0] < n:
            # an original point: read it from the centers array (0-indexed)
            a = centers[int(merge[0])]
        else:
            # otherwise read the cluster that has already been created
            a = clusters[int(merge[0])]

        if merge[1] < n:
            b = centers[int(merge[1])]
        else:
            b = clusters[int(merge[1])]
        # row i of the linkage matrix forms the cluster with index n + i
        clusters[n + i] = {
            'children': [a, b]
        }
        # ^ you could optionally store other info here (e.g. distances)
    return clusters

print(create_tree(centers))
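As an aside, scipy does ship a helper for this: `scipy.cluster.hierarchy.to_tree` converts a linkage matrix into a tree of `ClusterNode` objects. The sketch below shows one way it could be used; the random sample data and the `as_dict` helper are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))      # made-up sample data
Z = linkage(points, method="single")

root = to_tree(Z)                     # root ClusterNode of the hierarchy


def as_dict(node):
    """Recursively convert a ClusterNode into nested dictionaries."""
    if node.is_leaf():
        return {"id": node.get_id()}
    return {
        "id": node.get_id(),
        "distance": node.dist,
        "children": [as_dict(node.get_left()), as_dict(node.get_right())],
    }


tree = as_dict(root)
print(tree["id"], root.get_count())
```

The node ids follow the same convention as the matrix: leaves are 0..n-1 and the root is 2n-2.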

score 0 · Accepted Answer

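Here is another piece of code that performs the same function. This version tracks the distance (size) of each cluster (node_id) and confirms the number of members.

It uses the scipy linkage() function, which is the same foundation as the Aggregator clusterer.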

from scipy.cluster.hierarchy import linkage
import numpy as np
import copy

# data_x: your (n_samples, n_features) array of observations
Z = linkage(data_x, 'ward')

n_points = data_x.shape[0]
# one leaf cluster per original observation; list position equals node_id
clusters = [dict(node_id=i, left=i, right=i, members=[i], distance=0, log_distance=0, n_members=1) for i in range(n_points)]
for z_i in range(Z.shape[0]):
    row = Z[z_i]
    cluster = dict(node_id=z_i + n_points, left=int(row[0]), right=int(row[1]), members=[], log_distance=np.log(row[2]), distance=row[2], n_members=int(row[3]))
    # the members of a merged cluster are the members of its two children
    cluster["members"].extend(copy.deepcopy(clusters[cluster["left"]]["members"]))
    cluster["members"].extend(copy.deepcopy(clusters[cluster["right"]]["members"]))
    clusters.append(cluster)

on_split = {c["node_id"]: [c["left"], c["right"]] for c in clusters}
up_merge = {c["left"]: {"into": c["node_id"], "with": c["right"]} for c in clusters}
up_merge.update({c["right"]: {"into": c["node_id"], "with": c["left"]} for c in clusters})

score 0 · Accepted Answer

[Figure: input and output]

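Consider the input to be the data for which you want to draw a dendrogram. When you call linkage, it returns a matrix with four columns.

Columns 1 and 2: represent the formation of clusters, in order.

That is, 2 and 3 form a cluster first, and this cluster is named 5
(2 and 3 refer to the rows at index 2 and 3). 1 and 5 then form the second cluster, which is named 6.

Column 3: represents the distance between the clusters.

Column 4: represents the number of data points involved in creating this cluster.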
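The merge order described above can be reproduced with a small example. The five one-dimensional points below are hypothetical, chosen so that rows 2 and 3 are closest and row 1 joins them next; they are not the data from the original figure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 5 observations (rows 0..4), one feature each
X = np.array([[0.0], [10.0], [12.0], [12.5], [30.0]])

Z = linkage(X, method="single")

n = len(X)
for i, (c1, c2, dist, count) in enumerate(Z):
    # columns 1 and 2: merged clusters; column 3: distance; column 4: number of points
    print(f"{int(c1)} and {int(c2)} form cluster {n + i} "
          f"(distance {dist}, {int(count)} points)")
```

With n = 5 observations, the first new cluster gets index 5 and the second gets index 6, matching the naming in the description.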

[Figure: dendrogram]
