python - キャパシテイテッド k-means クラスタリング?

Question

私はアルゴリズムと最適化の初心者です。capacitated k-means
を実装しようとしていますが、これまでのところ未解決で悪い結果になっています。これは、CVRP シミュレーションの一部として使用されます (capacitated vehicle routing problem)。参照されているアルゴリズムを間違って解釈するかどうか、興味があります。

参照: 「容量クラスタリング問題のための改善された K-Means アルゴリズム」(Geetha、Poonthalir、Vanathi)

シミュレートされた CVRP には 15 人の顧客がいて、1 つのデポがあります。
各顧客にはユークリッド座標 (x,y) と需要があります。
3台あり、それぞれ90台収容可能。

そのため、キャパシティテッド k-means は、15 人の顧客を 3 台の車両にクラスター化しようとします。各クラスターの総需要は車両容量を超えてはなりません。

アップデート：

参照されているアルゴリズムでは、コードが「次の最も近いセントロイド」を使い果たしたときにコードが何をしなければならないかについての情報をキャッチできませんでした。
つまり、以下のステップ 14.bcustomers[1]ですべての「最も近い重心」が調べられたとき、まだ割り当てられていません。

これにより、インデックス 1 の顧客が割り当てられなくなります。
注:customer[1]最大の需要 (30) を持つ顧客です。
Q: この条件が満たされた場合、コードは何をすべきですか?

これが参照されたアルゴリズムの私の解釈です。コードを修正してください。ありがとうございます。

指定nされた依頼者 (顧客)、n= customerCount、およびデポ
n 要求、
n 座標 (x,y)
クラスタ数を計算k= (すべての需要の合計) /vehicleCapacity
初期重心を選択する、
５．ａ． 5.b.に基づいて顧客をdemand降順にソート = d_customers、から最初の顧客を初期重心として
選択= ,kd_customerscentroids[0 .. k-1]
バイナリマトリックスbin_matrix、次元 = (customerCount) x (k)、
6.aを作成します。bin_matrixすべてゼロで埋める
WHILE ループを開始、条件 = WHILE not converged。
7.a.converged = False
FOR ループの開始、条件 = FOR each customers、
8.a. 顧客指数 = i
customers[i]からすべてへのユークリッド距離を計算するcentroids=> edist
9.a. edist昇順でソート、
9.b。距離が最も近いものを最初centroidに選択 =closest_centroid
start WILE ループ、条件 =while customers[i]はどのクラスターにも割り当てられていません。
他のすべての割り当てられていない顧客をグループ化する = G,
11.a. closest_centroidの重心と見なしGます。
12.a.Piのそれぞれcustomersの優先順位を計算しますG。
優先度Pi = (distance from customers[i] to closest_cent) / demand[i]
12.b. 優先度が最も高い顧客を選択しますPi。
12.c. 優先順位が最も高い顧客のインデックスはhpc
12.d です。Q: 最優先の顧客が見つからない場合、どうすればよいですか?
可能であれば割り当てcustomers[hpc]ます。 13.a. =の需要、 13.b。セントロイドのメンバーのすべての要求の合計 = , 13.c. .. 13.d. 13.eに割り当てます。更新、行インデックス = 、列インデックス = 、に設定。centroids[closest_centroid]
customers[hpc]d1
dtot
IF (d1 + dtot) <= vehicleCapacity, THEN
customers[hpc]centroids[closest_centroid]
bin_matrixhpcclosest_centroid1
IFcustomers[i]は (まだ)not assigned任意のクラスターに対して、THEN..
14.a. next nearest centroidから次に近い距離のを選択しますedist。
14.b. Q: 次に近いセントロイドがない場合、どうすればよいですか?
以前の行列と更新された行列 bin_matrix を比較して収束を計算します。
15.a. に変更がない場合はbin_matrix、を設定しconverged = Trueます。
new centroidsそれ以外の場合は、更新されたクラスターから計算します。
16.a. centroids' coordinates各クラスターのメンバーに基づいて新しい計算を行います。
16.b. sum_x=x-coordinateクラスター 16.cのすべての合計members。=クラスター内のすべての数、 16.d. クラスタの新しい重心= . 16.e. 同じ式で、クラスターの新しい重心を計算します = 。
num_ccustomers (members)
x-coordinatesum_x / num_c
y-coordinatesum_y / num_c
メインの WHILE ループを繰り返します。

私のコードは、ステップ 14.bで常に割り当てられていない顧客で終了します。
それは、customers[i]どの重心にもまだ割り当てられていないがあり、「次の最も近い重心」を使い果たしたときです。

そして、結果のクラスターは貧弱です。出力グラフ:

-写真では、星が重心、四角がデポです。
写真では、「1」とラベル付けされた顧客は、Demand=30 で、常にクラスターが割り当てられずに終了しました。

プログラムの出力、

k_cluster 3
idx [ 1 -1  1  0  2  0  1  1  2  2  2  0  0  2  0]
centroids [(22.6, 29.2), (34.25, 60.25), (39.4, 33.4)]
members [[3, 14, 12, 5, 11], [0, 2, 6, 7], [9, 8, 4, 13, 10]]
demands [86, 65, 77]

1 番目と 3 番目のクラスターは十分に計算されていません。
idxインデックス ' 1' は割り当てられていません ( -1)

Q: 私の解釈と実装の何が問題になっていますか?
修正、提案、ヘルプをいただければ幸いです。よろしくお願いします。

ここに私の完全なコードがあります:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# pastebin.com/UwqUrHhh
# output graph: i.imgur.com/u3v2OFt.png

import math
import random
from operator import itemgetter
from copy import deepcopy
import numpy
import pylab

# depot and customers, [index, x, y, demand]
depot = [0, 30.0, 40.0, 0]
customers = [[1, 37.0, 52.0, 7], \
             [2, 49.0, 49.0, 30], [3, 52.0, 64.0, 16], \
             [4, 20.0, 26.0, 9], [5, 40.0, 30.0, 21], \
             [6, 21.0, 47.0, 15], [7, 17.0, 63.0, 19], \
             [8, 31.0, 62.0, 23], [9, 52.0, 33.0, 11], \
             [10, 51.0, 21.0, 5], [11, 42.0, 41.0, 19], \
             [12, 31.0, 32.0, 29], [13, 5.0, 25.0, 23], \
             [14, 12.0, 42.0, 21], [15, 36.0, 16.0, 10]]
customerCount = 15
vehicleCount = 3
vehicleCapacity = 90
assigned = [-1] * customerCount

# number of clusters
k_cluster = 0
# binary matrix
bin_matrix = []
# coordinate of centroids
centroids = []
# total demand for each cluster, must be <= capacity
tot_demand = []
# members of each cluster
members = []
# coordinate of members of each cluster
xy_members = []

def distance(p1, p2):
    return math.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

# capacitated k-means clustering
# http://www.dcc.ufla.br/infocomp/artigos/v8.4/art07.pdf
def cap_k_means():
    global k_cluster, bin_matrix, centroids, tot_demand
    global members, xy_members, prev_members

    # calculate number of clusters
    tot_demand = sum([c[3] for c in customers])
    k_cluster = int(math.ceil(float(tot_demand) / vehicleCapacity))
    print 'k_cluster', k_cluster

    # initial centroids = first sorted-customers based on demand
    d_customers = sorted(customers, key=itemgetter(3), reverse=True)
    centroids, tot_demand, members, xy_members = [], [], [], []
    for i in range(k_cluster):
        centroids.append(d_customers[i][1:3])   # [x,y]

        # initial total demand and members for each cluster
        tot_demand.append(0)
        members.append([])
        xy_members.append([])

    # binary matrix, dimension = customerCount-1 x k_cluster
    bin_matrix = [[0] * k_cluster for i in range(len(customers))]

    converged = False
    while not converged:  # until no changes in formed-clusters
        prev_matrix = deepcopy(bin_matrix)

        for i in range(len(customers)):
            edist = []  # list of distance to clusters

            if assigned[i] == -1:  # if not assigned yet
                # Calculate the Euclidean distance to each of k-clusters
                for k in range(k_cluster):
                    p1 = (customers[i][1], customers[i][2]) # x,y
                    p2 = (centroids[k][0], centroids[k][1])
                    edist.append((distance(p1, p2), k))

                # sort, based on closest distance
                edist = sorted(edist, key=itemgetter(0))

            closest_centroid = 0    # first index of edist
            # loop while customer[i] is not assigned
            while assigned[i] == -1:  
                # calculate all unsigned customers (G)'s priority
                max_prior = (0, -1)   # value, index
                for n in range(len(customers)):
                    pc = customers[n]

                    if assigned[n] == -1:   # if unassigned
                        # get index of current centroid
                        c = edist[closest_centroid][1]
                        cen = centroids[c]     # x,y

                        # distance_cost / demand
                        p = distance((pc[1], pc[2]), cen) / pc[3]

                        # find highest priority
                        if p > max_prior[0]:
                            max_prior = (p, n)  # priority,customer-index

                # if highest-priority is not found, what should we do ???
                if max_prior[1] == -1:   
                    break

                # try to assign current cluster to highest-priority customer
                hpc = max_prior[1]    # index of highest-priority customer
                c = edist[closest_centroid][1]   # index of current cluster

                # constraint, total demand in a cluster <= capacity
                if tot_demand[c] + customers[hpc][3] <= vehicleCapacity:
                    # assign new member of cluster
                    members[c].append(hpc)   # add index of customer

                    xy = (customers[hpc][1], customers[hpc][2])  # x,y
                    xy_members[c].append(xy)

                    tot_demand[c] += customers[hpc][3]
                    assigned[hpc] = c   # update cluster to assigned-customer

                    # update binary matrix
                    bin_matrix[hpc][c] = 1

                # if customer is not assigned then,
                if assigned[i] == -1:
                    if closest_centroid < len(edist)-1:
                        # choose the next nearest centroid
                        closest_centroid += 1

                    # if run out of closest centroid, what must we do ???
                    else:
                        break   # exit without centroid ???

            # end while
        # end for

        # Calculate the new centroid from the formed clusters
        for j in range(k_cluster):
            xj = sum([cn[0] for cn in xy_members[j]])
            yj = sum([cn[1] for cn in xy_members[j]])
            xj = float(xj) / len(xy_members[j])
            yj = float(yj) / len(xy_members[j])
            centroids[j] = (xj, yj)

        # calculate converged
        converged = numpy.array_equal(numpy.array(prev_matrix), numpy.array(bin_matrix))
    # end while

def clustering():
    cap_k_means()

    # debug plot
    idx = numpy.array([c for c in assigned])
    xy = numpy.array([(c[1], c[2]) for c in customers])

    COLORS = ["Blue", "DarkSeaGreen", "DarkTurquoise", 
          "IndianRed", "MediumVioletRed", "Orange", "Purple"]

    for i in range(min(idx), max(idx)+1):
        clr = random.choice(COLORS)
        pylab.plot(xy[idx==i, 0], xy[idx==i, 1], color=clr, \
            linestyle='dashed', \
            marker='o', markerfacecolor=clr, markersize=8)
    pylab.plot(centroids[:][0], centroids[:][1], '*k', markersize=12)
    pylab.plot(depot[1], depot[2], 'sk', markersize=12)

    for i in range(len(idx)):
        pylab.annotate(str(i), xy[i])

    pylab.savefig('clust1.png')
    pylab.show()

    return idx

def main():
    idx = clustering()
    print 'idx', idx
    print 'centroids', centroids
    print 'members', members
    print 'demands', tot_demand

if __name__ == '__main__':
    main()

score 0 · Accepted Answer

すべての詳細を調べることなく、あなたが引用した論文は言う

 if ri is not assigned then
     choose the next nearest centroid
 end if

セクション 5 の最後のアルゴリズムで。

次に最も近い重心が存在する必要があります.2つが等距離である場合、どちらを選択しても問題ないと思います.

score 0 · Accepted Answer

固定サイズクラスタリングの一般的な問題は、出力で「スワップ」を識別できることが多いことです。この場合、クラスタ間で 2 つのポイントを交換すると、より良いソリューションが作成されます。

満杯ではない最も近いものを貪欲に選ぶのではなく、「ポイントを割り当てるクラスターを見つける」を割り当て問題として定式化することにより、参照された論文から制約付きk-meansアルゴリズムを改善できます。

これを解決する一般的な方法は、最小コストフローアルゴリズムを使用することです。この結果は、結果を改善する利用可能な「スワップ」がないことを保証します。

幸いなことに、誰かがすでにこれを実装しており、そのためのパッケージを作成しています: https://github.com/joshlk/k-means-constrained

Bradley, PS、Bennett, KP、およびDemiriz , A. (2000) を参照してください。制約付き k-means クラスタリング。

余談ですが、需要が非常に大きい顧客は、単一のデポに割り当てることができない可能性があるため、実行可能な解決策が見つかるまで、または需要を「分割」するまで、 kの値を増やす必要がある場合があります。複数のデポ間の顧客を許可する必要があります。

python - キャパシテイテッド k-means クラスタリング?

3 に答える 3

Related

Reference