r - Estimating the number of clusters using BIC and AIC in document clustering with Kmeans

Question

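I am trying to find the optimal value of 'k' for clustering a set of documents with the k-means algorithm. I wanted to use the 'AIC' and 'BIC' information criteria to select the best model. To find the optimal value of 'k', I followed this resource: sherrytowers.com/2013/10/24/k-means-clustering/

However, when I run the code I get the AIC and BIC graphs linked below, and I cannot interpret anything from them. My questions are:

Is my approach wrong? Can these measures (AIC, BIC) not be used for document clustering with k-means?
Or are 'AIC' and 'BIC' a valid way to find the number of clusters 'k', and the error is in my programming logic?

Here is my code: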

library(tm)
library(SnowballC)
corp <- Corpus(DirSource("/home/dataset/"), readerControl = list(blank.lines.skip=TRUE));  ## forming Corpus from document set 
corp <- tm_map(corp, stemDocument, language="english")
dtm <- DocumentTermMatrix(corp,control=list(wordLengths = c(1, Inf))) ## forming Document Term Matrix, keeping terms of any length
dtm_tfidf <- weightTfIdf(dtm)
m <- as.matrix(dtm_tfidf)
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)

kmax = 50

totwss = rep(0,kmax) # will be filled with the total within-cluster sum of squares for each k
kmfit = list() # create an empty list to hold the fitted models
for (i in 1:kmax){
  kclus = kmeans(m_norm,centers=i,iter.max=20)
  totwss[i] = kclus$tot.withinss
  kmfit[[i]] = kclus
}

kmeansAIC = function(fit){

  m = ncol(fit$centers)
  n = length(fit$cluster)
  k = nrow(fit$centers)
  D = fit$tot.withinss
  return(D + 2*m*k)
}
aic=sapply(kmfit,kmeansAIC)
plot(seq(1,kmax),aic,xlab="Number of clusters",ylab="AIC",pch=20,cex=2)


kmeansBIC = function(fit){

  m = ncol(fit$centers)
  n = length(fit$cluster)
  k = nrow(fit$centers)
  D = fit$tot.withinss
  return(D + log(n)*m*k)
}
bic=sapply(kmfit,kmeansBIC)
plot(seq(1,kmax),bic,xlab="Number of clusters",ylab="BIC",pch=20,cex=2)
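
To make the formulas explicit, this is what kmeansAIC and kmeansBIC compute as written. It is only a transcription of the code above, not a claim that these are the right criteria for this data. Here D_k is fit$tot.withinss for the fit with k clusters (the subscript is my notation), m = ncol(fit$centers) is the number of terms, k = nrow(fit$centers) is the number of clusters, and n = length(fit$cluster) is the number of documents.

```latex
\begin{align}
\mathrm{AIC}(k) &= D_k + 2\,m\,k \\
\mathrm{BIC}(k) &= D_k + \ln(n)\,m\,k
\end{align}
```

Note that n is computed in kmeansAIC but never used, so the only difference between the two functions is the penalty factor, 2 versus log(n).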

These are the generated graphs: http://snag.gy/oAfhk.jpg http://snag.gy/vT8fZ.jpg
