r - Is there a better way to do hierarchical clustering in R?

Question

I would like to do hierarchical clustering by rows and then by columns. I came up with this total hack of a solution:

#! /path/to/my/Rscript --vanilla
args <- commandArgs(TRUE)
mtxf.in <- args[1]
clusterMethod <- args[2]
mtxf.out <- args[3]

mtx <- read.table(mtxf.in, as.is=T, header=T, stringsAsFactors=T)

mtx.hc <- hclust(dist(mtx), method=clusterMethod)
mtx.clustered <- as.data.frame(mtx[mtx.hc$order,])
mtx.c.colnames <- colnames(mtx.clustered)
rownames(mtx.clustered) <- mtx.clustered$topLeftColumnHeaderName
mtx.clustered$topLeftColumnHeaderName <- NULL
mtx.c.t <- as.data.frame(t(mtx.clustered), row.names=names(mtx))
mtx.c.t.hc <- hclust(dist(mtx.c.t), method=clusterMethod)
mtx.c.t.c <- as.data.frame(mtx.c.t[mtx.c.t.hc$order,])
mtx.c.t.c.t <- as.data.frame(t(mtx.c.t.c))
mtx.c.t.c.t.colnames <- as.vector(names(mtx.c.t.c.t))
names(mtx.c.t.c.t) <- mtx.c.colnames[as.numeric(mtx.c.t.c.t.colnames) + 1]

write.table(mtx.c.t.c.t, file=mtxf.out, sep='\t', quote=F, row.names=T)

The variables mtxf.in and mtxf.out are the input matrix file and the clustered output matrix file, respectively. The variable clusterMethod is one of the hclust methods, such as single or average.

As an example input, here is a data matrix:

topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
row1    0       3       0       0       0       3
row2    6       6       6       6       6       6
row3    0       3       0       0       0       3
row4    6       6       6       6       6       6
row5    0       3       0       0       0       3
row6    0       3       0       0       0       3

When I run this script, the top-left corner element from mtxf.in is lost. Here is the output the script produces:

col5    col4    col1    col3    col2    col6
row6    0       0       0       0       3       3
row5    0       0       0       0       3       3
row1    0       0       0       0       3       3
row3    0       0       0       0       3       3
row2    6       6       6       6       6       6
row4    6       6       6       6       6       6

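My questions: besides looking for a way to preserve the original structure of the input matrix file, I also don't know how much memory this consumes, or whether there is a faster, cleaner, more "R"-like way to do it.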

Is it really this hard to cluster by rows and columns in R? Are there constructive ways to improve this script? Thanks for your advice.

score 5 · Accepted Answer

Once the data is cleaned up (i.e., the first column is removed), this really only takes three lines of code.

Clean up the data (assign row names from the first column, then drop the first column):

dat <- mtx  # the data frame read from mtxf.in in the question
rownames(dat) <- dat[, 1]
dat <- dat[, -1]

Cluster and reorder:

row.order <- hclust(dist(dat))$order
col.order <- hclust(dist(t(dat)))$order

dat[row.order, col.order]

Result:

     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6
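The question also asks how to keep the top-left header when writing the result. One possible approach, sketched below under some assumptions, is to leave the label column in the data frame, reorder around it, and write the table without row names. This is not part of the original answer: `cluster_matrix` is a hypothetical helper name, and the sample data is inlined with `read.table(text = ...)` purely for illustration.

```r
# Hypothetical helper: cluster rows and columns of a data frame whose
# first column holds the row labels, keeping that column (and its header).
cluster_matrix <- function(dat, method = "complete") {
  vals <- dat[, -1, drop = FALSE]                      # numeric part only
  row.order <- hclust(dist(vals), method = method)$order
  col.order <- hclust(dist(t(vals)), method = method)$order
  # Column 1 (labels) stays first; value columns are offset by one
  out <- dat[row.order, c(1, col.order + 1), drop = FALSE]
  rownames(out) <- NULL
  out
}

# Sample input from the question, inlined for illustration
mtx <- read.table(text = "topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3", header = TRUE, stringsAsFactors = FALSE)

clustered <- cluster_matrix(mtx)

# row.names = FALSE keeps the header aligned, because the labels are
# written as an ordinary first column
out.file <- tempfile(fileext = ".tsv")
write.table(clustered, file = out.file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Because the labels travel as a regular column, the output file has the same shape as the input, and `method` can be set from `clusterMethod` in the original script.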

score 0 · Accepted Answer

Honestly, I'm not entirely sure why you're doing some of the things you're doing, so it's quite possible that I've misunderstood what you're looking for. If I'm off base, let me know and I'll delete this answer.

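But I think your life will be much easier (and the results will actually be correct) if you read the data in using row.names = 1 to indicate that the first column really holds the row names. For example: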

#Read the data in
d1 <- read.table(textConnection("topLeftColumnHeaderName col1    col2    col3    col4    col5    col6
 row1    0       3       0       0       0       3
 row2    6       6       6       6       6       6
 row3    0       3       0       0       0       3
 row4    6       6       6       6       6       6
 row5    0       3       0       0       0       3
 row6    0       3       0       0       0       3"),
   sep = "",as.is = TRUE,header = TRUE,
   stringsAsFactors = TRUE,row.names = 1)

#So d1 looks like this: 
d1
     col1 col2 col3 col4 col5 col6
row1    0    3    0    0    0    3
row2    6    6    6    6    6    6
row3    0    3    0    0    0    3
row4    6    6    6    6    6    6
row5    0    3    0    0    0    3
row6    0    3    0    0    0    3

#Simple clustering based on rows 
clus1 <- hclust(dist(d1))
d2 <- d1[clus1$order,]
d2
     col1 col2 col3 col4 col5 col6
row6    0    3    0    0    0    3
row5    0    3    0    0    0    3
row1    0    3    0    0    0    3
row3    0    3    0    0    0    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6

#Now cluster on columns and display the result 
clus2 <- hclust(dist(t(d2)))
t(t(d2)[clus2$order,])
     col5 col4 col1 col3 col2 col6
row6    0    0    0    0    3    3
row5    0    0    0    0    3    3
row1    0    0    0    0    3    3
row3    0    0    0    0    3    3
row2    6    6    6    6    6    6
row4    6    6    6    6    6    6
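The two-step version above clusters the columns of the already reordered `d2`. Since the Euclidean distance between two columns does not depend on the order of the rows, the column ordering should come out the same when computed from `d1` directly, which is why both orderings can be applied in a single indexing step. A small check, as a sketch on the sample data only (the names `col.from.d1` and `col.from.d2` are introduced here for illustration):

```r
d1 <- read.table(text = "topLeftColumnHeaderName col1 col2 col3 col4 col5 col6
row1 0 3 0 0 0 3
row2 6 6 6 6 6 6
row3 0 3 0 0 0 3
row4 6 6 6 6 6 6
row5 0 3 0 0 0 3
row6 0 3 0 0 0 3", header = TRUE, row.names = 1)

row.order <- hclust(dist(d1))$order

# Column ordering from the original data and from the row-reordered data
col.from.d1 <- hclust(dist(t(d1)))$order
col.from.d2 <- hclust(dist(t(d1[row.order, ])))$order
same.order <- identical(col.from.d1, col.from.d2)

# One-step reordering versus the transpose-twice version
one.step <- as.matrix(d1[row.order, col.from.d1])
two.step <- t(t(d1[row.order, ])[col.from.d2, ])
same.result <- isTRUE(all.equal(one.step, two.step))
```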

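Since you tagged this code-review, I'll also point out a style issue: many R folks prefer not to use T and F for Boolean values, because they can be masked, and write TRUE and FALSE instead.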
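To make the masking point concrete, here is a short illustration. `T` and `F` are ordinary variables that happen to be bound to `TRUE` and `FALSE`, so any code can reassign them, whereas `TRUE` and `FALSE` are reserved words. The variable names `flag.before`, `flag.masked`, and `flag.after` are made up for this example.

```r
flag.before <- isTRUE(T)   # T is a regular variable in base, bound to TRUE
T <- 0                     # legal: this masks base's T
flag.masked <- isTRUE(T)   # the abbreviation no longer means TRUE
rm(T)                      # drop the mask so base's T is visible again
flag.after <- isTRUE(T)

# By contrast, `TRUE <- 0` is rejected by R because TRUE is a reserved word.
c(before = flag.before, masked = flag.masked, after = flag.after)
```

An argument such as `header=T` would silently change meaning while the mask is in place, which is the reason for spelling out `TRUE` and `FALSE`.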
