r - How to identify sequences within each cluster?

Question

Using the biofam dataset provided as part of TraMineR:

library(TraMineR)
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
head(biofam.seq)
     Sequence                                    
1167 P-P-P-P-P-P-P-P-P-LM-LMC-LMC-LMC-LMC-LMC-LMC
514  P-L-L-L-L-L-L-L-L-L-L-LM-LMC-LMC-LMC-LMC    
1013 P-P-P-P-P-P-P-L-L-L-L-L-LM-LMC-LMC-LMC      
275  P-P-P-P-P-L-L-L-L-L-L-L-L-L-L-L             
2580 P-P-P-P-P-L-L-L-L-L-L-L-L-LMC-LMC-LMC       
773  P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P

I can perform a cluster analysis:

library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))

However, in the process, the unique ids from biofam.seq have been replaced by a list of numbers from 1 to N:

head(cluster3, 10)
[1] Type 1 Type 2 Type 2 Type 2 Type 2 Type 3 Type 3 Type 2 Type 1
[10] Type 2
Levels: Type 1 Type 2 Type 3

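Now, I want to know which sequences are within each cluster, so that I can apply other functions to get the mean length, entropy, subsequences, dissimilarity, etc. within each cluster. What I need to do is:

1. Map the old ids to the new ids
2. Insert the sequences of each cluster into separate sequence objects
3. Run the statistics I want on each of the new sequence objects

How can I accomplish 2 and 3 in the list above?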

score 1 · Accepted Answer

I think this answers your question. I used the code I found here http://www.bristol.ac.uk/cmm/software/support/workshops/materials/solutions-to-r.pdf to create biofam.seq.

# create data
library(TraMineR)
data(biofam)
bf.states  <- c("Parent", "Left", "Married", "Left/Married", "Child",
                "Left/Child", "Left/Married/Child", "Divorced")
bf.shortlab <- c("P","L","M","LM","C","LC", "LMC", "D")
biofam.seq  <- seqdef(biofam[, 10:25], states = bf.shortlab,
                                       labels = bf.states)

# cluster
library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))

First, use split to create a list of indices for each cluster, then use that list in an lapply loop to create a list of sub-sequences from biofam.seq.

# create a list of sequences
idx.list <- split(seq_len(nrow(biofam)), cluster3)
seq.list <- lapply(idx.list, function(idx) biofam.seq[idx, ])

Finally, you can run analyses on each sub-sequence using lapply or sapply.

# compute statistics on each sub-sequence (just an example)
cluster.sizes <- sapply(seq.list, FUN = nrow)

In general, FUN can be any function that you would run on a single sequence object.
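As a rough sketch of what that could look like for the statistics mentioned in the question, the block below rebuilds the objects from this answer and then computes the cluster size, mean sequence length, and mean within-sequence entropy per cluster. The name cluster.stats is made up for illustration; seqlength and seqient are TraMineR functions, and seqient is assumed to be left at its default of normalized entropy.

```r
library(TraMineR)
library(cluster)

# Rebuild the sequence object and the 3-cluster solution used in this answer
data(biofam)
bf.shortlab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = bf.shortlab)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- factor(cutree(clusterward, k = 3),
                   labels = c("Type 1", "Type 2", "Type 3"))

# One sequence object per cluster
idx.list <- split(seq_len(nrow(biofam.seq)), cluster3)
seq.list <- lapply(idx.list, function(idx) biofam.seq[idx, ])

# cluster.stats is a hypothetical name: one row of summaries per cluster
cluster.stats <- data.frame(
  n            = sapply(seq.list, nrow),
  mean.length  = sapply(seq.list, function(s) mean(seqlength(s))),
  mean.entropy = sapply(seq.list, function(s) mean(seqient(s)))
)
print(cluster.stats)
```

Any other function that accepts a state sequence object can be swapped in the same way; use lapply instead of sapply when the function returns something larger than a single number per cluster.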

score 1 · Accepted Answer

The state sequence object of, for example, the first cluster can be obtained simply with:

bio1.seq <- biofam.seq[cluster3=="Type 1",]
summary(bio1.seq)
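For the first item in the question (mapping the old ids to the new ones), one possible approach is sketched below. It assumes that the cluster memberships are returned in the same row order as biofam.seq, so the original ids can be reattached by position from rownames(biofam.seq). The names membership and ids.type1 are made up for illustration.

```r
library(TraMineR)
library(cluster)

# Rebuild the sequence object and the 3-cluster solution from the question
data(biofam)
lab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
biofam.seq <- seqdef(biofam[, 10:25], states = lab)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- factor(cutree(clusterward, k = 3),
                   labels = c("Type 1", "Type 2", "Type 3"))

# Assumption: cluster3 follows the row order of biofam.seq, so the
# original ids (the row names) can be paired with it by position.
membership <- data.frame(id = rownames(biofam.seq),
                         cluster = cluster3,
                         stringsAsFactors = FALSE)
head(membership)

# Original ids of the sequences in the first cluster
ids.type1 <- membership$id[membership$cluster == "Type 1"]

# The subset keeps its row names, so the ids stay attached to the sequences
bio1.seq <- biofam.seq[cluster3 == "Type 1", ]
stopifnot(identical(rownames(bio1.seq), ids.type1))
```

Because the subset keeps the original row names, a separate lookup table is only needed when the ids have to be joined back to other data.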
