r - Segmenting a data frame based on a column

Question

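I have a data frame with two columns: one holds numeric values and the other holds labels.

Basically, I want to segment this data frame and turn the second column into a vector of words, under the condition that the difference between any two values of the first column within a segment is less than 1000.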

Expected Result is

C("ABC","ADK")

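In this example, I get a vector C containing the words ABC and ADK, because the difference between row 4 and row 3 is greater than 1000.

Any ideas on how to do this without using a lot of computation?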

score 3 · Accepted Answer

I haven't tested this on a larger dataset, but the following should work:

df <- data.frame(Col1=c(200, 300, 350, 2000, 2200, 2300), 
                 Col2=c("A", "B", "C", "A", "D", "K"))

sapply(split(df$Col2, 
             cumsum(c(1, (diff(df$Col1) > 1000)))), 
       paste, collapse="")
#     1     2 
# "ABC" "ADK"

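In the above:

diff(df$Col1) > 1000 returns a vector of TRUE and FALSE values.
c(1, (diff(df$Col1) > 1000)) coerces that logical vector to numeric and prepends a 1 as the starting point of the first group. That gives a vector like 1 0 0 1 0 0.
cumsum() then turns that vector into the "groups" on which the data is split.
sapply goes through the groups and pastes together the relevant entries of Col2, giving a (named) vector.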
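
To make the steps above concrete, here is a small sketch that prints each intermediate vector for the same sample data. The variable names gaps, starts, and group_id are illustrative only and do not appear in the answer itself; like the original, it assumes Col1 is sorted in increasing order.

```r
df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

# Step 1: TRUE wherever the jump to the next row exceeds 1000
gaps <- diff(df$Col1) > 1000
gaps
# [1] FALSE FALSE  TRUE FALSE FALSE

# Step 2: prepend 1 so the first row opens the first group
starts <- c(1, gaps)
starts
# [1] 1 0 0 1 0 0

# Step 3: the running total turns group starts into group ids
group_id <- cumsum(starts)
group_id
# [1] 1 1 1 2 2 2

# Step 4: split Col2 by group id and paste each group together
result <- sapply(split(df$Col2, group_id), paste, collapse = "")
```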

score 2 · Accepted Answer

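Yet another answer, because nobody has mentioned so far that your problem is a typical case of cluster analysis, and because all the other answers are wrong in the sense that they only compare distances between consecutive points, when you need to compare all pairwise distances.

Finding groups of points in which the distance between any two points is below a threshold can be handled with hierarchical clustering using complete linkage. It is very easy in R: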

df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300), 
                 Col2 = c("A", "B", "C", "A", "D", "K"))

tree <- hclust(dist(df$Col1), method = "complete")
groups <- cutree(tree, h = 1000)
# [1] 1 1 1 2 2 2
sapply(split(df$Col2, groups), paste, collapse = "")
#     1     2 
# "ABC" "ADK"
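
The difference between the two approaches only shows up when consecutive gaps are small but accumulate. The sketch below uses made-up values (x is not from the question) to compare the consecutive-difference grouping with the complete-linkage grouping; by_gap and by_cluster are illustrative names.

```r
# Hypothetical data: every consecutive gap is below 1000,
# but the first and last values are 1800 apart
x <- c(0, 500, 1200, 1800)

# Consecutive-difference grouping: no gap exceeds 1000, so one group
by_gap <- cumsum(c(1, diff(x) > 1000))
by_gap
# [1] 1 1 1 1

# Complete linkage: the height of a merge is the largest pairwise
# distance in the merged cluster, so cutting at 1000 bounds every pair
by_cluster <- cutree(hclust(dist(x), method = "complete"), h = 1000)
by_cluster
# [1] 1 1 2 2

# Largest within-group spread under each grouping
spread_gap <- sapply(split(x, by_gap), function(v) diff(range(v)))
spread_cluster <- sapply(split(x, by_cluster), function(v) diff(range(v)))
```

Here the single group produced by the consecutive-difference approach spans 1800, while both complete-linkage groups stay under the 1000 threshold.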

score 0 · Accepted Answer

Edited based on your clarification

# SAMPLE DATA
df <- data.frame(Col1=c(200, 300, 350, 2000, 2200, 2300, 4500), Col2=c("A", "B", "C", "A", "D", "K", "M"))
df

# Make sure they are the correct mode
df$Col1 <- as.numeric(as.character(df$Col1))
df$Col2 <- as.character(df$Col2)

lessThan <- which(abs(df$Col1[-length(df$Col1)] - df$Col1[-1]) > 1000 )

lapply(lessThan, function(ind)
  c( paste(df$Col2[1:ind], collapse=""),
      paste(df$Col2[(ind+1):length(df$Col2)], collapse="") )  # parentheses needed: ":" binds tighter than "+"
)

Result:

  [[1]]
  [1] "ABC"   "ADKM"

  [[2]]
  [1] "ABCADK" "M"

score 0 · Accepted Answer

Here is one option:

extractGroups <- function(data, threshold){
    #calculate which differences are greater than threshold between values in the first column
    dif <- diff(data[,1]) > threshold

    #edit: as @Ananda suggests, `cumsum` accomplishes these three lines more concisely.

    #identify where the gaps of > threshold are
    dif <- c(which(dif), nrow(data))        
    #identify the length of each of these runs
    dif <- c(dif[1], diff(dif))     
    #create groupings based on the lengths of the above runs
    groups <- inverse.rle(list(lengths=dif, values=1:length(dif)))

    #aggregate by group and paste the characters in the second column together
    aggregate(data[,2], by=list(groups), FUN=paste, collapse="")[,2]
}

And with your example data:

extractGroups(read.table(text="1 200 A
2 300 B
3 350 C
4 2000 A
5 2200 D
6 2300 K", row.names=1), 1000)

[1] "ABC" "ADK"
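
The comment inside the function notes that cumsum can replace the three run-length lines. A minimal sketch of what that might look like follows; the name extractGroupsCumsum is hypothetical, and the function assumes, as the original does, that the first column is sorted in increasing order.

```r
# Hypothetical cumsum-based variant of extractGroups
extractGroupsCumsum <- function(data, threshold) {
  # a new group starts wherever the jump to the next row exceeds threshold
  groups <- cumsum(c(1, diff(data[, 1]) > threshold))
  # paste the second column together within each group, dropping names
  unname(sapply(split(data[, 2], groups), paste, collapse = ""))
}

df <- data.frame(Col1 = c(200, 300, 350, 2000, 2200, 2300),
                 Col2 = c("A", "B", "C", "A", "D", "K"),
                 stringsAsFactors = FALSE)

out <- extractGroupsCumsum(df, 1000)
out
# [1] "ABC" "ADK"
```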
