r - 互換性のない行列を結合して合計する

Question

私の目標は、行名と列名を使用して（そして保存して）、互換性のない2つの行列（異なる次元の行列）を「合計」することです。

私はこのアプローチを考え出しました。行列をdata.tableオブジェクトに変換し、それらを結合してから、列のベクトルを合計します。

例：

> M1
  1 3 4 5 7 8
1 0 0 1 0 0 0
3 0 0 0 0 0 0
4 1 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0
> M2
  1 3 4 5 8
1 0 0 1 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 0 0 0 0 0
8 0 0 0 0 0
> M1 %ms% M2
  1 3 4 5 7 8
1 0 0 2 0 0 0
3 0 0 0 0 0 0
4 2 0 0 0 0 0
5 0 0 0 0 0 0
7 0 0 0 0 1 0
8 0 0 0 0 0 0

これは私のコードです：

M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6)
colnames(M1) <- c(1,3,4,5,7,8)
M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5)
colnames(M2) <- c(1,3,4,5,8)
# to data.table objects
DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn")
DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn")
# join and sum of common columns
if (nrow(DT1) > nrow(DT2)) {
    A <- DT2[DT1, roll = TRUE]
    A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn]
}

その出力：

   rn X1 X3 X4 X5 X7 X8
1:  1  0  0  2  0  0  0
2:  3  0  0  0  0  0  0
3:  4  2  0  0  0  0  0
4:  5  0  0  0  0  0  0
5:  7  0  0  0  0  1  0
6:  8  0  0  0  0  0  0

data.table次に、これをに変換して、matrix行名と列名を修正できます。

質問は次のとおりです。

この手順を一般化する方法は？

ディメンション（および行/列の名前）が事前にわからない行列にこの関数を適用しlist(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1)たいので、自動的に作成する方法が必要です。

要約すると、説明したように動作するマージ手順が必要です。
同じ目標を達成すると同時に、より速く一般化された他の戦略/実装がありますか？data.table（モンスターが私を助けてくれることを願っています）
この手順は、どのような種類の結合（内側、外側など）に同化できますか？

前もって感謝します。

ps：data.tableバージョン1.8.2を使用しています

編集-ソリューション

@アーロンソリューション。外部ライブラリはなく、ベースRのみです。行列のリストでも機能します。

add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows,cols))
  for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m
  out
}

@MadSconeソリューション。reshape2パッケージを使用します。呼び出しごとに2つの行列でのみ機能します。

add_matrices_2 <- function(m1, m2) {
  m <- acast(rbind(melt(M1), melt(M2)), Var1~Var2, fun.aggregate = sum)
  mn <- unique(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  m
}

@アーロンソリューション。Matrixパッケージを使用します。それはスパース行列でのみ機能し、それらのリストでも機能します。

add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i < j
    sparseMatrix(
      i         = ifelse(ilj, i, j),
      j         = ifelse(ilj, j, i),
      x         = s$x,
      dims      = c(nrows, ncols),
      dimnames  = list(rows, cols),
      symmetric = TRUE
    )
  })
  Reduce(`+`, newms)
}

ベンチマークmicrobenchmark（パッケージで100回実行）

Unit: microseconds
   expr                min         lq    median         uq       max
1 add_matrices_1   196.009   257.5865   282.027   291.2735   549.397
2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448

ベンチマークにコメントする必要はありません。@Aaronソリューションが勝ちます。

詳細

パフォーマンス（行列のサイズとスパース性に依存する）に関する洞察については、@ Aaronの編集（およびスパース行列のソリューション:)を参照してくださいadd_matrices_3。

score 6 · Accepted Answer

I'd just line up the names and go to town with base R.

Here's a simple function that takes an unspecified number of matrices and adds them up by their row/column names.

add_matrices_1 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  out <- array(0, dim=c(length(rows), length(cols)), dimnames=list(rows,cols))
  for(M in a) { out[rownames(M), colnames(M)] <- out[rownames(M), colnames(M)] + M }
  out
}

It then works like this:

# giving them rownames and colnames
colnames(M1) <- rownames(M1) <- c(1,3,4,5,7,8)
colnames(M2) <- rownames(M2) <- c(1,3,4,5,8)

add_matrices_1(M1, M2)
#   1 3 4 5 7 8
# 1 0 0 2 0 0 0
# 3 0 0 0 0 0 0
# 4 2 0 0 0 0 0
# 5 0 0 0 0 0 0
# 7 0 0 0 0 1 0
# 8 0 0 0 0 0 0

For bigger matrices, however, it doesn't do as well. Here's a function to make a matrix, choosing n columns out of N possibilities, and filling k spots with non-zero values. (This assumes symmetrical matrices.)

makeM <- function(N, n, k) {
  s1 <- sample(N, n)
  M1 <- array(0, dim=c(n,n), dimnames=list(s1, s1))
  r1 <- sample(n,k, replace=TRUE)
  c1 <- sample(n,k, replace=TRUE)
  M1[cbind(c(r1,c1), c(c1,r1))] <- sample(N,k)
  M1
}

Then here's another version that uses sparse matrices.

add_matrices_3 <- function(...) {
  a <- list(...)
  cols <- sort(unique(unlist(lapply(a, colnames))))
  rows <- sort(unique(unlist(lapply(a, rownames))))
  nrows <- length(rows)
  ncols <- length(cols)
  newms <- lapply(a, function(m) {
    s <- summary(m)
    i <- match(rownames(m), rows)[s$i]
    j <- match(colnames(m), cols)[s$j]
    ilj <- i<j
    sparseMatrix(i=ifelse(ilj, i, j),
                 j=ifelse(ilj, j, i),
                 x=s$x,
                 dims=c(nrows, ncols),
                 dimnames=list(rows, cols), symmetric=TRUE)
  })
  Reduce(`+`, newms)
}

This version is definitely faster when the matrices are large and sparse. (Note that I'm not timing the conversion to a sparse symmetric matrix, as hopefully if that's a suitable format, you'll use that format throughout your code.)

set.seed(50)
M1 <- makeM(10000, 5000, 50)
M2 <- makeM(10000, 5000, 50)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
system.time(add_matrices_1(M1, M2))
#   user  system elapsed 
#  2.987   0.841   4.133 
system.time(add_matrices_3(mm1, mm2))
#   user  system elapsed 
#  0.042   0.012   0.504

But when the matrices are small, my first solution is still faster.

set.seed(50)
M1 <- makeM(100, 50, 20)
M2 <- makeM(100, 50, 20)
mm2 <- Matrix(M2)
mm1 <- Matrix(M1)
microbenchmark(add_matrices_1(M1, M2), add_matrices_3(mm1, mm2))
# Unit: microseconds
#                       expr      min       lq   median        uq       max
# 1   add_matrices_1(M1, M2)  398.495  406.543  423.825  544.0905  43077.27
# 2 add_matrices_3(mm1, mm2) 5734.623 5937.473 6044.007 6286.6675 509584.24

Moral of the story: Size and sparsity matter.

Also, getting it right is more important than saving a few microseconds. It's almost always best to use simple functions and don't worry about speed unless you run into trouble. So in small cases, I'd prefer MadScone's solution, as it's easy to code and simple to understand. When that gets slow, I'd write a function like my first attempt. When that gets slow, I'd write a function like my second attempt.

score 3 · Accepted Answer

Here is a data.table solution. The magic is to add the .SD components (which have identical names in both) then assign the remaining column by reference.

# a function to quickly get the non key columns
nonkey <- function(DT){ setdiff(names(DT),key(DT))}
# the columns in DT1 only
notinR <- setdiff(nonkey(DT1), nonkey(DT2))

#calculate; .. means "up one level"
result <- DT2[DT1, .SD + .SD, roll= TRUE][,notinR := unclass(DT1[, ..notinR])]

# re set the column order to the original (DT1) order
setcolorder(result, names(DT1))

# voila!
result

   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0

I'm not convinced this is a particularly stable solution, given that I'm not sure it isn't fluking the answer because M1 and M2 are subsets of eachother

Edit, an ugly approach using eval

This is made harder because you have non-syntatic names (`1` etc)

inBoth <- intersect(nonkey(DT1), nonKey(DT2))

 backquote <- function(x){paste0('`', x, '`')}
 bqBoth <- backquote(inBoth)

 charexp <- sprintf('list(%s)',paste(c(paste0( bqBoth,'=',  bqBoth, '+ i.',inBoth), backquote(notinR)), collapse = ','))

result2 <- DT2[DT1,eval(parse(text = charexp)), roll = TRUE]
 setcolorder(result2, names(DT1))

# voila!
result2


   rn 1 3 4 5 7 8
1:  1 0 0 2 0 0 0
2:  3 0 0 0 0 0 0
3:  4 2 0 0 0 0 0
4:  5 0 0 0 0 0 0
5:  7 0 0 0 0 1 0
6:  8 0 0 0 0 0 0

score 1 · Accepted Answer

I think I managed to do it with this single disgusting line:

cast(aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum), X1 ~ X2)[,-1]

This makes use of the reshape package. Returned as a data frame so convert to matrix as necessary.

If you want it in the format you suggested in your example, try this:

"%ms%" <- function(m1, m2) {
  m <- as.matrix(cast(aggregate(value ~ X1 + X2, rbind(melt(m1), melt(m2)), sum), X1 ~ X2)[,-1])
  mn <- unique(colnames(m1), colnames(m2))
  rownames(m) <- mn
  colnames(m) <- mn
  return (m)
}

Then you can do:

M1 %ms% M2

EDIT:

EXPLANATION

Obviously should have some explanation sorry.

melt(M1)

Converts M1 from its original form into a format like this (row, col, value). E.g.

    1 3 4 5 7 8
  1 0 0 1 0 0 0
  3 0 0 0 0 0 0
  4 1 0 0 0 0 0
  5 0 0 0 0 0 0
  7 0 0 0 0 1 0
  8 0 0 0 0 0 0

Is converted to:

  X1 X2 value 
1  1  1     0
2  3  1     0
3  4  1     1

etc. Combining M1 and M2 lists every possible (row, col, value) across both matrix into one single matrix. Now this:

aggregate(value ~ X1 + X2, rbind(melt(M1), melt(M2)), sum)

Sums values where the row and column are the same. So it will sum (1, 1) across both matrices for example. And (3, 1) etc. It won't do anything that doesn't exist e.g. M2 doesn't have a 7th column/row.

Finally cast transforms the matrix so that it is written with the result of aggregate's first column as rows, its second column as columns. Effectively undoing the melt from earlier. The [,-1] is taking off an unnecessary column leftover from cast (I think there is probably a better way of doing that but I don't know how).

As I said, it's returned as a data frame so use as.matrix() on the result if that's what you wish.

r - 互換性のない行列を結合して合計する

3 に答える 3

EXPLANATION

Related

Reference