r - Efficient Jaccard similarity for a DocumentTermMatrix

Question

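I need a way to efficiently calculate the Jaccard similarity between the documents of a tm::DocumentTermMatrix. I can do something similar for cosine similarity via the slam package, as shown in this answer. I also came across another question and answer on CrossValidated that was R specific, but it was about the matrix algebra and not necessarily the most efficient route. I tried to implement that solution with the more efficient slam functions, but I do not get the same result as with the inefficient approach of coercing the DTM to a matrix and using proxy::dist.

How can I efficiently calculate the Jaccard similarity between the documents of a large DocumentTermMatrix in R?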

#Data and packages

library(Matrix);library(proxy);library(tm);library(slam)

mat <- structure(list(i = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 
    2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L), j = c(1L, 1L, 2L, 2L, 3L, 3L, 
    4L, 4L, 4L, 5L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), v = c(1, 
    1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1), nrow = 4L, 
        ncol = 12L, dimnames = structure(list(Docs = c("1", "2", 
        "3", "4"), Terms = c("computer", "is", "fun", "not", "too", 
        "no", "it's", "dumb", "what", "should", "we", "do")), .Names = c("Docs", 
        "Terms"))), .Names = c("i", "j", "v", "nrow", "ncol", "dimnames"
    ), class = c("DocumentTermMatrix", "simple_triplet_matrix"), weighting = c("term frequency", 
    "tf"))

#Inefficient calculation (expected output)

proxy::dist(as.matrix(mat), method = 'jaccard')

##       1     2     3
## 2 0.000            
## 3 0.875 0.875      
## 4 1.000 1.000 1.000

#My attempt

A <- slam::tcrossprod_simple_triplet_matrix(mat)
im <- which(A > 0, arr.ind=TRUE)
b <- slam::row_sums(mat)
Aim <- A[im]

stats::as.dist(Matrix::sparseMatrix(
      i = im[,1],
      j = im[,2],
      x = Aim / (b[im[,1]] + b[im[,2]] - Aim),
      dims = dim(A)
))

##     1   2   3
## 2 2.0        
## 3 0.1 0.1    
## 4 0.0 0.0 0.0

The outputs do not match.

For reference, here is the original text:

c("Computer is fun. Not too fun.", "Computer is fun. Not too fun.", 
    "No it's not, it's dumb.", "What should we do?")

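As seen in the proxy::dist solution, I would expect elements 1 and 2 to be at distance 0, and element 3 to be closer to element 1 than element 4 is to element 1 (I would expect element 4 to be at the furthest distance, since it shares no words with the others).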
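
The mismatch in the attempt above is consistent with two things: proxy::dist treats the matrix as presence/absence, whereas the attempt works on raw term counts, and the attempt returns a similarity rather than a distance. Below is a minimal sketch of one possible fix, assuming a binarized copy of the DTM is acceptable and that the dense documents-by-documents result fits in memory. The names `bin`, `shared` and `n_terms` are illustrative, and the toy data is rebuilt with slam::simple_triplet_matrix so the block runs without tm.

```r
library(slam)

# Same toy data as above, as a plain simple_triplet_matrix
m <- simple_triplet_matrix(
  i = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
  j = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L),
  v = c(1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1),
  nrow = 4L, ncol = 12L
)

# Binarize: assumes every stored entry is a non-zero count
bin <- m
bin$v <- rep(1, length(bin$v))

# Dense matrix holding the number of terms shared by each pair of documents
shared <- tcrossprod_simple_triplet_matrix(bin)

# Number of distinct terms in each document
n_terms <- row_sums(bin)

# |A and B| / (|A| + |B| - |A and B|); NaN for pairs of empty documents
jaccard_sim  <- shared / (outer(n_terms, n_terms, "+") - shared)
jaccard_dist <- as.dist(1 - jaccard_sim)
jaccard_dist
```

On the toy data this gives 0 for documents 1 and 2, 0.875 between document 3 and documents 1 and 2, and 1 for every pair involving document 4, which agrees with the proxy::dist output. For a real tm DTM, tm::weightBin(mat) should be an alternative way to get the binarized matrix.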

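EDIT

Note that even for a medium-sized DTM the matrix becomes huge. Here is an example using the vegan package. Note that it takes about 4 minutes to solve, whereas the cosine similarity takes about 5 seconds.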

library(qdap); library(quanteda);library(vegan);library(slam)
x <- quanteda::convert(quanteda::dfm(rep(pres_debates2012$dialogue), stem = FALSE, 
        verbose = FALSE, removeNumbers = FALSE), to = 'tm')


## <<DocumentTermMatrix (documents: 2912, terms: 3368)>>
## Non-/sparse entries: 37836/9769780
## Sparsity           : 100%
## Maximal term length: 16
## Weighting          : term frequency (tf)

tic <- Sys.time()
jaccard_dist_mat <- vegan::vegdist(as.matrix(x), method = 'jaccard')
Sys.time() - tic #Time difference of 4.01837 mins

tic <- Sys.time()
tdm <- t(x)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
Sys.time() - tic #Time difference of 5.024992 secs

score 3 · Accepted Answer

How about vegdist() from the vegan package? It uses C code and is approximately 10x faster than proxy:

library(vegan)
vegdist(as.matrix(mat), method = 'jaccard')
##    1   2   3
## 2 0.0        
## 3 0.9 0.9    
## 4 1.0 1.0 1.0

library(microbenchmark)
matt <- as.matrix(mat)
microbenchmark(proxy::dist(matt, method = 'jaccard'),
               vegdist(matt, method = 'jaccard'))

## Unit: microseconds
##                                   expr      min        lq      mean
##  proxy::dist(matt, method = "jaccard") 4879.338 4995.2755 5133.9305
##      vegdist(matt, method = "jaccard")  587.935  633.2625  703.8335
##    median       uq      max neval
##  5069.203 5157.520 7549.346   100
##   671.466  723.569 1305.357   100
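
Note that the 0.9 above differs from the 0.875 that proxy::dist reports. This is consistent with vegdist() computing the quantitative (abundance-based) Jaccard index by default. Below is a small sketch, assuming presence/absence Jaccard is what is wanted, that uses the `binary = TRUE` argument of vegdist() on a dense copy of the toy matrix (`dense`, `quant` and `pres_abs` are illustrative names):

```r
library(vegan)

# Dense copy of the toy DTM from the question (rows = documents 1 to 4)
dense <- rbind(
  c(1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  c(1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0),
  c(0, 0, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0),
  c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
)

# Default: quantitative Jaccard on the raw counts (0.9 for documents 1 and 3)
quant <- vegdist(dense, method = "jaccard")

# Presence/absence standardization first (0.875 for documents 1 and 3)
pres_abs <- vegdist(dense, method = "jaccard", binary = TRUE)

quant
pres_abs
```

With `binary = TRUE` the result should match the expected output in the question. The as.matrix() coercion of a large DTM is still needed, so the memory concern raised in the question's edit remains.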

score 1 · Accepted Answer

If you use stringdistmatrix from the stringdist package and run it in parallel with the nthread option, it is pretty fast. On average it is about 6 seconds slower than the test with cosine similarity.

library(qdap)
library(slam)
library(stringdist)
data(pres_debates2012)

x <- quanteda::convert(quanteda::dfm(rep(pres_debates2012$dialogue), stem = FALSE, 
                                     verbose = FALSE, removeNumbers = FALSE), to = 'tm')

tic <- Sys.time()
tdm <- t(x)
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
Sys.time() - tic #Time difference of 4.069233 secs

tic <- Sys.time()
t <- stringdistmatrix(pres_debates2012$dialogue, method = "jaccard", nthread = 4)
Sys.time() - tic #Time difference of 10.18158 secs
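
One caveat, as far as I understand the stringdist documentation: `method = "jaccard"` compares the sets of character q-grams of the raw strings (with `q = 1` by default), not the sets of terms in a DTM, so the result is a different quantity from the proxy::dist output in the question. A small sketch with made-up strings (`d_same_chars` and `d_one_diff` are illustrative names):

```r
library(stringdist)

# Same set of characters, different word order: character-level distance is 0
d_same_chars <- stringdist("computer is fun", "fun is computer",
                           method = "jaccard", q = 1)

# {a, b, c} vs {a, b, d}: 2 shared out of 4 distinct characters, distance 0.5
d_one_diff <- stringdist("abc", "abd", method = "jaccard", q = 1)

d_same_chars
d_one_diff
```

If a term-level Jaccard distance is needed, the timing above is therefore not directly comparable with the DTM-based approaches.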
