r - Text analysis in R

Question

I have a large dataset (460 MB) with a column of logs containing 386551 rows. I want to use clustering and an N-gram approach to build a word cloud. My code is as follows:

library(readr)
AMC <- read_csv("All Tickets.csv")
Desc <- AMC[,4]

# The data is very large, so it is broken into chunks before creating the corpus
# DataframeSource is used instead of VectorSource to be able to handle the data

library(tm)
docs_new <- data.frame(Desc)

test1 <- docs_new[1:100000,]
test2 <- docs_new[100001:200000,]
test3 <- docs_new[200001:300000,]
test4 <- docs_new[300001:386551,]
test1 <- data.frame(test1)
test1 <- Corpus(DataframeSource(test1))
test2 <- data.frame(test2)
test2 <- Corpus(DataframeSource(test2))
test3 <- data.frame(test3)
test3 <- Corpus(DataframeSource(test3))
test4 <- data.frame(test4)
test4 <- Corpus(DataframeSource(test4))

# Combine all the corpora
docs_new <- c(test1,test2,test3,test4)

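# tolower is a base R function, so it is wrapped in content_transformer()
# to keep each document a tm text document
docs_new <- tm_map(docs_new, content_transformer(tolower))
docs_new <- tm_map(docs_new, removePunctuation)
docs_new <- tm_map(docs_new, removeNumbers)
docs_new <- tm_map(docs_new, removeWords, stopwords(kind = "en"))
docs_new <- tm_map(docs_new, stripWhitespace)
docs_new <- tm_map(docs_new, stemDocument)
# With content_transformer() above, the extra
# tm_map(docs_new, PlainTextDocument) conversion is no longer needed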

# Tokenizer for a TDM with n-grams (bigrams only: min = 2, max = 2)
library(RWeka)
options(mc.cores = 1)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(docs_new, control = list(tokenize = BigramTokenizer))
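
To make the intended tokenization concrete, here is a minimal base R sketch of what a bigram tokenizer produces for a single string. It is not RWeka's implementation: `bigrams` is a hypothetical helper that simply splits on whitespace and pairs adjacent words.

```r
# Hypothetical helper: pair each word with the word that follows it.
bigrams <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]          # drop empty tokens from leading whitespace
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

bigrams("server not responding today")
# "server not" "not responding" "responding today"
```

Each distinct pair becomes one row (term) of the term-document matrix, which is why the number of terms grows so quickly with bigrams.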

This gives the following result:

<<TermDocumentMatrix (terms: 1874071, documents: 386551)>>
Non-/sparse entries: 17313767/724406705354
Sparsity           : 100%
Maximal term length: 733
Weighting          : term frequency (tf)

I then converted it to a dgMatrix using:

library("Matrix")
mat <- sparseMatrix(i=tdm$i, j=tdm$j, x=tdm$v, dims=c(tdm$nrow, tdm$ncol))

When I try to use the following, I get a memory size error:

removeSparseTerms(tdm, 0.2)

I am new to text analytics, so please suggest how to proceed.
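
For context, tm documents removeSparseTerms(x, sparse) as removing terms that have at least a `sparse` fraction of empty entries, so a value of 0.2 would keep only terms that occur in more than 80% of the documents. The following is a minimal sketch, under that assumption, of the same kind of threshold applied directly to a Matrix object built from triplets. The values of `i`, `j` and `v` are hypothetical stand-ins for tdm$i, tdm$j and tdm$v, and the sketch assumes each (term, document) pair appears only once in the triplets.

```r
library(Matrix)

# Toy triplets standing in for tdm$i, tdm$j, tdm$v (hypothetical values).
i <- c(1, 1, 1, 2, 3, 3)   # term (row) indices
j <- c(1, 2, 3, 1, 2, 3)   # document (column) indices
v <- c(2, 1, 1, 1, 3, 1)   # term counts
mat <- sparseMatrix(i = i, j = j, x = v, dims = c(3, 3))

# Document frequency: number of documents in which each term occurs.
doc_freq <- tabulate(i, nbins = nrow(mat))

# Keep terms that occur in more than (1 - sparse) of the documents.
sparse <- 0.5
keep <- doc_freq > ncol(mat) * (1 - sparse)
mat_small <- mat[keep, , drop = FALSE]

dim(mat_small)
```

Because the document frequencies are computed from the index vector alone, the filter never needs a dense copy of the matrix.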
