r - Frequency per term - R TM DocumentTermMatrix

Question

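I'm very new to R and can't quite get my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I can't figure out how to access them.

Ideally, I would like: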

    Term  # 
    "the" 200 
    "is"  400 
    "a"   200

Currently, my code is:

    library(tm)
    common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
    x <- Corpus(VectorSource(results)) 
    x <- tm_map(x, stripWhitespace) 
    x <- tm_map(x, removeNumbers) 
    x <- tm_map(x, removePunctuation) 
    x <- tm_map(x, stripWhitespace)

    dtm <- DocumentTermMatrix(x)
    for (i in 1:length(common.words)) {
        dtm <- dtm[, !colnames(dtm) %in% c(common.words[i])]
    }

This is the output of str(dtm):

    List of 6
     $ i       : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
     $ j       : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
     $ v       : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
     $ nrow    : int 1477
     $ ncol    : int 3201
     $ dimnames:List of 2
      ..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
      ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thank you,

-A

score 7 · Accepted Answer

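This looks like a sparse-matrix representation of the data. The frequencies appear to be in the "v" component, and you can probably retrieve them by looking up each term's position in the Terms dimnames. Why not provide dput(head(results, 30)) so that your code (and your SO audience) has something to work with? After trying the example in the package, I think what you actually want is something like this: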

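    tdm <- TermDocumentMatrix(x)
    # Every term requested here must exist in rownames(tdm), otherwise subsetting fails
    z <- as.matrix(tdm[c("the", "is", "a"), ])  # dense terms-by-documents matrix
    rowSums(z)  # total count of each term across all documents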
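
To make the sparse layout described above concrete, here is a minimal sketch on a toy corpus. The docs vector is made up for illustration and is not the asker's results. It assumes the tm and slam packages are installed; slam supplies the simple_triplet_matrix class that appears in the str(dtm) output. Each value v[k] is the count of term Terms[j[k]] in document i[k], so summing v by j gives per-term totals, and slam's col_sums() should give the same totals without converting to a dense matrix.

```r
library(tm)
library(slam)

# Toy documents, purely illustrative (not the asker's data)
docs <- c("the cat sat on the mat", "the dog sat", "cat and dog")
corpus <- VCorpus(VectorSource(docs))

# wordLengths = c(1, Inf) keeps short words; by default tm drops words under 3 characters
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Sum the triplet values by column index, then label the sums with the term names
by_hand <- tapply(dtm$v, dtm$j, sum)
names(by_hand) <- colnames(dtm)[as.integer(names(by_hand))]

# Same totals computed by slam on the sparse matrix
freq <- col_sums(dtm)

# Term / frequency table in the shape the question asks for
term_freq <- data.frame(Term = names(freq), Frequency = as.vector(freq))
term_freq
```

For a TermDocumentMatrix the terms are the rows, so row_sums() would be used in place of col_sums().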

score 3 · Accepted Answer

I had the same problem and found what I think is an easier way:

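    library(dplyr)  # provides %>%, arrange() and desc()

    num <- 10  # number of terms to show
    # findFreqTerms() is not ranked by count, so [1:num] just takes the first num terms
    tdm[findFreqTerms(tdm)[1:num], ] %>%
        as.matrix() %>%
        rowSums()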

Printing it in columns is harder (I'm sure someone has a much better way than this):

    # Uses %>%, arrange() and desc() from dplyr
    terms <- findFreqTerms(tdm)[1:num]
    tdm[terms, ] %>%
        as.matrix() %>%
        rowSums() %>%
        data.frame(Term = terms, Frequency = .) %>%
        arrange(desc(Frequency))
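
If the goal is the num most frequent terms ranked by count, one option is to sort the per-term totals directly, since findFreqTerms() only filters by frequency bounds and does not order its result. The following is a sketch under stated assumptions: the slam package is installed, and the toy docs vector is invented for illustration.

```r
library(tm)
library(slam)

# Toy documents, purely illustrative
docs <- c("the cat sat on the mat", "the dog sat", "cat and dog")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)),
                          control = list(wordLengths = c(1, Inf)))

num <- 3  # number of top terms to keep

# Total count per term, sorted from most to least frequent
freq <- sort(row_sums(tdm), decreasing = TRUE)
top <- head(freq, num)

# Two-column table: one row per term, highest count first
top_terms <- data.frame(Term = names(top), Frequency = as.vector(top),
                        row.names = NULL)
top_terms
```

Working on row_sums() keeps the matrix sparse, which matters for a matrix like the 1477 x 3201 one in the question, where as.matrix() would allocate every cell.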
