r - Counting the elements of a character vector in a corpus

Question

My goal is to use R for lexicon-based sentiment analysis!

I have two character vectors, one with positive words and one with negative words. For example:

pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

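I now have a corpus of thousands of news articles, and for each article I want to know how many elements of my vectors pos and neg it contains.

Example (I'm not sure how the Corpus function works here, but you get the idea: my corpus contains two articles):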

mycorpus <- Corpus("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")

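I would like to get something like this:

article 1: 2 elements of pos, 0 elements of neg
article 2: 0 elements of pos, 2 elements of neg

It would also be nice to get the following for each article:

(number of positive words - number of negative words) / (total number of words in the article)

Thank you very much!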

Edit:

@Victorp: this doesn't seem to work.

The matrix I get looks fine:

mytdm[1:6,1:10]
               Docs
Terms          1 2 3 4 5 6 7 8 9 10
aaron          0 0 0 0 0 1 0 0 0  0
abandon        1 1 0 0 0 0 0 0 0  0
abandoned      0 0 0 3 0 0 0 0 0  0
abbey          0 0 0 0 0 0 0 0 0  0
abbott         0 0 0 0 0 0 0 0 0  0
abbotts        0 0 1 0 0 0 0 0 0  0

But when I run your command, I get zero for every document!

colSums(mytdm[rownames(mytdm) %in% pos, ])
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
  16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0

Why is that?

score 1 · Accepted Answer

Here is another approach:

## pos <- c("good", "accomplished", "won", "happy")
## neg <- c("bad", "loss", "damaged", "sued", "disaster")
## 
## mycorpus <- Corpus(VectorSource(
##     list("The CEO is happy that they finally won the case.", 
##     "The disaster caused a huge loss.")))

library(qdap)
with(tm_corpus2df(mycorpus), termco(text, docs, list(pos=pos, neg=neg)))

##   docs word.count       pos       neg
## 1    1         10 2(20.00%)         0
## 2    2          6         0 2(33.33%)
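If qdap is not available, the same per-article counts can be sketched in base R. This is only a minimal sketch under some assumptions: `tokenize` is a hypothetical helper (not part of tm or qdap), words are assumed to be separated by whitespace, and matching is assumed to be case-insensitive.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

# The two example articles from the question
articles <- c("The CEO is happy that they finally won the case.",
              "The disaster caused a huge loss.")

# Hypothetical helper: lower-case, strip punctuation, split on whitespace
tokenize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  strsplit(trimws(x), "[[:space:]]+")
}

tokens <- tokenize(articles)

# One row per article: total words and number of lexicon hits
counts <- data.frame(
  doc        = seq_along(tokens),
  word_count = lengths(tokens),
  pos        = vapply(tokens, function(w) sum(w %in% pos), integer(1)),
  neg        = vapply(tokens, function(w) sum(w %in% neg), integer(1))
)
counts

##   doc word_count pos neg
## 1   1         10   2   0
## 2   2          6   0   2
```

Unlike the qdap call above, this does no stemming or stopword handling, so "winning" would not match "won".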

score 1 · Accepted Answer

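Hi, you can use a TermDocumentMatrix to do that:

library(tm)

# pos and neg are the character vectors defined in the question
mycorpus <- Corpus(VectorSource(c("The CEO is happy that they finally won the case.", "The disaster caused a huge loss.")))
mytdm <- TermDocumentMatrix(mycorpus, control=list(removePunctuation=TRUE))
mytdm <- as.matrix(mytdm)

# Positive words (drop=FALSE keeps a matrix even if only one term matches)
colSums(mytdm[rownames(mytdm) %in% pos, , drop=FALSE])
1 2 
2 0 

# Negative words
colSums(mytdm[rownames(mytdm) %in% neg, , drop=FALSE])
1 2 
0 2 

# Total number of words per document
# (by default tm drops words shorter than 3 characters, so "is" and "a" are not counted)
colSums(mytdm)
1 2 
9 5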
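The per-article score asked for in the question, (positive - negative) / total, follows from the same three column sums. The sketch below is base R only: the small matrix is hand-entered as a stand-in for `as.matrix(mytdm)` from the two example sentences (an assumption for illustration, not output copied from tm), and `n_pos`, `n_neg`, `n_all` and `score` are names introduced here.

```r
pos <- c("good", "accomplished", "won", "happy")
neg <- c("bad", "loss", "damaged", "sued", "disaster")

# Hand-built stand-in for as.matrix(mytdm): terms in rows, documents in columns
terms <- c("case", "caused", "ceo", "disaster", "finally", "happy",
           "huge", "loss", "that", "the", "they", "won")
mytdm <- matrix(0, nrow = length(terms), ncol = 2,
                dimnames = list(Terms = terms, Docs = c("1", "2")))
mytdm[c("ceo", "happy", "that", "they", "finally", "won", "case"), "1"] <- 1
mytdm["the", "1"] <- 2
mytdm[c("the", "disaster", "caused", "huge", "loss"), "2"] <- 1

# drop = FALSE keeps a matrix even when only one (or no) term matches
n_pos <- colSums(mytdm[rownames(mytdm) %in% pos, , drop = FALSE])
n_neg <- colSums(mytdm[rownames(mytdm) %in% neg, , drop = FALSE])
n_all <- colSums(mytdm)

# Score per document: (positive - negative) / total words
score <- (n_pos - n_neg) / n_all
score  # document 1: 2/9, document 2: -2/5
```

If every count comes back as zero on a real corpus, `intersect(rownames(mytdm), pos)` shows which lexicon words actually occur as row names, which may help narrow down the mismatch.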
