r - R text mining - how to change the texts in an R data frame column into several columns with word frequencies?

Question

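I have a data frame with 4 columns. Column 1 consists of IDs, column 2 consists of texts (about 100 words each), and columns 3 and 4 consist of labels.

Now I would like to retrieve word frequencies (of the most common words) from the texts column and add those frequencies as extra columns to the data frame. I would like the column names to be the words themselves, and the columns to be filled with each word's frequency in the text (ranging from 0 upward, per text).

I tried some functions of the tm package, but so far the results have been unsatisfactory. Does anyone have an idea how to deal with this problem or where to start? Is there a package that can do the job?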

id  texts   label1    label2

score 7 · Accepted Answer

So, let's solve the problem...

I assume you have a data.frame that looks like this:

       person sex adult                                 state code
1         sam   m     0         Computer is fun. Not too fun.   K1
2        greg   m     0               No it's not, it's dumb.   K2
3     teacher   m     1                    What should we do?   K3
4         sam   m     0                  You liar, it stinks!   K4
5        greg   m     0               I am telling the truth!   K5
6       sally   f     0                How can we be certain?   K6
7        greg   m     0                      There is no way.   K7
8         sam   m     0                       I distrust you.   K8
9       sally   f     0           What are you talking about?   K9
10 researcher   f     1         Shall we move on?  Good then.  K10
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11

This data set comes from the qdap package. To get qdap, use install.packages("qdap").

Now, to make a reproducible example of the kind you describe with your own data set, do what I do here with the DATA data set from qdap:

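library(qdap)      # provides the DATA data set and wfm()
DATA               # print the example data set
dput(head(DATA))   # emit R code that recreates the first rows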
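To show why dput() is useful for sharing a reproducible example, here is a minimal sketch in base R. The data frame `df` and its columns are hypothetical stand-ins for the asker's data, not something taken from the question.

```r
# Hypothetical toy data frame standing in for the asker's data
df <- data.frame(
  id     = 1:2,
  texts  = c("Computer is fun.", "No it's not."),
  label1 = c("a", "b"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; capture it as text here
code <- paste(capture.output(dput(df)), collapse = "\n")

# Evaluating that text recreates the data frame
rebuilt <- eval(parse(text = code))
```

Pasting the dput() output into a question lets others rebuild the same object without needing the original file.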

OK, now for your original problem. I think wfm will do what you want:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
data.frame(DATA, freqs, check.names = FALSE)
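If installing qdap is not an option, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the toy data frame `df`, its column names, and the simple tokenisation rule (lower-case, keep letters and apostrophes) are my own choices, and the result is not guaranteed to match what wfm produces on every input.

```r
# Hypothetical toy data standing in for the asker's data frame
df <- data.frame(
  id     = 1:3,
  texts  = c("Computer is fun. Not too fun.",
             "No it's not, it's dumb.",
             "There is no way."),
  label1 = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

# Lower-case, replace everything except letters, apostrophes and spaces
# with a space, then split each text on runs of spaces
clean  <- gsub("[^a-z' ]", " ", tolower(df$texts))
tokens <- strsplit(trimws(clean), " +")

# Shared vocabulary so every text is counted against the same words
vocab <- sort(unique(unlist(tokens)))

# One row per text, one column per word
freqs <- t(vapply(tokens,
                  function(w) as.integer(table(factor(w, levels = vocab))),
                  integer(length(vocab))))
colnames(freqs) <- vocab

# check.names = FALSE keeps column names such as "it's" unchanged
result <- data.frame(df, freqs, check.names = FALSE)
```

Counting against a fixed set of factor levels is what makes words that are absent from a text show up as 0 rather than being dropped.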

If you only want the top so many words, use an ordering technique like the one I use here:

freqs <- t(wfm(DATA$state, 1:nrow(DATA)))
ords <- rev(sort(colSums(freqs)))[1:9]      #top 9 words
top9 <- freqs[, names(ords)]                #grab those columns from freqs  
data.frame(DATA, top9, check.names = FALSE) #put it together

The result looks like this:

> data.frame(DATA, top9, check.names = FALSE)
       person sex adult                                 state code you we what not no it's is i fun
1         sam   m     0         Computer is fun. Not too fun.   K1   0  0    0   1  0    0  1 0   2
2        greg   m     0               No it's not, it's dumb.   K2   0  0    0   1  1    2  0 0   0
3     teacher   m     1                    What should we do?   K3   0  1    1   0  0    0  0 0   0
4         sam   m     0                  You liar, it stinks!   K4   1  0    0   0  0    0  0 0   0
5        greg   m     0               I am telling the truth!   K5   0  0    0   0  0    0  0 1   0
6       sally   f     0                How can we be certain?   K6   0  1    0   0  0    0  0 0   0
7        greg   m     0                      There is no way.   K7   0  0    0   0  1    0  1 0   0
8         sam   m     0                       I distrust you.   K8   1  0    0   0  0    0  0 1   0
9       sally   f     0           What are you talking about?   K9   1  0    1   0  0    0  0 0   0
10 researcher   f     1         Shall we move on?  Good then.  K10   0  1    0   0  0    0  0 0   0
11       greg   m     0 I'm hungry.  Let's eat.  You already?  K11   1  0    0   0  0    0  0 0   0
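The top-words step can be wrapped in a small helper so the number of words is a parameter. The sketch below is base R only; the function name `top_n_words` and the toy count matrix are hypothetical and not part of qdap.

```r
# Hypothetical helper: keep the n columns of a count matrix with the largest totals
top_n_words <- function(freqs, n) {
  totals <- colSums(freqs)
  n      <- min(n, length(totals))   # guard against asking for too many words
  keep   <- names(sort(totals, decreasing = TRUE))[seq_len(n)]
  freqs[, keep, drop = FALSE]        # drop = FALSE keeps a matrix even when n is 1
}

# Toy counts: rows are texts, columns are words
freqs <- cbind(fun  = c(2, 0, 1),
               dumb = c(0, 1, 0),
               is   = c(1, 0, 1))

top2 <- top_n_words(freqs, 2)
```

When several words share the same total, which of them makes the cut depends on the sort order, so it is worth checking ties if the exact set of columns matters.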
