r - R tm/qdap - Retrieving documents based on a term

Question

I'm trying to find a way to identify documents (in this case, tweets) based on a term they may contain.

Suppose I have this data frame (df), which lists Twitter users' screen names along with one tweet from each:

> df
     ScreenName tweet                         
[1,] "Guy A"    "one random tweet"            
[2,] "Guy B"    "another random tweet"        
[3,] "Guy C"    "a third random piece of text"

Now, from this data frame, I want to retrieve the tweets that contain a specific term, for example "tweet", and extract them into a new data frame (df2) like this:

> df2
     ScreenName tweet                 
[1,] "Guy A"    "one random tweet"    
[2,] "Guy B"    "another random tweet"

I assume there must be a way to do this with the tm or qdap packages, but I couldn't find anything, so I ended up with the following mess.

After cleaning up the corpus, I convert it to a TermDocumentMatrix:

tdm <- TermDocumentMatrix(corpus, control=list(minWordLength=1))

Next, I extract the row of the term-document matrix that corresponds to the term I'm interested in:

t <- as.vector(tdm[term,])

Subset to the documents in which the term is mentioned at least once:

t.df <- as.data.frame(t)
t.sub <- subset(t.df, t >= 1)

Get the document numbers (row numbers):

t.n <- as.numeric(rownames(t.sub))

Create a new data frame, t.tw, containing only the tweets that mention the term, and another, t.o, containing the remaining tweets:

t.tw <- tw[t.n,]
t.o <- tw[!1:nrow(tw) %in% t.n, ]

Thanks for your help.

Apologies if the horrible code above offends any experienced R users.

score 0 · Accepted Answer

I'd stay in base R for this and use the grep function, along the following lines (assuming you already have a data.frame):

df[grep("tweet", df$tweet), ]

Here it is with your data in full:

df <- read.table(text='ScreenName tweet                         
"Guy A"    "one random tweet"            
"Guy B"    "another random tweet"        
"Guy C"    "a third random piece of text"', header=TRUE)

df[grep("tweet", df$tweet), ]

##   ScreenName                tweet
## 1      Guy A     one random tweet
## 2      Guy B another random tweet
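
Since the question also asks for the tweets that do not mention the term, here is a minimal sketch of one way to get both halves in base R. It uses grepl, which returns one TRUE/FALSE per row, so the non-matching rows are simply the negated mask. The names has_term, df_other and has_word are made up for this illustration, and the two sample strings at the end are not from the question.

```r
# Sample data mirroring the question's df
df <- data.frame(
  ScreenName = c("Guy A", "Guy B", "Guy C"),
  tweet = c("one random tweet", "another random tweet",
            "a third random piece of text"),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector, one element per tweet;
# fixed = TRUE treats the term as a literal string, not a regex
has_term <- grepl("tweet", df$tweet, fixed = TRUE)

df2      <- df[has_term, ]   # tweets mentioning the term
df_other <- df[!has_term, ]  # all remaining tweets

# Optional variation: match the term only as a whole word, ignoring case,
# so "Tweet" matches but "retweeted" does not
has_word <- grepl("\\btweet\\b", c("Tweet this", "retweeted twice"),
                  ignore.case = TRUE, perl = TRUE)
```

Whether a plain substring match or a whole-word match is appropriate depends on the term; with real tweets a substring search for "tweet" would also pick up "retweet" and "tweets".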
