
I used the rm_stopwords function from the qdap package to remove stopwords and punctuation from a text column of a data frame.

library(qdap)
library(dplyr)
library(tm)

glimpse(dat_full)
Observations: 500
Variables: 9
$ reviewerID     <chr> "ABF0ARHORHUUC", "AH4KMS2YC6TXA", "A2IXK5LB...
$ asin           <chr> "B00BE6C9S0", "B009X78DKU", "B0077PM3KG", "...
$ reviewerName   <chr> "stuartm \"stuartm\"", "HottMess", "G. Farn...
$ helpful        <list> [<1, 2>, <0, 0>, <0, 0>, <0, 0>, <0, 0>, <...
$ reviewText     <chr> "I've used the Mophie juice pack for my iPh...
$ overall        <dbl> 3, 5, 5, 5, 5, 3, 3, 5, 5, 5, 5, 4, 5, 5, 3...
$ summary        <chr> "Case issues limit utility of this device",...
$ unixReviewTime <int> 1375142400, 1355356800, 1383350400, 1367193...
$ reviewTime     <chr> "07 30, 2013", "12 13, 2012", "11 2, 2013",...

full_dat <- dat_full  # work on a copy so dat_full stays unchanged
full_dat$reviewText = rm_stopwords(full_dat$reviewText, 
                                   tm::stopwords("english"), strip = TRUE)

The function returns a list for the reviewText column.

glimpse(full_dat)
Observations: 500
Variables: 9
$ reviewerID     <chr> "ABF0ARHORHUUC", "AH4KMS2YC6TXA", "A2IXK5LB...
$ asin           <chr> "B00BE6C9S0", "B009X78DKU", "B0077PM3KG", "...
$ reviewerName   <chr> "stuartm \"stuartm\"", "HottMess", "G. Farn...
$ helpful        <list> [<1, 2>, <0, 0>, <0, 0>, <0, 0>, <0, 0>, <...
$ reviewText     <list> [<"used", "mophie", "juice", "pack", "ipho...
$ overall        <dbl> 3, 5, 5, 5, 5, 3, 3, 5, 5, 5, 5, 4, 5, 5, 3...
$ summary        <chr> "Case issues limit utility of this device",...
$ unixReviewTime <int> 1375142400, 1355356800, 1383350400, 1367193...
$ reviewTime     <chr> "07 30, 2013", "12 13, 2012", "11 2, 2013",...

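Any ideas on how to prevent that (i.e. keep the original format), or how to unlist/unnest the column to get back to the original format?

The result should look like the original data frame, but without the stopwords and punctuation.

Here is a small dput: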

structure(list(reviewerID = "A3LWYDTO7928SH", asin = "B00B0FT2T4", 
    reviewerName = "D. Lang", helpful = list(c(0L, 0L)), reviewText = "When I first put your glass protector on my phone I was blown away!  (I knew how &#34;degrading&#34; the soft plastic covers were - ruining my experience, so I chose not to have a protector on my screen.)  Then I saw your website and I wondered if it was as good as spoken about.  The answer is YES.  The application was flawless even after I pulled the glass back off because I had not put it on absolutely perfectly.  It repositioned with ease and you could not find a bubble if you had a microscope!  Fascinating to see the viscous material on the back spread out on its own!  Application could not be easier and the quality of the product seems like it came from NASA.", 
    overall = 5, summary = "It is as perfect as a product can get - Really!", 
    unixReviewTime = 1396569600L, reviewTime = "04 4, 2014"), row.names = 145945L, class = "data.frame")

1 Answer


Something like this in a dplyr pipeline. It uses a combination of paste and unlist to collapse each row's words back into a single string.

library(purrr)  # provides map_chr()

full_dat <- dat_full %>% 
  mutate(reviewText = map_chr(reviewText, 
                          # clean one review, then glue its words back into one string
                          function(x) paste0(unlist(qdap::rm_stopwords(x, 
                                                                       tm::stopwords("english"), 
                                                                       strip = TRUE)), 
                                             collapse = " ") 
                          )
         )
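
The collapse step itself does not depend on qdap or purrr. As a minimal base R sketch, assuming a toy data frame (`toy` is a hypothetical name, not from the question) whose reviewText column already holds one character vector of words per row, vapply() with paste() turns the list column back into a plain character column:

```r
# Toy stand-in for the list column that rm_stopwords() produced:
# one character vector of words per row (made-up data for illustration).
toy <- data.frame(id = 1:2, stringsAsFactors = FALSE)
toy$reviewText <- list(c("used", "mophie", "juice", "pack"),
                       c("great", "case"))

# Collapse each word vector into one space-separated string.
# vapply() checks that every result is a single character value.
toy$reviewText <- vapply(toy$reviewText, paste, character(1), collapse = " ")

str(toy$reviewText)
```

If the list column has already been created, this kind of step may be enough to restore a <chr> column without re-running the stopword removal.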
answered 2019-01-26T17:18:38.193