r - Frequency of occurrence of two-word combinations in text data in R

Question

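I have a file with several string (text) variables, where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance"). My code so far is as follows: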

#Setting up the data file 
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")

#Change everything to lower text
data.text <- tolower(data.text)

#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)

#List each word and frequency
data.freq.list <- table(data.words.vector)

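This gives me a list of each word and how often it appears in the string variables. Now I would like to see the frequency of every two-word combination. Is this possible?

Thanks!

Example of the string data: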

ID   Reason_for_Dissatisfaction    Reason_for_Likelihood_to_Switch
1    "not happy with the service"  "better value at other place"
2    "poor customer service"       "tired of same old thing"
3    "they are overchanging me"    "bad service"

score 1 · Accepted Answer

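I'm not sure whether this is what you mean, but rather than splitting on every second word boundary (which I found too fiddly to get right with a regex), you can paste each pair of adjacent words together using the trusty head and tail shift trick.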
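To illustrate the shift on a single sentence before applying it to the whole data frame, here is a minimal sketch. The variable names `x` and `pairs` are only for illustration and do not appear in the code that follows.

```r
# One sentence from the sample data, split into words
x <- strsplit("not happy with the service", "\\W+", perl = TRUE)[[1]]

head(x, -1)  # every word except the last:  "not" "happy" "with" "the"
tail(x, -1)  # every word except the first: "happy" "with" "the" "service"

# Pasting the two shifted vectors element-wise gives the adjacent pairs
pairs <- paste(head(x, -1), tail(x, -1), sep = " ")
pairs
# "not happy" "happy with" "with the" "the service"
```

A response containing a single word yields two empty vectors, so `paste` returns an empty result and that response contributes no pairs.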

#  How I read your data
df <- read.table( text = 'ID   Reason_for_Dissatisfaction    Reason_for_Likelihood_to_Switch
1    "not happy with the service"  "better value at other place"
2    "poor customer service"       "tired of same old thing"
3    "they are overchanging me"    "bad service"
' , h = TRUE , stringsAsFactors = FALSE )


#  Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)

#  Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )

#  Table as per usual
table(unlist( outl ) )
are overchanging         at other      bad service     better value customer service 
               1                1                1                1                1 
      happy with        not happy          of same        old thing      other place 
               1                1                1                1                1 
 overchanging me    poor customer         same old      the service         they are 
               1                1                1                1                1 
        tired of         value at         with the 
               1                1                1
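The table above counts adjacent word pairs only. If the question is instead about any two words appearing in the same response, regardless of position, one possible approach is to build all unordered pairs per response with `combn`. This is a sketch under that assumption, not necessarily what the asker needs; the helper `pair_up` and the fourth response are made up for illustration.

```r
# Three responses from the sample data plus one made-up response
# ("poor service again") so that at least one pair occurs twice
responses <- c("not happy with the service",
               "poor customer service",
               "bad service",
               "poor service again")

# Split into lower-case words, counting each word once per response
words <- lapply(strsplit(tolower(responses), "\\W+", perl = TRUE), unique)

# Hypothetical helper: all unordered word pairs within one response.
# Sorting first gives "poor service" and "service poor" the same label.
pair_up <- function(w) {
  w <- sort(w[nzchar(w)])                  # drop empty strings from the split
  if (length(w) < 2) return(character(0))  # a single word has no pairs
  combn(w, 2, FUN = paste, collapse = " ")
}

# Number of responses in which each pair of words occurs together
cooc <- table(unlist(lapply(words, pair_up)))
cooc["poor service"]
```

With this data, "poor" and "service" occur together in two responses, even though they are not adjacent in "poor customer service". Note that the number of pairs grows quadratically with response length, so this may get large for long texts.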
