r - Extracting and counting common word pairs from a character vector

Question

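How can I find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, some common pairs are "crude oil", "oil market", and "million barrels".

The code in the small example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are immediately followed by another frequent term. But the attempt crashed and burned.

Any guidance would be appreciated on how to create a data frame that shows the common pairs in the first column ("Pairs") and the number of times they appeared in the text in the second column ("Count").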

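library(qdap)
library(tm)
library(stringr)  # needed for str_replace_all() and str_c() below
data("crude")     # the crude data set ships with tm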

# from the crude data set, create a text file from the first three documents, then clean it

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") # replace double spaces with single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

# pick the top 10 individual words by frequency, since they will likely form the most common pairs
freq.terms <- head(freq_terms(text.var = text), 10) 

# create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# match frequent terms that are followed by a frequent term
library(stringr)
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")

This is where my effort falters.

Since I know neither Java nor Python, the existing questions about counting word pairs in those languages did not help.

Thank you.

score 3 · Accepted Answer

First, change your initial text list from:

text <- c(crude[[1]][1], crude[[2]][2], crude[[3]][3])

to:

text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])

Then you can proceed with your text cleaning (note that your method will create ill-formed words like "oilcanadian", but it will suffice for the example at hand):

text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", "") 
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "SMART"))

Build a new corpus:

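# VCorpus rather than Corpus: in recent tm versions Corpus() returns a SimpleCorpus, which ignores custom tokenizers
v <- VCorpus(VectorSource(text))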

Create a bigram tokenizer function:

BigramTokenizer <- function(x) { 
  unlist(
    lapply(ngrams(words(x), 2), paste, collapse = " "), 
    use.names = FALSE
  ) 
}

Create your TermDocumentMatrix using the control parameter tokenize:

tdm <- TermDocumentMatrix(v, control = list(tokenize = BigramTokenizer))

Now that you have your new tdm, to get your desired output, you could do:

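library(dplyr)
# as.matrix() gives the full term-by-document counts; in recent tm versions inspect() only prints a sample
data.frame(as.matrix(tdm)) %>% 
  tibble::rownames_to_column() %>%   # replaces the deprecated add_rownames()
  mutate(total = rowSums(.[,-1])) %>% 
  arrange(desc(total))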

Which gives:

#Source: local data frame [272 x 5]
#
#             rowname X1 X2 X3 total
#1          crude oil  2  0  1     3
#2            mln bpd  0  3  0     3
#3         oil prices  0  3  0     3
#4       cut contract  2  0  0     2
#5        demand opec  0  2  0     2
#6        dlrs barrel  2  0  0     2
#7    effective today  1  0  1     2
#8  emergency meeting  0  2  0     2
#9      oil companies  1  1  0     2
#10      oil industry  0  2  0     2
#..               ... .. .. ..   ...
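For readers who want the two-column table the question asks for without building a term-document matrix, here is a minimal base R sketch. The helper name count_pairs and the sample sentences are hypothetical and not part of the answer above. It assumes words are separated by whitespace and counts pairs within each element, so a pair never spans two documents.

```r
# Hypothetical helper: count adjacent word pairs in a character vector.
count_pairs <- function(text) {
  pairs <- unlist(lapply(strsplit(text, "\\s+"), function(w) {
    w <- w[nzchar(w)]                        # drop empty tokens (e.g. from leading spaces)
    if (length(w) < 2) return(character(0))  # a single word forms no pair
    paste(head(w, -1), tail(w, -1))          # join each word to its successor
  }))
  out <- as.data.frame(table(pairs), stringsAsFactors = FALSE)
  names(out) <- c("pair", "count")
  out <- out[order(-out$count, out$pair), ]  # most frequent pairs first
  rownames(out) <- NULL
  out
}

# Made-up sample input, not taken from the crude data set.
docs <- c("crude oil prices fell", "oil prices rose as crude oil demand grew")
res <- count_pairs(docs)
print(res)
```

The result has one row per distinct pair, so head(res, 10) would give the ten most common pairs.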

score 1 · Accepted Answer

One idea here is to create a new corpus with bigrams.

A bigram or digram is every sequence of two adjacent elements in a string of tokens.

A recursive function to extract bigrams:

bigram <- 
  function(xs){
    if (length(xs) >= 2) 
       c(paste(xs[seq(2)],collapse='_'),bigram(tail(xs,-1)))

  }
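The recursion above calls itself once per token, so very long token vectors could run into R's nesting limits. As a sketch of an alternative, the same pairs can be built in one vectorized step. The name bigram_vec is hypothetical, and unlike bigram() it returns character(0) rather than NULL for inputs shorter than two tokens.

```r
# Hypothetical non-recursive variant: pair each token with its successor.
bigram_vec <- function(xs) {
  if (length(xs) < 2) return(character(0))
  paste(head(xs, -1), tail(xs, -1), sep = "_")
}

tokens <- c("oil", "prices", "fell", "sharply")
bigram_vec(tokens)
# "oil_prices" "prices_fell" "fell_sharply"
```

Because it takes a character vector and returns one, it should work as a drop-in replacement wherever bigram() is called on the output of strsplit().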

Then apply this to the crude data from the tm package. (I did some text cleaning here, but this step depends on the text.)

res <- unlist(lapply(crude,function(x){

  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning, compute the bigram frequencies using table
  freqs <- table(bigram(strsplit(x," ")[[1]]))
  freqs[freqs>1]
}))


 as.data.frame(tail(sort(res),5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5

The bigrams "abdul aziz" and "a futures" are the most common. You should re-clean the data to remove (of, the, ...). But this should be a good start.
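One way to do that re-cleaning, sketched here under assumptions, is to drop every bigram that contains a stop word rather than deleting the stop words before pairing. The stop-word list and tokens below are made up for illustration. Deleting "of" and "the" first would instead make "price" and "oil" adjacent and produce a pair that never occurs in the text.

```r
# Assumed, deliberately tiny stop-word list; a real list would be much longer.
stop_words <- c("a", "of", "the", "in")

# Made-up tokens for illustration.
tokens <- c("price", "of", "the", "oil", "futures", "in", "the", "market")

# Pair each token with its successor.
left  <- head(tokens, -1)
right <- tail(tokens, -1)

# Keep only the pairs in which neither word is a stop word.
keep <- !(left %in% stop_words | right %in% stop_words)
clean_bigrams <- paste(left[keep], right[keep], sep = "_")
clean_bigrams
# "oil_futures"
```

Which of the two behaviours is preferable depends on whether pairs should reflect strict adjacency in the original text.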

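Edit after OP comment:

If you want the bigram frequencies over the whole corpus, one idea is to compute the bigrams inside the loop and then compute the frequencies of the loop result. I also took the opportunity to add better text cleaning.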

res <- unlist(lapply(crude,function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words=c("the","of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]',' ',x)
  x <- gsub('  +','',x)
  ## after cleaning, split into words and extract the bigrams
  words <- strsplit(x," ")[[1]]
  bigrams <- bigram(words[nchar(words)>2])
}))

library(data.table)  # provides setDT()
xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]


#                 res Freq
#    1: abdulaziz_bin    1
#    2:  ability_hold    1
#    3:  ability_keep    1
#    4:  ability_sell    1
#    5:    able_hedge    1
# ---                   
# 2177:    last_month    6
# 2178:     crude_oil    7
# 2179:  oil_minister    7
# 2180:     world_oil    7
# 2181:    oil_prices   14
