r - Word stemming in R not working as expected

Question

I am trying to stem a very simple word in R and am getting something very unexpected. In the code below, the 'complete' variable is 'NA'. Why am I unable to complete the stem of the word easy?

library(tm) 
library(SnowballC)
dict <- c("easy")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)

Thanks!

score 1 · Accepted Answer

You can see the internals of the stemCompletion() function with tm:::stemCompletion.

function (x, dictionary, type = c("prevalent", "first", "longest", "none", "random", "shortest")){
if(inherits(dictionary, "Corpus")) 
  dictionary <- unique(unlist(lapply(dictionary, words)))
type <- match.arg(type)
possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))
switch(type, first = {
  setNames(sapply(possibleCompletions, "[", 1), x)
}, longest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x), 
      decreasing = TRUE))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
}, none = {
  setNames(x, x)
}, prevalent = {
  possibleCompletions <- lapply(possibleCompletions, function(x) sort(table(x), 
      decreasing = TRUE))
  n <- names(sapply(possibleCompletions, "[", 1))
  setNames(if (length(n)) n else rep(NA, length(x)), x)
}, random = {
  setNames(sapply(possibleCompletions, function(x) {
      if (length(x)) sample(x, 1) else NA
  }), x)
}, shortest = {
  ordering <- lapply(possibleCompletions, function(x) order(nchar(x)))
  possibleCompletions <- mapply(function(x, id) x[id], 
      possibleCompletions, ordering, SIMPLIFY = FALSE)
  setNames(sapply(possibleCompletions, "[", 1), x)
})

}

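The x argument holds the stemmed terms and dictionary holds the unstemmed ones. The only line that matters is the fifth, which does a simple regular expression match of each stemmed word against the list of dictionary terms.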

possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s",w), dictionary, value = TRUE))

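So it fails because the stem "easi" does not match "easy". If the dictionary also contains the word "easiest", both terms get a completion, because there is now a dictionary word whose first four letters match the stem.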

library(tm) 
library(SnowballC)
dict <- c("easy","easiest")
stem <- stemDocument(dict, language = "english")
complete <- stemCompletion(stem, dictionary=dict)
complete
     easi   easiest 
"easiest" "easiest"

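The prefix match behind this result can be checked on its own in base R, without loading tm. The following is a minimal sketch: find_completions is a hypothetical name used only for illustration, and its body is the grep() call quoted above.

```r
# Hypothetical wrapper around the grep() call from tm:::stemCompletion:
# keep every dictionary word that starts with the given stem.
find_completions <- function(stem, dictionary) {
  grep(sprintf("^%s", stem), dictionary, value = TRUE)
}

# "easy" does not start with "easi", so there is nothing to complete to
only_easy <- find_completions("easi", c("easy"))
print(only_easy)     # character(0)

# "easiest" does start with "easi", so it becomes the completion
with_easiest <- find_completions("easi", c("easy", "easiest"))
print(with_easiest)  # [1] "easiest"
```

An empty match set is what ends up as NA in the question's code.
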
score 0 · Accepted Answer

It looks like wordStem() is what does the stemming:

library(tm)
library(SnowballC)
dict <- c("easy")
wordStem(dict)
# [1] "easi"
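
Since the stem "easi" is not a prefix of "easy", one possible workaround is to stem the dictionary with the same stemmer and match stems exactly instead of by prefix. This is only a sketch under stated assumptions: complete_by_stem and toy_stem are hypothetical names, not part of tm or SnowballC, and toy_stem is a stand-in that knows only the two stems shown in this thread, so the block runs without extra packages. In practice one would presumably pass SnowballC's wordStem as stem_fun.

```r
# Hypothetical helper (not part of tm or SnowballC): complete each stem to the
# first dictionary word that produces exactly that stem.
complete_by_stem <- function(stems, dictionary, stem_fun) {
  dict_stems <- stem_fun(dictionary)
  # match() gives the index of the first equal dictionary stem, or NA if none
  setNames(dictionary[match(stems, dict_stems)], stems)
}

# Stand-in stemmer for illustration only; it hard-codes the stems reported in
# this thread and returns any other word unchanged.
toy_stem <- function(words) {
  known <- c(easy = "easi", easiest = "easiest")
  unname(ifelse(words %in% names(known), known[words], words))
}

dict <- c("easy", "easiest")
completed <- complete_by_stem(toy_stem(dict), dict, toy_stem)
print(completed)
```

With this approach "easi" maps back to "easy" rather than to "easiest", at the cost of needing the dictionary to contain a word with exactly that stem.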
