r - Substring: get the words around a keyword

Question

If I have a string:

moon <- "The cow jumped over the moon with a silver plate in its mouth"

Is there a way to extract the words near "moon"? The neighborhood could be the 2 or 3 words on either side of "moon".

So if my string is

"The cow jumped over the moon with a silver plate in its mouth"

I just want my output to be:

"jumped over the moon with a silver"

I know I can use str_locate if I want to extract by character position, but I don't know how to do it by "words". Can this be done in R?

Thanks in advance, Simak

score 4 · Accepted Answer

Use strsplit:

x <- strsplit(moon, " ")[[1]]   # split the string from the question into words
i <- which(x == "moon")         # position of the keyword
paste(x[seq(max(1, (i-2)), min((i+2), length(x)))], collapse= " ")
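The same idea can be wrapped in a small helper that also copes with a keyword that is missing or occurs more than once. This is only a sketch: the name `words_around` and its arguments are made up for illustration and are not part of the answer above.

```r
# Hypothetical helper (not from the answer): return the `n` words on either
# side of each occurrence of `keyword` in a single string `s`.
words_around <- function(s, keyword, n = 2) {
  x <- strsplit(s, " ", fixed = TRUE)[[1]]   # split into words
  hits <- which(x == keyword)                # every position of the keyword
  # one window per hit; zero hits yields character(0)
  vapply(hits, function(i) {
    paste(x[max(1, i - n):min(i + n, length(x))], collapse = " ")
  }, character(1))
}

moon <- "The cow jumped over the moon with a silver plate in its mouth"
words_around(moon, "moon", 2)
# [1] "over the moon with a"
words_around(moon, "moon", 3)
# [1] "jumped over the moon with a silver"
```

With `n = 3` this reproduces the output asked for in the question. The windows are clipped at the start and end of the string, so a keyword near either edge simply gets fewer neighbours.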

score 4 · Accepted Answer

Here's how I'd do it:

keyword <- "moon"
lookaround <- 2
pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword, 
                "( [[:alpha:]]+){0,", lookaround, "}")

regmatches(moon, regexpr(pattern, moon))[[1]]
# [1] "over the moon with a"

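Idea: match a run of letters followed by a space, repeated at least 0 and at most "lookaround" (here 2) times, then the "keyword" (here "moon"), then a space followed by a run of letters, again repeated between 0 and "lookaround" times. The regexpr function gives the start position and length of this match. Wrapping it in regmatches then fetches the substring at that position.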
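To see what the pieces look like, the pattern construction can be pulled into a function and the raw regexpr result inspected. The helper name `make_pattern` is an assumption introduced here for illustration; the pattern itself is the one from this answer.

```r
# Hypothetical wrapper around the answer's paste0() call
make_pattern <- function(keyword, lookaround) {
  paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword,
         "( [[:alpha:]]+){0,", lookaround, "}")
}

make_pattern("moon", 2)
# [1] "([[:alpha:]]+ ){0,2}moon( [[:alpha:]]+){0,2}"

moon <- "The cow jumped over the moon with a silver plate in its mouth"
m <- regexpr(make_pattern("moon", 3), moon)
as.integer(m)              # character offset where the match starts
attr(m, "match.length")    # number of characters matched
regmatches(moon, m)
# [1] "jumped over the moon with a silver"
```

Note that the keyword is pasted into the pattern as-is, so it would also match inside a longer word such as "moonlight", and a keyword containing regex metacharacters would need escaping first.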

Note: regexpr can be replaced with gregexpr if you want to find multiple occurrences of the same pattern.
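A minimal sketch of the gregexpr variant, using a made-up sentence (`txt` is not from the question) in which the keyword appears twice:

```r
pattern <- "([[:alpha:]]+ ){0,2}moon( [[:alpha:]]+){0,2}"

# Illustrative input with two occurrences of the keyword
txt <- "The moon rose early and the cow jumped over the moon again tonight"

# gregexpr returns one set of match positions per input string, so
# regmatches returns a list; [[1]] picks the matches for the first string
regmatches(txt, gregexpr(pattern, txt))[[1]]
# [1] "The moon rose early"         "over the moon again tonight"
```

Matches returned by gregexpr do not overlap, so when two occurrences of the keyword sit close together the second window may come back with fewer leading words than "lookaround" allows.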

Here's a benchmark on a large input, compared with Hong's answer:

str <- "The cow jumped over the moon with a silver plate in its mouth" 
ll <- rep(str, 1e5)
hong <- function(str) {
    str <- strsplit(str, " ")
    sapply(str, function(y) {
        i <- which(y=="moon")
        paste(y[seq(max(1, (i-2)), min((i+2), length(y)))], collapse= " ")
    })
}

arun <- function(str) {
    keyword <- "moon"
    lookaround <- 2
    pattern <- paste0("([[:alpha:]]+ ){0,", lookaround, "}", keyword, 
                    "( [[:alpha:]]+){0,", lookaround, "}")

    regmatches(str, regexpr(pattern, str))
}

require(microbenchmark)
microbenchmark(t1 <- hong(ll), t2 <- arun(ll), times=10)
# Unit: seconds
#            expr      min       lq   median       uq      max neval
#  t1 <- hong(ll) 6.172986 6.384981 6.478317 6.654690 7.193329    10
#  t2 <- arun(ll) 1.175950 1.192455 1.200674 1.227279 1.326755    10

identical(t1, t2) # [1] TRUE

score 2 · Accepted Answer

Here's an approach using the tm package (when all you have is a hammer...)

moon <- "The cow jumped over the moon with a silver plate in its mouth"

require(tm)
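# VCorpus rather than Corpus: recent tm versions ignore a custom tokenizer
# for the SimpleCorpus that Corpus() can return
my.corpus <- VCorpus(VectorSource(moon))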
# Tokenizer for n-grams and passed on to the term-document matrix constructor
library(RWeka)
neighborhood  <- 3 # how many words either side of word of interest
neighborhood1 <- 2 + neighborhood  * 2 
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = neighborhood1, max = neighborhood1))
dtm <- TermDocumentMatrix(my.corpus, control = list(tokenize = ngramTokenizer))
inspect(dtm)

#  find ngrams that have the word of interest in them
word <- 'moon'
subset_ngrams <- dtm$dimnames$Terms[grep(word, dtm$dimnames$Terms)]

# keep only the ngram that has exactly `neighborhood` words before the
# word of interest. This removes the overlapping ngrams and lets us see
# what's on either side of the word of interest

subset_ngrams <- subset_ngrams[sapply(subset_ngrams, function(i) {
  tmp <- unlist(strsplit(i, split=" "))
  tmp <- tmp[neighborhood + 1]  # the word right after the first `neighborhood` words
  tmp} == word)]

# inspect output
subset_ngrams
# [1] "jumped over the moon with a silver plate"
