r - Performance issue when matching a list of words against a list of sentences in R

Question

I am trying to match a list of words against a list of sentences and build a data frame from the matching words and sentences. For example:

words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")

The expected result (a data frame) is:

sentences                                               words
This document is far better                               better
This is a great app                                       great
The night skies were sombre and starless                  sombre 
The app is too good and i am happy using it               good, happy
This is how it works                                      -

To achieve this, I am using the following code:

lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y>0]$x
neg.words <- polarity_table[polarity_table$y<0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for(i in 1:lengthOfData){
        sentence <- sentence_df[i,]$comment
        #sentence <- gsub('[[:punct:]]', "", sentence)
        #sentence <- gsub('[[:cntrl:]]', "", sentence)
        #sentence <- gsub('\\d+', "", sentence)
        sentence <- tolower(sentence)
        # get  unigrams  from the sentence
        unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

        # get bigrams from the sentence
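        # seq_len() is used because 1:length(unigrams)-1 parses as (1:n)-1, i.e. 0:(n-1)
        bigrams <- unlist(lapply(seq_len(length(unigrams) - 1), function(i) {paste(unigrams[i], unigrams[i+1])} ))

        # .. and combine into a single character vector
        words <- c(unigrams, bigrams)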
        #if(sentence_df[i,]$ave_sentiment)

        pos.matches <- match(words, pos.words)
        neg.matches <- match(words, neg.words)
        pos.matches <- na.omit(pos.matches)
        neg.matches <- na.omit(neg.matches)
        positiveList <- pos.words[pos.matches]
        negativeList <- neg.words[neg.matches]

        if(length(positiveList)==0){
          positiveList <- c("-")
        }
        if(length(negativeList)==0){
          negativeList <- c("-")
        }
        negativeWordsList[i]<- paste(as.character(unique(negativeList)), collapse=", ")
        positiveWordsList[i]<- paste(as.character(unique(positiveList)), collapse=", ")

        positiveWordsList[i] <- sapply(positiveWordsList[i], function(x) toString(x))
        negativeWordsList[i] <- sapply(negativeWordsList[i], function(x) toString(x))

    }    
positiveWordsList <- as.vector(unlist(positiveWordsList))
negativeWordsList <- as.vector(unlist(negativeWordsList))
scores.df <- data.frame(ave_sentiment=sentence_df$ave_sentiment, comment=sentence_df$comment,pos=positiveWordsList,neg=negativeWordsList, year=sentence_df$year,month=sentence_df$month,stringsAsFactors = FALSE)

I have 28,000 sentences and 65,000 words to match. The code above takes 45 seconds to complete the task. Since the current approach takes a lot of time, are there any suggestions for improving its performance?
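The unigram and bigram step can also be written without an index loop. The following is only a sketch using base R on one of the example sentences; the names candidates and found are illustrative and do not appear in the original code.

```r
words <- c("far better", "good", "great", "sombre", "happy")
sentence <- tolower("This document is far better")

# Split on single spaces, as in the loop above
unigrams <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Pair each token with its successor; head/tail avoid an explicit index
bigrams <- if (length(unigrams) > 1) {
  paste(head(unigrams, -1), tail(unigrams, -1))
} else {
  character(0)
}

# Candidate terms for this sentence, then the words found among them
candidates <- c(unigrams, bigrams)
found <- words[words %in% candidates]
```

Because paste() is vectorised, the bigrams are built in one call, and a one-word sentence yields no bigrams at all.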

Edit:

I only want the words that exactly match a whole word in the sentence. For example:

words <- c('sin','vice','crashes') 
sentences <- ('Since the app crashes frequently, I advice you guys to fix the issue ASAP')

In the case above, the output should be:

sentences                                                           words
Since the app crashes frequently, I advice you guys to fix        crashes
the issue ASAP
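One way to read the whole-word requirement is to tokenise each sentence and test membership in the word list. The sketch below assumes that punctuation can be stripped and case ignored; match_whole_words is a hypothetical helper, not something from the original post, and it handles single-word terms only.

```r
# Hypothetical helper: whole-word matching via tokenisation
match_whole_words <- function(sentences, words) {
  vapply(sentences, function(s) {
    # Lower-case, drop punctuation, then split on runs of whitespace
    tokens <- strsplit(gsub("[[:punct:]]", "", tolower(s)), "\\s+")[[1]]
    hits <- words[words %in% tokens]
    # Use "-" when nothing matches, as in the expected output
    if (length(hits) == 0) "-" else toString(hits)
  }, character(1), USE.NAMES = FALSE)
}

words <- c("sin", "vice", "crashes")
sentences <- "Since the app crashes frequently, I advice you guys to fix the issue ASAP"
result <- match_whole_words(sentences, words)
```

Here "sin" and "vice" are rejected because they occur only inside "since" and "advice", while "crashes" is a token in its own right.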

score 1 · Accepted Answer

I was able to use @David Arenburg's answer with a few modifications. Here is what I did. I used the following (suggested by David) to form the data frame:

library(stringi)  # provides stri_detect_fixed
df <- data.frame(sentences)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
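To see why this step alone over-matches, here is a small sketch on the example from the question's edit. It assumes the stringi package is installed, and it lower-cases the sentence first (an assumption; the line above does not do this) so that "sin" can match inside "Since".

```r
library(stringi)

words <- c("sin", "vice", "crashes")
sentences <- "Since the app crashes frequently, I advice you guys to fix the issue ASAP"

# One logical per word: TRUE if the word occurs anywhere in the sentence,
# including inside a longer word
hits <- stri_detect_fixed(tolower(sentences), words)
matched <- words[hits]
```

All three words are reported, because fixed-pattern detection looks for substrings rather than whole words.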

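The problem with the above approach is that it does not do exact whole-word matching. To filter out the words that do not exactly match a word in the sentence, I first expanded the data frame to one row per matched word, where s holds the matched words for each sentence: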

df <- data.frame(fil=unlist(s),text=rep(df$sentence, sapply(s, FUN=length)))
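The answer does not show how s is built. A minimal sketch, assuming s comes from splitting the comma-separated words column back into one vector per sentence (the name long and the sample rows are illustrative):

```r
df <- data.frame(
  sentences = c("This is a great app",
                "The app is too good and i am happy using it"),
  words = c("great", "good, happy"),
  stringsAsFactors = FALSE
)

# Assumption: s is a list with one character vector of matches per sentence
s <- strsplit(df$words, ", ", fixed = TRUE)

# Repeat each sentence once per matched word to get one row per pair
long <- data.frame(fil  = unlist(s),
                   text = rep(df$sentences, lengths(s)),
                   stringsAsFactors = FALSE)
```

The second sentence appears twice in long, once for "good" and once for "happy".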

After applying the line above, the output data frame changes as follows:

sentences                                                      words
This document is far better                                    better
This is a great app                                            great
The night skies were sombre and starless                       sombre 
The app is too good and i am happy using it                    good
The app is too good and i am happy using it                    happy
This is how it works                                            -
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 crashes
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 vice
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 sin

Next, I applied the following filter to the data frame to remove the words that do not exactly match a word in the sentence:

df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
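The same filter can be sketched with mapply instead of a row-wise apply, which avoids converting the whole data frame to a character matrix. The data frame long and the vector keep are illustrative names, and the sample rows are taken from the example in the question's edit.

```r
long <- data.frame(
  fil  = c("sin", "vice", "crashes"),
  text = rep("Since the app crashes frequently, I advice you guys to fix the issue ASAP", 3),
  stringsAsFactors = FALSE
)

# Split each sentence into lower-case tokens on runs of whitespace
tokens <- strsplit(tolower(long$text), "\\s+")

# Keep a row only if its word equals one of the sentence's tokens
keep <- mapply(function(w, tok) tolower(w) %in% tok,
               long$fil, tokens, USE.NAMES = FALSE)
filtered <- long[keep, ]
```

Note that splitting on whitespace leaves punctuation attached (for example "frequently,"), so a word that sits directly before a comma or full stop would not pass this filter unless punctuation is stripped first.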

The resulting data frame looks like this:

    sentences                                                      words
    This document is far better                                    better
    This is a great app                                            great
    The night skies were sombre and starless                       sombre 
    The app is too good and i am happy using it                    good
    The app is too good and i am happy using it                    happy
    This is how it works                                            -
    Since the app crashes frequently, I advice you guys to fix     
    the issue ASAP                                                 crashes

stri_detect_fixed reduced the computation time considerably, and the rest of the process did not take much time. Thanks to @David for pointing me in the right direction.
