r - Parsing strings and reassembling them later

Question

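I am trying to parse strings into their parts, check whether each part exists in a separate vocabulary, and then reassemble only those strings whose parts are all in the vocabulary. The vocabulary is a vector of words, created separately from the strings I want to compare. The ultimate goal is a data frame containing only the strings whose word parts are all in the vocabulary.

I have written code to parse the data into strings, but I don't know how to do the comparison. If you think parsing the data is not the best solution, please let me know.

Here is an example. Say I have three strings: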

"The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue"

My vocabulary consists of the following words:

cat,    **the**,    **elephant**,    hippo,
**in**,    run,    **is**,    bike,
walk,    **room, is, blue, cannot**

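In this case, I would select only the first and third strings, because every word in them matches a word in the vocabulary. I would not select the second string, because the words "dog" and "swim" are not in the vocabulary.

Thank you!

As requested, here is the code I have written so far to clean the strings and parse them into unique words.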

animals <- c("The elephant in the room is blue", "The dog cannot swim", "The cat is blue")

animals2 <- toupper(animals)
animals2 <- gsub("[[:punct:]]", " ", animals2)
animals2 <- gsub("(^ +)|( +$)|(  +)", " ", animals2)

## Parse the characters and select unique words only
animals2 <- unlist(strsplit(animals2," "))
animals2 <- unique(animals2)

score 3 · Accepted Answer

Here is how I would do it:

Read the data
Clean vocab to remove extra spaces and *
Loop over the strings using setdiff

My code is below:

## read your data
tt <- c("The elephant in the room is blue",
"The dog cannot swim",
"The cat is blue")
vocab <- scan(textConnection('cat,    **the**,    **elephant**,    hippo,
**in**,    run,    **is**,    bike,
walk,    **room, is, blue, cannot**'),sep=',',what='char')
## polish vocab
vocab <- gsub('\\s+|[*]+','',vocab)
vocab <- vocab[nchar(vocab) >0]
## check each string: TRUE if every one of its words is in vocab
sapply(tt, function(x){
    x.words <- tolower(unlist(strsplit(x, ' '))) ## lower-case so "The" matches "the"
    length(setdiff(x.words, vocab)) == 0
})
The elephant in the room is blue              The dog cannot swim                  The cat is blue 
                            TRUE                            FALSE                             TRUE
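
The question asks for a data frame rather than a logical vector, so here is a minimal sketch of that last step. It is one possible way to do it, not part of the original answer: the helper name all_in_vocab and the column name text are made up for illustration, vocab is typed out by hand as it should look after the cleaning step above, and %in% is used in place of setdiff for the same membership test. Punctuation is stripped first, as in the question's own code.

```r
# Strings to filter (same as in the question)
tt <- c("The elephant in the room is blue",
        "The dog cannot swim",
        "The cat is blue")

# Assumption: vocab as it looks after the cleaning step, with the duplicate "is" dropped
vocab <- c("cat", "the", "elephant", "hippo", "in", "run", "is",
           "bike", "walk", "room", "blue", "cannot")

# Hypothetical helper: TRUE when every word of x (lower-cased) is in vocab
all_in_vocab <- function(x, vocab) {
  x <- gsub("[[:punct:]]", " ", x)                  # drop punctuation
  words <- tolower(unlist(strsplit(x, "[[:space:]]+")))
  words <- words[nzchar(words)]                     # drop empty pieces
  all(words %in% vocab)
}

# One logical per string, then keep only the matching strings
keep <- vapply(tt, all_in_vocab, logical(1), vocab = vocab, USE.NAMES = FALSE)
result <- data.frame(text = tt[keep], stringsAsFactors = FALSE)
result
```

With the example data, result should hold the first and third strings only. Filtering the original strings with a logical index avoids having to split and re-paste them, which is why no explicit "reassembly" step is needed here.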
