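I have text data consisting of 8 questions that were asked of a number of participants twice. I want to use text2vec to compare the similarity of their answers to these questions at the two points in time (duplicate detection). Here is how my initial data is structured (in this example there are only 3 participants, 4 questions instead of 8, and 2 quarters as the time periods). For each participant, I want to compare the similarity of the first-quarter answer with the second-quarter answer. I intend to use the psim2 function from the text2vec package to do this.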
library(text2vec)
library(dplyr)
df<-read.table(text="ID,Quarter,Question,Answertext
Joy,1,And another question,adsfjasljsdaf jkldfjkl
Joy,2,And another question,dsadsj jlijsad jkldf
Paul,1,And another question,adsfj aslj sd afs dfj ksdf
Paul,2,And another question,dsadsj jlijsad
Greg,1,And another question,adsfjasljsdaf
Greg,2,And another question, asddsf asdfasd sdfasfsdf
Joy,1,this is the first question that was asked,this is joys answer to this question
Joy,2,this is the first question that was asked,this is joys answer to this question
Paul,1,this is the first question that was asked,this is Pauls answer to this question
Paul,2,this is the first question that was asked,Pauls answer is different
Greg,1,this is the first question that was asked,this is Gregs answer to this question nearly the same
Greg,2,this is the first question that was asked,this is Gregs answer to this question
Joy,1,This is the text of another question,more random text
Joy,2,This is the text of another question, adkjjlj;ds sdafd
Paul,1,This is the text of another question,more random text
Paul,2,This is the text of another question, adkjjlj;ds sdafd
Greg,1,This is the text of another question,more random text
Greg,2,This is the text of another question,sdaf asdfasd asdff
Joy,1,this was asked second.,some random text
Joy,2,this was asked second.,some random text that doesn't quite match joy's response the first time around
Paul,1,this was asked second.,some random text
Paul,2,this was asked second.,some random text that doesn't quite match Paul's response the first time around
Greg,1,this was asked second.,some random text
Greg,2,this was asked second.,ada dasdffasdf asdf asdfa fasd sdfadsfasd fsdas asdffasd
", header=TRUE,sep=',')
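vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 2) %>% vocab_vectorizer()
}
# questlist is not defined in this snippet; it is assumed to be a list of per-question data frames
test<-lapply(questlist, vocabmkr)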
Ultimately, I want a similarity score that tells me whether a participant is duplicating some or all of their first-quarter answer in the second quarter.
quest1 <- filter(df,Question=="this is the first question that was asked")
quest1vocab <- itoken(as.character(quest1$Answertext), ids=quest1$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
quest1q1<-filter(quest1,Quarter==1) # question 1, quarter 1
quest1q2<-filter(quest1,Quarter==2) # question 1, quarter 2
quest1q1<-itoken(as.character(quest1q1$Answertext),ids=quest1q1$ID) # tokenize question1 quarter 1
quest1q2<-itoken(as.character(quest1q2$Answertext),ids=quest1q2$ID) # tokenize question1 quarter 2
#now apply the vocabulary to the two matrices
quest1q1<-create_dtm(quest1q1,quest1vocab)
quest1q2<-create_dtm(quest1q2,quest1vocab)
similarity<-psim2(quest1q1, quest1q2, method="jaccard", norm="none") #row by row similarity.
b<-data.frame(ID=names(similarity),Similarity=similarity,row.names=NULL) #make dataframe of similarity scores
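To make clear what kind of score I am after, here is a minimal base-R sketch of Jaccard similarity between two answers. The helper name jaccard_tokens is made up for illustration, and it assumes the score is computed on the sets of distinct space-separated tokens; what psim2 returns can differ depending on the tokenizer and on vocabulary pruning.

```r
# Hypothetical helper: Jaccard similarity between the token sets of two answers.
# Assumption: tokens are split on single spaces and duplicates are ignored.
jaccard_tokens <- function(a, b) {
  ta <- unique(strsplit(a, " ", fixed = TRUE)[[1]])
  tb <- unique(strsplit(b, " ", fixed = TRUE)[[1]])
  length(intersect(ta, tb)) / length(union(ta, tb))
}

# Joy repeats her answer word for word, so the score is 1.
jaccard_tokens("this is joys answer to this question",
               "this is joys answer to this question")

# Paul shares 3 of 7 distinct tokens across both answers, so the score is 3/7.
jaccard_tokens("this is Pauls answer to this question",
               "Pauls answer is different")
```

A score near 1 would flag a likely copied answer, and a score near 0 a rewritten one.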
df1<-split.data.frame(df,df$Question) #now we have 4 dataframes in the list, 1 for each question
vocabmkr<-function(x) {
  itoken(as.character(x$Answertext), ids=x$ID) %>% create_vocabulary() %>% prune_vocabulary(term_count_min = 1) %>% vocab_vectorizer()
}
vocab<-lapply(df1,vocabmkr) #this gets us another list and in it are the 4 vocabularies.
dfqq<-split.data.frame(df,list(df$Question,df$Quarter)) #and now we have 8 items in the list - each item is a combination of question and quarter (4 questions over 2 quarters)
How do I apply the vocab list (which has 4 elements) to the dfqq list (which has 8 elements)?
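One direction that might work, shown only as a sketch: look up each question-by-quarter piece's vectorizer by its question text instead of by list position. The toy data frame and the names toy, make_vectorizer, by_question, by_qq, dtms, and scores are made up for illustration. The sketch also assumes that both quarters list the participants in the same order, since psim2 compares rows pairwise.

```r
library(text2vec)

# Toy data with the same columns as df (values are invented for this sketch).
toy <- data.frame(
  ID = rep(c("Joy", "Paul"), times = 4),
  Quarter = rep(c(1, 1, 2, 2), times = 2),
  Question = rep(c("first question", "second question"), each = 4),
  Answertext = c("this is joys answer", "this is pauls answer",
                 "this is joys answer", "pauls answer is different",
                 "some random text", "some random text",
                 "some random text again", "other words entirely"),
  stringsAsFactors = FALSE
)

# Build one vectorizer per question from all answers to that question.
make_vectorizer <- function(x) {
  it <- itoken(x$Answertext, ids = x$ID, progressbar = FALSE)
  vocab_vectorizer(create_vocabulary(it))
}

by_question <- split(toy, toy$Question)
vectorizers <- lapply(by_question, make_vectorizer)

# Pieces are named "<question>.<quarter>", e.g. "first question.1".
by_qq <- split(toy, list(toy$Question, toy$Quarter))

# Pick the vectorizer by the piece's question text, not by position.
dtms <- lapply(by_qq, function(d) {
  it <- itoken(d$Answertext, ids = d$ID, progressbar = FALSE)
  create_dtm(it, vectorizers[[d$Question[1]]])
})

# Compare quarter 1 with quarter 2, row by row, for every question.
scores <- lapply(names(by_question), function(q) {
  psim2(dtms[[paste(q, 1, sep = ".")]], dtms[[paste(q, 2, sep = ".")]],
        method = "jaccard", norm = "none")
})
names(scores) <- names(by_question)
```

Here scores would hold one named vector per question with one similarity value per participant. Whether this scales cleanly to the real data with 8 questions is untested.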