python - Return the document most similar to a query document using cosine similarity in Python

Question

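I have a set of files and a query doc. My goal is to compare each document against the query doc and return the most similar documents. To use cosine similarity, I first need to map the document strings to vectors. I have already written a tf-idf function that is computed for each document.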

To get the index of each string, I have this function:

def getvectorKeywordIndex(self, documentList):
    """Create the keyword associated to the position of the elements within the document vectors."""
    # Join all documents into a single string, then split it into unique words
    vocabularyString = " ".join(documentList)
    # split() without arguments splits on any whitespace and yields no empty tokens
    vocabularylist = vocabularyString.split()
    vocabularylist = list(set(vocabularylist))
    print('vocabularylist: %s' % vocabularylist)
    vectorIndex = {}
    offset = 0
    # Associate each keyword with a position, i.e. the dimension of the vector that represents this word
    for word in vocabularylist:
        vectorIndex[word] = offset
        offset += 1
    print(vectorIndex)
    return vectorIndex, vocabularylist  # (keyword: position), vocabularylist

For cosine similarity, my function is:

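def cosine_distance(self, index, queryDoc):
    # Needs "import math" and "import numpy" at module level.
    # Despite the name, this returns the cosine similarity of the two vectors.
    vector1 = self.makeVector(index)
    vector2 = self.makeVector(queryDoc)

    return numpy.dot(vector1, vector2) / (math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2)))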

The TF-IDF function is:

def tfidf(self, term, key):
    return self.tf(term, key) * self.idf(term)

My problem is: how can I write makeVector using the index, the vocabulary list, and the tf-idf function? Any answer is welcome.

score 2 · Accepted Answer

You should pass vectorIndex to makeVector as well, and use it to look up the indices of terms in documents and queries. Ignore terms that do not appear in vectorIndex.
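As a rough illustration of that advice, here is a minimal sketch written for Python 3 as standalone functions rather than methods. The names make_vector, cosine_similarity, and raw_count are hypothetical, and raw_count is only a stand-in for the question's tfidf method, whose tf and idf helpers are not shown.

```python
import math
import numpy

def make_vector(text, vector_index, weight):
    """Map a whitespace-tokenized string onto the dimensions in vector_index.

    weight(term, text) is a placeholder for the question's tfidf method.
    """
    vector = numpy.zeros(len(vector_index))
    for term in set(text.split()):
        position = vector_index.get(term)
        if position is None:
            continue  # term is not in the vocabulary: ignore it
        vector[position] = weight(term, text)
    return vector

def cosine_similarity(vector1, vector2):
    norm = math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2))
    if norm == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return numpy.dot(vector1, vector2) / norm

def raw_count(term, text):
    # Assumption: plain term counts instead of real tf-idf weights
    return text.split().count(term)

documents = ["the cat sat", "the dog barked"]
vector_index = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "barked": 4}

# "quietly" is not in vector_index, so it is skipped
query = make_vector("cat sat quietly", vector_index, raw_count)
scores = [cosine_similarity(make_vector(d, vector_index, raw_count), query)
          for d in documents]
best = documents[scores.index(max(scores))]
print(best)
```

Because the query goes through the same vector_index as the documents, every vector has the same length and the dot product lines up term by term.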

Mind you, when dealing with documents you should really be using scipy.sparse matrices instead of Numpy arrays, or you'll quickly run out of memory.
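One possible way to do this is sketched below: each document becomes a row of a scipy.sparse CSR matrix, so only nonzero weights are stored. The helper names make_sparse_rows, cosine_scores, and raw_count are hypothetical, and raw_count again stands in for the question's tfidf method.

```python
import numpy
from scipy.sparse import csr_matrix

def make_sparse_rows(texts, vector_index, weight):
    """Build one sparse row per text; terms missing from vector_index are skipped."""
    data, rows, cols = [], [], []
    for row, text in enumerate(texts):
        for term in set(text.split()):
            col = vector_index.get(term)
            if col is not None:
                data.append(float(weight(term, text)))
                rows.append(row)
                cols.append(col)
    return csr_matrix((data, (rows, cols)), shape=(len(texts), len(vector_index)))

def cosine_scores(doc_matrix, query_row):
    """Cosine similarity of every row in doc_matrix against a single query row."""
    dots = doc_matrix.dot(query_row.T).toarray().ravel()
    doc_norms = numpy.sqrt(
        numpy.asarray(doc_matrix.multiply(doc_matrix).sum(axis=1)).ravel())
    query_norm = numpy.sqrt(query_row.multiply(query_row).sum())
    denominators = doc_norms * query_norm
    scores = numpy.zeros_like(dots)
    nonzero = denominators > 0  # avoid dividing by zero for empty rows
    scores[nonzero] = dots[nonzero] / denominators[nonzero]
    return scores

def raw_count(term, text):
    # Assumption: plain term counts instead of real tf-idf weights
    return text.split().count(term)

documents = ["the cat sat", "the dog barked"]
vector_index = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "barked": 4}

doc_matrix = make_sparse_rows(documents, vector_index, raw_count)
query_row = make_sparse_rows(["cat sat quietly"], vector_index, raw_count)
scores = cosine_scores(doc_matrix, query_row)
best_index = int(numpy.argmax(scores))
print(documents[best_index])
```

With a realistic vocabulary of tens of thousands of terms, each document touches only a small fraction of the columns, which is where the memory saving comes from.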

(Alternatively, consider using the Vectorizer in scikit-learn, which handles all of this for you, uses scipy.sparse matrices, and computes tf-idf values. Disclaimer: I wrote parts of that class.)
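A short sketch of the scikit-learn route follows. It assumes the class meant here corresponds to TfidfVectorizer in current scikit-learn releases, and the sample documents and query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up sample data
documents = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "stock prices fell sharply",
]
query = "a cat on a mat"

vectorizer = TfidfVectorizer()
# Learn the vocabulary and idf weights from the documents (sparse result)
doc_matrix = vectorizer.fit_transform(documents)
# Reuse the same vocabulary for the query; unseen terms are dropped
query_row = vectorizer.transform([query])

scores = cosine_similarity(query_row, doc_matrix).ravel()
best = documents[scores.argmax()]
print(best)
```

Calling transform (not fit_transform) on the query is the important design choice: it keeps the query in the same vector space as the documents, which is the role vectorIndex plays in the hand-written version.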
