WMD paperによると、これは word2vec モデルに触発され、word2vec ベクトル空間を使用してドキュメント 1 をドキュメント 2 に移動します (Earth Mover Distance メトリックのコンテキストで)。紙から:
Assume we are provided with a word2vec embedding matrix
X ∈ Rd×n for a finite size vocabulary of n words. The
ith column, xi ∈ Rd, represents the embedding of the ith
word in d-dimensional space. We assume text documents
are represented as normalized bag-of-words (nBOW) vectors,
d ∈ Rn. To be precise, if word i appears ci times in
the document, we denote di = ci/cj (for j=1 to n). An nBOW vector
d is naturally very sparse as most words will not appear in
any given document. (We remove stop words, which are
generally category independent.)
論文の概念は理解できましたが、Gensim のコードから wmd が word2vec 埋め込みスペースを使用する方法を理解できませんでした。
誰かが簡単に説明できますか?このコードのどこで word2vec 埋め込み行列が使用されているか理解できなかったため、別の方法で単語ベクトルを計算しますか?
def wmdistance(self, document1, document2):
# Remove out-of-vocabulary words.
len_pre_oov1 = len(document1)
len_pre_oov2 = len(document2)
document1 = [token for token in document1 if token in self]
document2 = [token for token in document2 if token in self]
dictionary = Dictionary(documents=[document1, document2])
vocab_len = len(dictionary)
# Sets for faster look-up.
docset1 = set(document1)
docset2 = set(document2)
# Compute distance matrix.
distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
for i, t1 in dictionary.items():
for j, t2 in dictionary.items():
if t1 not in docset1 or t2 not in docset2:
continue
# Compute Euclidean distance between word vectors.
distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))
def nbow(document):
d = zeros(vocab_len, dtype=double)
nbow = dictionary.doc2bow(document) # Word frequencies.
doc_len = len(document)
for idx, freq in nbow:
d[idx] = freq / float(doc_len) # Normalized word frequencies.
return d
# Compute nBOW representation of documents.
d1 = nbow(document1)
d2 = nbow(document2)
# Compute WMD.
return emd(d1, d2, distance_matrix)