
According to the WMD paper, the method is inspired by the word2vec model and uses the word2vec vector space to move document 1 towards document 2 (in the context of the Earth Mover's Distance metric). From the paper:

Assume we are provided with a word2vec embedding matrix
$X \in \mathbb{R}^{d \times n}$ for a finite size vocabulary of $n$ words. The
$i$-th column, $x_i \in \mathbb{R}^d$, represents the embedding of the $i$-th
word in $d$-dimensional space. We assume text documents
are represented as normalized bag-of-words (nBOW) vectors,
$d \in \mathbb{R}^n$. To be precise, if word $i$ appears $c_i$ times in
the document, we denote $d_i = c_i / \sum_{j=1}^{n} c_j$. An nBOW vector
$d$ is naturally very sparse as most words will not appear in
any given document. (We remove stop words, which are
generally category independent.)
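
To make the nBOW definition concrete, here is a minimal sketch in plain Python. The helper name `nbow` and the toy vocabulary are assumptions chosen for illustration; they are not taken from the paper or from Gensim.

```python
from collections import Counter


def nbow(tokens, vocabulary):
    """Return the nBOW vector of `tokens` over `vocabulary`.

    Entry i is c_i / sum_j c_j, where c_i counts how often word i occurs.
    """
    counts = Counter(tokens)
    total = sum(counts[word] for word in vocabulary)
    if total == 0:
        # No in-vocabulary words: return an all-zero vector.
        return [0.0] * len(vocabulary)
    return [counts[word] / total for word in vocabulary]


# Hypothetical four-word vocabulary and a short tokenized document.
vocabulary = ["obama", "speaks", "media", "illinois"]
d = nbow(["obama", "speaks", "obama", "media"], vocabulary)
print(d)
```

The entries sum to 1, and any vocabulary word that is missing from the document gets a zero, which is why the vector is sparse for a realistic vocabulary.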

I understand the concept from the paper, but I could not work out from the Gensim code how WMD uses the word2vec embedding space.

Can someone explain it in a simple way? I could not see where in this code the word2vec embedding matrix is used, so does it compute the word vectors in a different way?

The WMD function from Gensim:

def wmdistance(self, document1, document2):
    # Remove out-of-vocabulary words.
    len_pre_oov1 = len(document1)
    len_pre_oov2 = len(document2)
    document1 = [token for token in document1 if token in self]
    document2 = [token for token in document2 if token in self]

    dictionary = Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)

    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)

    # Compute distance matrix.
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

    def nbow(document):
        d = zeros(vocab_len, dtype=double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)

    # Compute WMD.
    return emd(d1, d2, distance_matrix)
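
One way to trace where the embeddings enter is to look at the distance-matrix loop in isolation. In Gensim's `KeyedVectors`, indexing with a word (`self[t1]`) returns that word's vector, so the line computing `distance_matrix[i, j]` is where the word2vec space is consulted. The sketch below mimics only that loop in plain Python; the dictionary `toy_vectors` and the helper `word_distance` are hypothetical stand-ins for a trained model, not part of Gensim.

```python
import math

# Hypothetical 2-dimensional embeddings standing in for a trained word2vec model.
toy_vectors = {
    "king": [1.0, 0.0],
    "queen": [1.0, 1.0],
    "apple": [4.0, 4.0],
}


def word_distance(t1, t2, vectors):
    """Euclidean distance between two word vectors.

    Plays the role of sqrt(np_sum((self[t1] - self[t2])**2)) in the function above.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vectors[t1], vectors[t2])))


doc1 = ["king"]
doc2 = ["queen", "apple"]

# Shared vocabulary of both documents, in a fixed order.
vocab = sorted(set(doc1) | set(doc2))  # ['apple', 'king', 'queen']

# Fill only the entries for (word in doc1, word in doc2); the rest stay 0.
distance_matrix = [[0.0] * len(vocab) for _ in vocab]
for i, t1 in enumerate(vocab):
    for j, t2 in enumerate(vocab):
        if t1 not in doc1 or t2 not in doc2:
            continue
        distance_matrix[i][j] = word_distance(t1, t2, toy_vectors)

for row in distance_matrix:
    print(row)
```

Under these assumptions only the row for "king" is filled: its distance to "apple" is 5.0 and to "queen" is 1.0. The transport step (`emd`) then combines these pairwise costs with the two nBOW vectors.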