python - How to add tokens to a gensim dictionary

Question

I am using gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code:

from gensim import corpora, models, similarities

def constructModel(self, docTokens):
    """Given document tokens, construct the tf-idf and similarity models."""

    # Build the dictionary for the BOW (vector-space) model:
    # a mapping between words and their integer ids, i.e. (word_id, word_string) pairs.
    self.dictionary = corpora.Dictionary(docTokens)

    # Prune the dictionary: remove words that appear too infrequently or too frequently
    # (the pruning calls are currently commented out).
    print("dictionary size before filter_extremes:", self.dictionary)
    # self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    # self.dictionary.compactify()
    print("dictionary size after filter_extremes:", self.dictionary)

    # Build the corpus BOW vectors; a BOW vector is a list of (word_id, word_frequency) pairs.
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]

    # Build the tf-idf model.
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    # Transform each raw BOW vector in the corpus into the tf-idf model's vector space.
    corpus_tfidf = self.model[corpus_bow]
    # Build the term-document similarity index.
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)

My question is how to update this dictionary by adding new documents (tokens) to it. I searched the gensim documentation but could not find a solution.

score 7 · Accepted Answer

The gensim website has documentation on how to do this.

The approach is to create a second dictionary from the new documents and then merge the two.

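from gensim import corpora

# firstDocs and moreDocs are placeholders: each is a list of documents,
# and each document is a list of tokens.
dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)  # updates dict1 in place with the tokens from dict2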

According to the documentation, this maps "the same tokens to the same ids and new tokens to new ids".
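The quoted rule can be made concrete with a plain-Python sketch. This is not gensim's implementation and does not call gensim at all; the function name `merge_token_ids` and the sample tokens are hypothetical, chosen only to illustrate how shared tokens keep their ids while unseen tokens get the next free id.

```python
def merge_token_ids(base, other):
    """Merge the token->id map `other` into `base` in place.

    Returns a map from each id used in `other` to the id the same
    token has in the merged `base`. Illustration only, not gensim code.
    """
    old2new = {}
    for token, other_id in other.items():
        if token not in base:
            base[token] = len(base)  # unseen token: assign the next free id
        old2new[other_id] = base[token]  # shared token: reuse the existing id
    return old2new


base = {"apple": 0, "banana": 1}
other = {"banana": 0, "cherry": 1}
old2new = merge_token_ids(base, other)

print(base)     # {'apple': 0, 'banana': 1, 'cherry': 2}
print(old2new)  # {0: 1, 1: 2}
```

The returned id map matters because bag-of-words vectors built with the second dictionary use its old ids and have to be translated before they can be mixed with vectors built from the first one.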

score 2 · Accepted Answer

You can use the add_documents method as follows:

from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)

Running the code above prints:

Dictionary(2 unique tokens: ['aaa', 'bbb'])

See the documentation for details.

score 0 · Accepted Answer

Method 1:

You can use keyed vectors from gensim.models.keyedvectors. They are very easy to use.

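from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors  # class name from older gensim releases (3.x)

# new_words and their_new_vecs are placeholders: a list of tokens
# and a matching array of vectors, one row per token.
w2v = WordEmbeddingsKeyedVectors(50)  # 50 = vector length
w2v.add(new_words, their_new_vecs)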

Method 2:

If you have already built a model with gensim.models.Word2Vec, you can do the following. Suppose you want to add an <UNK> token with a random vector:

model.wv["<UNK>"] = np.random.rand(100)  # 100 is the vector length

A complete example looks like this:

import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

dataset = api.load("text8")  # load dataset as iterable
model = Word2Vec(dataset)

model.wv["<UNK>"] = np.random.rand(100)
