python - Encoding error when using word2vec

Question

When I run my code, I get the following error:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    print model.most_similar(positive=['男人'])
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 660, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word '\xe7\x94\xb7\xe4\xba\xba' not in vocabulary"

Here is my code:

    # -*- coding: utf-8 -*-
    from gensim.models import word2vec
    import logging

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    sentences = word2vec.Text8Corpus('/tmp/text8')
    model = word2vec.Word2Vec(sentences, size=200)
    model.most_similar(['男人'])

score 1 · Accepted Answer

It works with the following change: model.most_similar([u'男人'])

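In other words, you are probably passing a UTF-8 encoded byte string rather than a Unicode string. I recommend decoding input to Unicode as soon as it comes in, working with Unicode internally, and encoding only on output.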

Call .decode('utf-8') on your string.
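
As a rough sketch of the decode-on-input, encode-on-output advice, the snippet below does not use gensim at all. `vocab` is a hypothetical stand-in for a vocabulary keyed by Unicode strings, and `raw` is the byte sequence that appears in the KeyError message above.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for a model vocabulary keyed by Unicode strings
# (an assumption for illustration, not gensim's actual data structure).
vocab = {u'男人': 0, u'女人': 1}

# The UTF-8 bytes shown in the KeyError message from the traceback.
raw = b'\xe7\x94\xb7\xe4\xba\xba'

# The raw bytes are not the same key as the Unicode string, so the lookup misses.
found_raw = raw in vocab

# Decode at the input boundary, then work with Unicode internally.
word = raw.decode('utf-8')
found_decoded = word in vocab

# Encode again only when bytes are needed for output.
encoded = word.encode('utf-8')
```

Here `found_raw` is false while `found_decoded` is true, which mirrors why the query only succeeds once the word is a Unicode string.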
