2

別のスレッドに私のものと同様の質問がありますが、再現可能なコードは除外されています。

問題のスクリプトの目標は、できるだけメモリ効率の良いプロセスを作成することです。corpus()そこで、gensims の機能を利用するクラスを作成しようとしました。ただし、作成時に解決方法がわからない IndexError が発生していlda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))ます。

私が使用しているドキュメントは、gensim チュートリアルで使用したものと同じで、tutorial_example.txt に配置しました。

$ cat tutorial_example.txt 
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey

エラーを受け取りました

$./gensim_topic_modeling.py -mn2 -w'english' -l1 tutorial_example.txt 
Traceback (most recent call last):
  File "./gensim_topic_modeling.py", line 98, in <module>
    lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 306, in __init__
    self.update(corpus)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 543, in update
    self.log_perplexity(chunk, total_docs=lencorpus)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 454, in log_perplexity
    perwordbound = self.bound(chunk, subsample_ratio=subsample_ratio) / (subsample_ratio * corpus_words)
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 630, in bound
    gammad, _ = self.inference([doc])
  File "/Users/me/anaconda/lib/python2.7/site-packages/gensim/models/ldamodel.py", line 366, in inference
    expElogbetad = self.expElogbeta[:, ids]
IndexError: index 7 is out of bounds for axis 1 with size 7

以下はgensim_topic_modeling.pyスクリプトです。

##gensim_topic_modeling.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import re
import codecs
import logging
import fileinput
from operator import *
from itertools import *
from sklearn.cluster import KMeans
from gensim import corpora, models, similarities, matutils
import argparse
from nltk.corpus import stopwords

reload(sys)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stdin = codecs.getreader('utf-8')(sys.stdin)


##defs

def stop_word_gen():
    nltk_langs=['danish', 'dutch', 'english', 'french', 'german', 'italian','norwegian', 'portuguese', 'russian', 'spanish', 'swedish']
    stoplist = []
    for lang in options.stop_langs.split(","):
        if lang not in nltk_langs:
            sys.stderr.write('\n'+"Language {0} not supported".format(lang)+'\n')
            continue
        stoplist.extend(stopwords.words(lang))
    return stoplist


def clean_texts(texts):
    # remove tokens that appear only once
    all_tokens = sum(texts, [])
    tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
    return [[word for word in text if word not in tokens_once] for text in texts]

##class

class corpus(object):
    """sparse vector matrix and dictionary"""
    def __iter__(self):
        first=True
        for line in fileinput.FileInput(options.input, openhook=fileinput.hook_encoded("utf-8")):
            # assume there's one document per line; tokenizer option determines how to split
            if options.space_tokenizer:
                rl = re.compile('\s+', re.UNICODE).split(unicode(line,'utf-8'))
            else:
                rl = re.compile('\W+', re.UNICODE).split(tagRE.sub(' ',line)) 
            # create dictionary
            tokens=[token.strip().lower() for token in rl if token != '' and token.strip().lower() not in stoplist]
            if first:
                first=False
                self.dictionary=corpora.Dictionary([tokens])
            else:
                self.dictionary.add_documents([tokens])
                self.dictionary.compactify
            yield self.dictionary.doc2bow(tokens)


##main 

if __name__ == '__main__':
    ##parser
    parser = argparse.ArgumentParser(
                description="Topic model from a column of text.  Each line is a document in the corpus")
    parser.add_argument("input", metavar="args")
    parser.add_argument("-l", "--document-frequency-limit", dest="doc_freq_limit", default=1,
                help="Remove all tokens less than or equal to limit (default 1)")
    parser.add_argument("-m", "--create-model", dest="create_model", default=False, action="store_true",
                help="Create and save a model from existing dictionary and input corpus.")
    parser.add_argument("-n", "--number-of-topics", dest="number_of_topics", default=2,
                help="Number of topics (default 2)")
    parser.add_argument("-t", "--space-tokenizer", dest="space_tokenizer", default=False, action="store_true", 
                help="Use alternate whitespace tokenizer")
    parser.add_argument("-w", "--stop-word-languages", dest="stop_langs", default="danish,dutch,english,french,german,italian,norwegian,portuguese,russian,spanish,swedish",
                help="Desired languages for stopword lists")
    options = parser.parse_args()

    ##globals

    stoplist=set(stop_word_gen())  
    tagRE = re.compile(r'<.*?>', re.UNICODE)    # Remove xml/html tags
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename="topic-modeling-log")
    logr = logging.getLogger("topic_model")
    logr.info("#"*15 + " started " + "#"*15)

    ##instance of class 

    checker=corpus()
    logr.info("#"*15 + " SPARSE MATRIX (pre-filter)" + "#"*15)

    ##view sparse matrix and dictionary

    for vector in checker: 
        logr.info(vector)
    logr.info("#"*15 + " DICTIONARY (pre-filter)" + "#"*15)
    logr.info(checker.dictionary)
    logr.info(checker.dictionary.token2id)
    #filter
    checker.dictionary.filter_extremes(no_below=int(options.doc_freq_limit)+1)
    logr.info("#"*15 + " DICTIONARY (post-filter)" + "#"*15)
    logr.info(checker.dictionary)
    logr.info(checker.dictionary.token2id)

    ##Create lda model

    if options.create_model:     
        tfidf = models.TfidfModel(checker,normalize=False)
        print tfidf
        logr.info("#"*15 + " corpus_tfidf " + "#"*15)
        corpus_tfidf = tfidf[checker]
        logr.info("#"*15 + " lda " + "#"*15)
        lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=checker.dictionary, num_topics=int(options.number_of_topics))
        logr.info("#"*15 + " corpus_lda " + "#"*15)
        corpus_lda = lda[corpus_tfidf] 

        ##Evaluate topics based on threshold

        scores = list(chain(*[[score for topic,score in topic] \
                      for topic in [doc for doc in corpus_lda]]))
        threshold = sum(scores)/len(scores)
        print "threshold:",threshold
        print
        cluster1 = [j for i,j in zip(corpus_lda,documents) if i[0][1] > threshold]
        cluster2 = [j for i,j in zip(corpus_lda,documents) if i[1][1] > threshold]
        cluster3 = [j for i,j in zip(corpus_lda,documents) if i[2][1] > threshold]

結果のtopic-modeling-logファイルは次のとおりです。助けてくれてありがとう!

トピックモデリングログ

2014-05-25 02:58:50,482 : INFO : ############### started ###############
2014-05-25 02:58:50,483 : INFO : ############### SPARSE MATRIX (pre-filter)###############
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,483 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,483 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,483 : INFO : [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
2014-05-25 02:58:50,483 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
2014-05-25 02:58:50,484 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,484 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,484 : INFO : [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (27, 1), (28, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)]
2014-05-25 02:58:50,485 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,485 : INFO : [(9, 1), (26, 1), (30, 1)]
2014-05-25 02:58:50,485 : INFO : ############### DICTIONARY (pre-filter)###############
2014-05-25 02:58:50,485 : INFO : Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,485 : INFO : {'minors': 30, 'generation': 22, 'testing': 16, 'iv': 29, 'engineering': 15, 'computer': 2, 'relation': 20, 'human': 3, 'measurement': 18, 'unordered': 25, 'binary': 21, 'abc': 0, 'ordering': 31, 'graph': 26, 'system': 10, 'machine': 6, 'quasi': 32, 'random': 23, 'paths': 28, 'error': 17, 'trees': 24, 'lab': 5, 'applications': 1, 'management': 14, 'user': 12, 'interface': 4, 'intersection': 27, 'response': 8, 'perceived': 19, 'widths': 34, 'well': 33, 'eps': 13, 'survey': 9, 'time': 11, 'opinion': 7}
2014-05-25 02:58:50,486 : INFO : keeping 12 tokens which were in no less than 2 and no more than 4 (=50.0%) documents
2014-05-25 02:58:50,486 : INFO : resulting dictionary: Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : ############### DICTIONARY (post-filter)###############
2014-05-25 02:58:50,486 : INFO : Dictionary(12 unique tokens: ['minors', 'graph', 'system', 'trees', 'eps']...)
2014-05-25 02:58:50,486 : INFO : {'minors': 0, 'graph': 1, 'system': 2, 'trees': 3, 'eps': 4, 'computer': 5, 'survey': 6, 'user': 7, 'human': 8, 'time': 9, 'interface': 10, 'response': 11}
2014-05-25 02:58:50,486 : INFO : collecting document frequencies
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,486 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,486 : INFO : PROGRESS: processing document #0
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,486 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,486 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,487 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,487 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,488 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,488 : INFO : calculating IDF weights for 9 documents and 34 features (51 matrix non-zeros)
2014-05-25 02:58:50,488 : INFO : ############### corpus_tfidf ###############
2014-05-25 02:58:50,488 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,488 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : ############### lda ###############
2014-05-25 02:58:50,489 : INFO : using symmetric alpha at 0.5
2014-05-25 02:58:50,489 : INFO : using serial LDA version on this node
2014-05-25 02:58:50,489 : WARNING : input corpus stream has no len(); counting documents
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,489 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,489 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,489 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,490 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,490 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,491 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
2014-05-25 02:58:50,491 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 50 with a convergence threshold of 0
2014-05-25 02:58:50,491 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2014-05-25 02:58:50,491 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-05-25 02:58:50,491 : INFO : built Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...) from 1 documents (total 7 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(7 unique tokens: ['abc', 'lab', 'machine', 'applications', 'computer']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...) from 2 documents (total 14 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(13 unique tokens: ['abc', 'system', 'lab', 'machine', 'applications']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...) from 3 documents (total 19 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(15 unique tokens: ['abc', 'management', 'system', 'lab', 'eps']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...) from 4 documents (total 25 corpus positions)
2014-05-25 02:58:50,492 : INFO : adding document #0 to Dictionary(17 unique tokens: ['abc', 'testing', 'management', 'system', 'lab']...)
2014-05-25 02:58:50,492 : INFO : built Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...) from 5 documents (total 32 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(21 unique tokens: ['measurement', 'perceived', 'abc', 'testing', 'management']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 6 documents (total 37 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(26 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...) from 7 documents (total 41 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(29 unique tokens: ['generation', 'testing', 'engineering', 'computer', 'relation']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 8 documents (total 49 corpus positions)
2014-05-25 02:58:50,493 : INFO : adding document #0 to Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...)
2014-05-25 02:58:50,493 : INFO : built Dictionary(35 unique tokens: ['minors', 'generation', 'testing', 'iv', 'engineering']...) from 9 documents (total 52 corpus positions)
4

1 に答える 1

6

これは、同じ ID から単語へのマッピングを持たないコーパス辞書を使用することが原因です。辞書を整理dictionary.compactify()して間違ったタイミングで電話をかけると、この問題が発生する可能性があります。

簡単な例がそれを明確にします。辞書を作りましょう:

from gensim.corpora.dictionary import Dictionary
documents = [
    ['here', 'is', 'one', 'document'],
    ['here', 'is', 'another', 'document'],
]
dictionary = Dictionary()
dictionary.add_documents(documents)

このディクショナリには、これらの単語のエントリがあり、整数 ID にマップされます。ドキュメントをタプルのベクトルに変換すると便利です(id, count)(これはモデルに渡す前に行いたいことです):

vectorized_corpus = [dictionary.doc2bow(doc) for doc in corpus]

辞書を変更したい場合があります。たとえば、非常にまれな、または非常に一般的な単語を削除したい場合があります。

dictionary.filter_extremes(no_below=2, no_above=0.5, keep_n=100000)
dictionary.compactify()

単語を削除すると辞書にギャップが生じますが、呼び出すとdictionary.compactify()ID が再割り当てされてギャップが埋められます。しかし、これは、vectorized_corpus上記の が と同じ id を使用しないことを意味dictionaryし、それらをモデルに渡すと、IndexError.

解決策: 変更を加えて!を呼び出した後、辞書を使用してベクトル表現を作成します。dictionary.compactify()

于 2015-12-12T02:45:06.380 に答える