python - Document topic distribution in Gensim LDA

Question

I fit an LDA topic model on a toy corpus as follows:

from gensim import corpora
from gensim.models import LdaModel

documents = ['Human machine interface for lab abc computer applications',
             'A survey of user opinion of computer system response time',
             'The EPS user interface management system',
             'System and human system engineering testing of EPS',
             'Relation of user perceived response time to error measurement',
             'The generation of random binary unordered trees',
             'The intersection graph of paths in trees',
             'Graph minors IV Widths of trees and well quasi ordering',
             'Graph minors A survey']

# Lowercase each document and split it on whitespace
texts = [document.lower().split() for document in documents]
dictionary = corpora.Dictionary(texts)

# Bag-of-words corpus (not shown in the original post; presumably built like this)
corpus = [dictionary.doc2bow(text) for text in texts]

# Map token ids back to words
id2word = {token_id: word for word, token_id in dictionary.token2id.items()}

When I fit the model with a small number of topics, I found that Gensim reports the complete topic distribution, over every topic, for a test document. For example:

test_lda = LdaModel(corpus,num_topics=5, id2word=id2word)
test_lda[dictionary.doc2bow('human system')]

Out[314]: [(0, 0.59751626959781134),
(1, 0.10001902477790173),
(2, 0.10001375856907335),
(3, 0.10005453508763221),
(4, 0.10239641196758137)]

With a large number of topics, however, the report is no longer complete:

test_lda = LdaModel(corpus,num_topics=100, id2word=id2word)

test_lda[dictionary.doc2bow('human system')]
Out[315]: [(73, 0.50499999999997613)]

It looks as though any topic whose probability falls below some threshold (0.01, as far as I can tell) is omitted from the output.
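A rough back-of-the-envelope sketch of why a 0.01 cutoff would hide topics only in the 100-topic case. This is not Gensim's code. It assumes a symmetric Dirichlet prior of 1 / num_topics (Gensim's documented default for `alpha`) and, as a simplifying assumption, a query that contributes a single counted token assigned entirely to one topic. The helper names `idealized_topic_mix` and `apply_threshold` are hypothetical. Under those assumptions the numbers land close to the ones reported above.

```python
def idealized_topic_mix(num_topics, dominant_topic, token_count=1.0):
    """Posterior-mean topic proportions under a symmetric Dirichlet prior.

    Idealization: every counted token is assigned to `dominant_topic`.
    """
    alpha = 1.0 / num_topics          # assumed symmetric prior per topic
    gamma = [alpha] * num_topics      # prior pseudo-counts
    gamma[dominant_topic] += token_count
    total = sum(gamma)
    return [g / total for g in gamma]


def apply_threshold(proportions, minimum_probability=0.01):
    """Keep only (topic_id, probability) pairs at or above the cutoff."""
    return [(k, p) for k, p in enumerate(proportions) if p >= minimum_probability]


small = apply_threshold(idealized_topic_mix(num_topics=5, dominant_topic=0))
large = apply_threshold(idealized_topic_mix(num_topics=100, dominant_topic=73))

print(small)  # five entries: about 0.6 for topic 0, about 0.1 for the rest
print(large)  # one entry: topic 73 at about 0.505; the others sit near 0.005
```

With 5 topics every background topic keeps about 0.1 of the mass and survives the cutoff; with 100 topics each background topic has only about 0.005 and is dropped, even though together they hold roughly half of the mass.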

Is this behavior just for aesthetic reasons? And how can I get the distribution of the residual probability mass over all the other topics?

Thank you for your kind answers!
