python - Understanding LDA implementation using gensim

Question

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:

Define the dataset

documents = ["Apple is releasing a new product", 
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]

After removing stopwords, I create the dictionary and the corpus:

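import gensim
from gensim import corpora

# stoplist is assumed to be a collection of stopwords defined beforehand
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]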

Then I define the LDA model.

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)

Then I print the topics.

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product', '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new', '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is', '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new', '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>>

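I cannot understand much from this result. Is it giving the probability of occurrence of each word? Also, what is the meaning of topic #1, topic #2, and so on? I was expecting something more or less like the most important keywords.

I have already checked the gensim tutorial, but it did not really help much.

Thanks.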

score 21 · Accepted Answer

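The answer you are looking for is in the gensim tutorial. lda.printTopics(k) prints the most contributing words for k randomly selected topics. You can assume that this is (partially) the distribution of words over each of the given topics, that is, the probability of each word appearing in the topic listed to its left.

Usually, one would run LDA on a large corpus. Running LDA on a ridiculously small sample will not give the best results.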
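To make the reading of that output concrete, here is a minimal sketch that takes the topic #0 string from the question and splits it into (word, weight) pairs. The helper name parse_topic is hypothetical and not part of gensim, and the exact string format may differ between gensim versions, so treat this as an illustration only.

```python
# Topic string copied from the question's output for topic #0.
topic_0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
           "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
           "0.031*acquisition + 0.031*product")


def parse_topic(topic_string):
    """Hypothetical helper: split 'weight*word + ...' into (word, weight) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*")
        pairs.append((word, float(weight)))
    return pairs


pairs = parse_topic(topic_0)

# Words whose weight stands out from the rest of the listed words.
top_words = [word for word, weight in pairs if weight > 0.1]

# Total probability mass covered by the ten printed words.
listed_mass = sum(weight for _, weight in pairs)

print(top_words)
print(round(listed_mass, 3))
```

The ten printed weights add up to roughly 0.91, which is consistent with the output showing only part of the topic's word distribution.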

score 19 · Accepted Answer

I think this tutorial will help you understand everything very clearly - https://www.youtube.com/watch?v=DDq3OVp9dNA

I too faced a lot of problems understanding it at first. I will try to outline a few points in a nutshell.

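In Latent Dirichlet Allocation,

The order of words in a document is not important - Bag of Words model.
A document is a distribution over topics
Each topic, in turn, is a distribution over words belonging to the vocabulary
LDA is a probabilistic generative model. It is used to infer hidden variables using a posterior distribution.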
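As a small illustration of the first point, the sketch below uses only the standard library to show that two documents with the same words in a different order have the same bag-of-words representation. The second document is a made-up reordering of one of the question's sentences.

```python
from collections import Counter

# One sentence from the question (lowercased) and a hypothetical reordering of it.
doc_a = "microsoft announces nokia acquisition".split()
doc_b = "acquisition nokia announces microsoft".split()

# A bag of words only records how often each word occurs.
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(bow_a == bow_b)  # True: word order is discarded
```

This is the kind of information that survives in the corpus passed to the model: word counts per document, with no positions.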

Imagine the process of creating a document to be something like this -

Choose a distribution over topics
Draw a topic, and choose a word from that topic. Repeat this for each word in the document
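The generative story above can be sketched in a few lines of standard-library Python. The two topics, their word probabilities, the alpha value, and the function names are all made up for illustration; the sketch repeats the "draw a topic, then draw a word" step once per word position, which is the usual way the process is described.

```python
import random

rng = random.Random(0)

# Hypothetical topics: each one is a distribution over words.
topics = {
    "retail": {"amazon": 0.4, "sells": 0.3, "things": 0.3},
    "tech": {"microsoft": 0.4, "nokia": 0.3, "acquisition": 0.3},
}


def sample_topic_mixture(topic_names, alpha, rng):
    """Draw topic proportions from a symmetric Dirichlet(alpha)
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in topic_names]
    total = sum(draws)
    return {name: d / total for name, d in zip(topic_names, draws)}


def generate_document(num_words, alpha, rng):
    # Step 1: choose this document's distribution over topics.
    mixture = sample_topic_mixture(list(topics), alpha, rng)
    words = []
    for _ in range(num_words):
        # Step 2: draw a topic, then draw a word from that topic.
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return mixture, words


mixture, words = generate_document(num_words=8, alpha=0.5, rng=rng)
print(mixture)
print(words)
```

A small alpha tends to produce documents dominated by one topic, while a larger alpha gives more even mixtures.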

LDA is sort of backtracking along this line: given that you have a bag of words representing a document, what could be the topics it is representing?
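Here is a deliberately simplified sketch of that backtracking idea. It is not how gensim performs inference and it is not full LDA: it assumes each document comes from a single topic with a uniform prior, and it approximates topics #0 and #2 from the question's output with just two weights (0.181 for the prominent words, 0.031 for any other word). The function names are hypothetical.

```python
# Approximate word weights taken from topics #0 and #2 in the question's output.
HIGH, LOW = 0.181, 0.031
topic_words = {
    0: {"things", "amazon", "many", "sells"},
    2: {"microsoft", "announces", "acquisition", "nokia"},
}


def word_prob(word, topic_id):
    """Assumed P(word | topic): HIGH for the topic's prominent words, LOW otherwise."""
    return HIGH if word in topic_words[topic_id] else LOW


def topic_posterior(bag_of_words):
    """P(topic | document) under a uniform prior and one topic per document."""
    likelihood = {}
    for topic_id in topic_words:
        p = 1.0
        for word in bag_of_words:
            p *= word_prob(word, topic_id)
        likelihood[topic_id] = p
    total = sum(likelihood.values())
    return {t: p / total for t, p in likelihood.items()}


posterior = topic_posterior(["amazon", "sells", "many", "things"])
print(posterior)
```

Under these assumptions nearly all of the posterior mass lands on topic 0 for the Amazon sentence, which is the intuition behind reading topics back from a bag of words.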

So, in your case, the first topic (0)

INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product

is more about things, amazon and many, since they have a higher proportion, and not so much about microsoft or apple, which have a significantly lower value.

I would suggest reading this blog for a much better understanding (Edwin Chen is a genius!) - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

score 11 · Accepted Answer

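Since the answers above were posted, some very nice visualization tools have become available for building an intuition of LDA using gensim.

Take a look at the pyLDAvis package. There is a great notebook overview, and a very helpful video explanation aimed at end users (a 9-minute tutorial).

Hope this helps!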
