machine-learning - How to understand the output of the Topic Model class in Mallet?

Question

I'm trying out the sample code from the topic modeling developer's guide, so I'd like to understand what that code's output means.

First, during the run, it prints the following:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

Question1 : What does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only know what "10 topics" refers to.

Question2 : What does LL/token mean in " <10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353"? It looks like a metric for Gibbs sampling. But isn't it supposed to be monotonically increasing?
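To make the observation concrete, here is a small sketch (not part of the original run) that parses the LL/token values from the first part of the log above and lists the iterations where the value went down. The variable and function names are made up for illustration; the only assumption about the log is that it uses a decimal comma, as shown.

```python
import re

# LL/token lines copied from the first part of the log above.
log = """
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353
<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669
"""

def parse_ll(text):
    """Return (iteration, ll_per_token) pairs; the log uses a decimal comma."""
    pairs = []
    for iteration, value in re.findall(r"<(\d+)> LL/token: (-?\d+,\d+)", text):
        pairs.append((int(iteration), float(value.replace(",", "."))))
    return pairs

values = parse_ll(log)

# Iterations at which LL/token is lower than at the previous report.
drops = [cur[0] for prev, cur in zip(values, values[1:]) if cur[1] < prev[1]]
print(drops)  # [60, 80]
```

So the value trends upward overall but drops at some reports, which is the behaviour being asked about.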

After that, it prints the following:

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line of this part is probably the token-topic assignment, right?

Question3 : For the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)

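0.008 is said to be the "topic distribution". Is it the distribution of this topic over the whole corpus? If so, there seems to be a conflict: the tokens of topic 0 shown above appear 8+7+6+4+4+... times in the corpus, whereas the tokens of topic 7 appear only 4+3+3+3+3... times. Yet topic 7 is printed with 0.202 and topic 0 with 0.008, when by these counts topic 7 should have the lower distribution. This is what I can't understand. Also, what is the "0 0.55" at the end?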

Thank you very much for reading this long post. I hope you can answer it, and I hope this will be useful to others interested in Mallet.

Best

score 7 · Accepted Answer

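I don't think I know enough to give a very complete answer, but here's a shot at some of it... For Q1, you can inspect the code to see how those values are calculated. For Q2, LL is the model's log-likelihood divided by the total number of tokens. It is a measure of how likely the data are given the model, and increasing values mean the model is improving. These are also available in the R packages for topic modelling. Q2, yes, I think you're right about the first line. Q3, good question, it's not immediately clear to me. Perhaps the (x) are some kind of index; token frequency seems unlikely... Presumably most of these are diagnostics of some kind.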
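On Q1, a minimal sketch of how the three numbers in the "Coded LDA" line could fit together. It assumes, based on a reading of MALLET's topic model source that is not confirmed anywhere in this thread, that the topic mask is the smallest all-ones bit pattern able to hold every topic id and that the topic bits are the number of ones in that mask. The function name is hypothetical and this is Python, not MALLET code.

```python
def topic_bits_and_mask(num_topics):
    """Smallest all-ones mask that can hold topic ids 0..num_topics-1.

    Assumption: this mirrors how the "Coded LDA" line is derived.
    The function is illustrative only and is not part of MALLET.
    """
    if num_topics & (num_topics - 1) == 0:
        # Exact power of two: the ids fit in log2(num_topics) bits.
        mask = num_topics - 1
    else:
        # Otherwise round up to the next power of two, minus one.
        mask = (1 << num_topics.bit_length()) - 1
    bits = bin(mask).count("1")
    return bits, mask

bits, mask = topic_bits_and_mask(10)
print(f"{10} topics, {bits} topic bits, {mask:b} topic mask")
# prints: 10 topics, 4 topic bits, 1111 topic mask
```

Under that assumption the line is bookkeeping about how topic ids are stored rather than a property of the fitted model. A mask like this is typically used to pack a topic id and a count into a single integer; whether MALLET does exactly that is worth confirming in the source.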

You can get a more useful set of diagnostics by running bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml, which produces a large number of measures of topic quality. They're definitely worth checking out.

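For the full story on all of this, I'd suggest emailing David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel, and then posting the answers back here for those of us curious about the inner workings of MALLET...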

score 1 · Accepted Answer

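For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over the topic distributions of documents. Mallet optimizes this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for a small proportion of your corpus.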
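One check that can be done from the printed numbers alone is to add up the second column of the ten topic lines in the question. The sketch below does only that; the list is copied by hand from the output, and the variable names are made up.

```python
# Second column of the ten topic lines printed in the question (topics 0..9).
column = [0.008, 0.008, 0.040, 0.008, 0.073, 0.008, 0.202, 0.202, 0.040, 0.411]

total = sum(column)
heaviest = max(range(len(column)), key=lambda topic: column[topic])

print(round(total, 3))  # 1.0
print(heaviest)         # 9
```

The column adds up to 1, which is what a normalized distribution over the 10 topics would look like. That is consistent with reading the value as a share of "weight" per topic, but it does not by itself settle which distribution is being printed.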

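The token counts shown are only those of the words with the highest counts. For example, the remaining counts for topic 0 could all be 0, while the remaining counts for topic 9 could all be 3. So topic 9 can account for more of the words in the corpus than topic 0, even though the counts of its top words are lower.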
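The argument above can be sketched with a little arithmetic. The top-5 counts are taken from the output in the question; the tails (everything ranked 6th and below) are HYPOTHETICAL numbers chosen only to illustrate the point, since the output does not show them.

```python
# Top-5 counts as printed in the question for topics 0 and 9.
top5 = {0: [8, 7, 6, 4, 4], 9: [4, 3, 3, 3, 3]}

# HYPOTHETICAL tails (not in the output): counts of the words ranked 6th and below.
# Topic 0 is given no further words; topic 9 is given 40 more words with count 3.
tail = {0: [], 9: [3] * 40}

shown = {topic: sum(counts) for topic, counts in top5.items()}
totals = {topic: shown[topic] + sum(tail[topic]) for topic in top5}

for topic in (0, 9):
    print(topic, shown[topic], totals[topic])
# 0 29 29
# 9 16 136
```

With these made-up tails, topic 9 covers far more tokens than topic 0 even though its five printed counts sum to less, so the printed counts alone cannot rank the topics by size.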

For the "0 0.55" at the end, I'd have to check the code, but it's probably the optimized \beta value (I'm pretty sure that one is not optimized asymmetrically).
