r - Can text2vec and topicmodels generate similar topics with suitable parameter settings for LDA?

Question

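I was wondering how the results of different packages, and hence of their algorithms, differ, and whether parameters can be set in a way that produces similar topics. I looked in particular at the packages text2vec and topicmodels.

I used the code below to compare 10 topics generated with these packages (see the code section for the terms). I could not manage to generate sets of topics with similar meaning. For example, topic 10 from text2vec has something to do with "police", whereas none of the topics produced by topicmodels refers to "police" or similar terms. Likewise, for topic 5 produced by topicmodels, which relates to "life, love, family, war", I could not identify a counterpart among the topics produced by text2vec.

I am a beginner with LDA, so my understanding may sound naive to experienced programmers. Intuitively, however, one would assume that it should be possible to produce sets of topics with similar meaning in order to demonstrate the validity/robustness of the results. Not necessarily the exact same sets of terms, of course, but term lists that address similar topics.

Perhaps the issue is simply that my human interpretation of these term lists is not good enough to capture the similarities, but there may also be parameters that would increase the similarity for human interpretation. Can someone guide me on how to set parameters to achieve this, or otherwise provide explanations or hints on suitable resources to improve my understanding of the matter?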

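Here are some issues that might be relevant:

I know that text2vec does not use standard Gibbs sampling but WarpLDA, which is an algorithmic difference from topicmodels. If my understanding is correct, the priors alpha and delta used in topicmodels are set as doc_topic_prior and topic_word_prior in text2vec, respectively.
Furthermore, in post-processing text2vec allows adjusting lambda to sort the terms of a topic based on their frequency. I have not yet understood how terms are sorted in topicmodels - is it comparable to setting lambda = 1? (I have tried different lambdas between 0 and 1 without getting similar topics.)
Another issue is that it seems difficult to produce a fully reproducible example even when setting the seed (see, e.g., this question). This is not my direct question, but it might make answering more difficult.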
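
To make the role of lambda in the list above concrete, here is a minimal base R sketch. It assumes that lambda follows the "relevance" definition from LDAvis, i.e. relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), which is my reading of the text2vec documentation and not something taken from its source code. The matrix `phi`, the helper `relevance_rank` and all numbers are made up for illustration, and the marginal p(w) is simply taken as the mean over topics.

```r
# Toy topic-word probabilities (made-up numbers): rows = topics, columns = terms
phi <- rbind(
  topic1 = c(film = 0.5, police = 0.3, cop = 0.2, love = 0.0),
  topic2 = c(film = 0.4, police = 0.1, cop = 0.0, love = 0.5)
)

# Marginal term probability, ASSUMING equally weighted topics
p_w <- colMeans(phi)

# Hypothetical helper: rank the terms of one topic by LDAvis-style relevance
relevance_rank <- function(phi, p_w, topic, lambda) {
  p <- phi[topic, ]
  keep <- p > 0  # terms with zero probability have no finite log and are dropped
  rel <- lambda * log(p[keep]) + (1 - lambda) * log(p[keep] / p_w[keep])
  names(sort(rel, decreasing = TRUE))
}

by_probability <- relevance_rank(phi, p_w, "topic1", lambda = 1)
by_lift        <- relevance_rank(phi, p_w, "topic1", lambda = 0)

print(by_probability)  # "film" "police" "cop"
print(by_lift)         # "cop" "police" "film"
```

With lambda = 1 the ranking is purely by p(w|t), which is what I would expect a plain "most probable terms" listing to show; with lambda = 0 the term "film", which is frequent in every topic, drops to the bottom.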

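Sorry for the long question, and thanks in advance for any help or suggestions.

Update 2: I have moved the content of my first update into an answer, which is based on a more complete analysis.

Update: Following the helpful comment of the text2vec package creator Dmitriy Selivanov, I can confirm that setting lambda = 1 increases the similarity of topics between the term lists produced by the two packages.

Furthermore, I had a closer look at the differences between the term lists produced by both packages via a quick check of length(setdiff()) and length(intersect()) across topics (see the code below). This rough check shows that text2vec discards several terms per topic - maybe by a probability threshold for individual topics? topicmodels keeps all terms for all topics. This explains part of the differences in meaning that can be derived (by humans) from the term lists.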
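
As a complement to the counts described above, the overlap between two term lists can be normalised so that topics from the two packages can be paired up. The following is only a sketch on made-up data: `terms_a` and `terms_b` are hypothetical stand-ins for the matrices returned by terms() and get_top_words(), and the Jaccard index is my own choice of measure, not something either package provides.

```r
# Hypothetical top-term matrices, one column per topic;
# NA marks a slot for which a model returned no term
terms_a <- cbind(A1 = c("police", "cop", "crime", "murder"),
                 A2 = c("love", "family", "war", "life"))
terms_b <- cbind(B1 = c("family", "love", "mother", NA),
                 B2 = c("cop", "police", "killer", "crime"))

# Jaccard index of two term lists, ignoring NA entries
jaccard <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  length(intersect(x, y)) / length(union(x, y))
}

# Similarity matrix: rows = topics of model A, columns = topics of model B
sim <- sapply(colnames(terms_b), function(j) {
  sapply(colnames(terms_a), function(i) jaccard(terms_a[, i], terms_b[, j]))
})

# For each topic of model A, the most similar topic of model B
best_match <- colnames(sim)[apply(sim, 1, which.max)]
names(best_match) <- rownames(sim)

print(sim)
print(best_match)
```

Because topic numbering is arbitrary in both packages, comparing "topic 10" with "topic 10" is not meaningful in itself; pairing topics by their best overlap first seems a fairer basis for a human comparison.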

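As mentioned above, generating a reproducible example seems to be difficult, so I have not adapted all the example outputs in the code below. Since the run time is short, anyone can check on their own system.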

    library(text2vec)
    library(topicmodels)
    library(slam) #to convert dtm to simple triplet matrix for topicmodels
    library(magrittr) #provides the %>% pipe used below

    ntopics <- 10
    alphaprior <- 0.1
    deltaprior <- 0.001
    niter <- 1000
    convtol <- 0.001
    set.seed(0) #for text2vec
    seedpar <- 0 #for topicmodels

    #Generate document term matrix with text2vec    
    tokens = movie_review$review[1:1000] %>% 
             tolower %>% 
             word_tokenizer

    it = itoken(tokens, ids = movie_review$id[1:1000], progressbar = FALSE)

    vocab = create_vocabulary(it) %>%
            prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)

    vectorizer = vocab_vectorizer(vocab)

    dtm = create_dtm(it, vectorizer, type = "dgTMatrix")


    #LDA model with text2vec
    lda_model = text2vec::LDA$new(n_topics = ntopics
                                  ,doc_topic_prior = alphaprior
                                  ,topic_word_prior = deltaprior
                                  )

    doc_topic_distr = lda_model$fit_transform(x =  dtm
                                              ,n_iter = niter
                                              ,convergence_tol = convtol
                                              ,n_check_convergence = 25
                                              ,progressbar = FALSE
                                              )    


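    #LDA model with topicmodels
    #Gibbs settings go in `control`; LDA_Gibbscontrol is the class they are
    #coerced to, not an argument of LDA(). The Gibbs control has no `tol`,
    #and "seeded" initialization is a VEM option, so neither is set here.
    ldatopicmodels <- topicmodels::LDA(as.simple_triplet_matrix(dtm), k = ntopics, method = "Gibbs",
                                       control = list(burnin = 100
                                                      ,delta = deltaprior
                                                      ,alpha = alphaprior
                                                      ,iter = niter
                                                      ,keep = 50
                                                      ,seed = seedpar
                                       )
    )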

    #show top 15 words
    lda_model$get_top_words(n = 15, topic_number = c(1:10), lambda = 0.3)
    #       [,1]        [,2]        [,3]        [,4]       [,5]         [,6]         [,7]         [,8]      [,9]         [,10]       
    # [1,] "finally"   "men"       "know"      "video"    "10"         "king"       "five"       "our"     "child"      "cop"       
    # [2,] "re"        "always"    "ve"        "1"        "doesn"      "match"      "atmosphere" "husband" "later"      "themselves"
    # [3,] "three"     "lost"      "got"       "head"     "zombie"     "lee"        "mr"         "comedy"  "parents"    "mary"      
    # [4,] "m"         "team"      "say"       "girls"    "message"    "song"       "de"         "seem"    "sexual"     "average"   
    # [5,] "gay"       "here"      "d"         "camera"   "start"      "musical"    "may"        "man"     "murder"     "scenes"    
    # [6,] "kids"      "within"    "funny"     "kill"     "3"          "four"       "especially" "problem" "tale"       "police"    
    # [7,] "sort"      "score"     "want"      "stupid"   "zombies"    "dance"      "quality"    "friends" "television" "appears"   
    # [8,] "few"       "thriller"  "movies"    "talking"  "movies"     "action"     "public"     "given"   "okay"       "trying"    
    # [9,] "bit"       "surprise"  "let"       "hard"     "ask"        "fun"        "events"     "crime"   "cover"      "waiting"   
    # [10,] "hot"       "own"       "thinking"  "horrible" "won"        "tony"       "u"          "special" "stan"       "lewis"     
    # [11,] "die"       "political" "nice"      "stay"     "open"       "twist"      "kelly"      "through" "uses"       "imdb"      
    # [12,] "credits"   "success"   "never"     "back"     "davis"      "killer"     "novel"      "world"   "order"      "candy"     
    # [13,] "two"       "does"      "bunch"     "didn"     "completely" "ending"     "copy"       "show"    "strange"    "name"      
    # [14,] "otherwise" "beauty"    "hilarious" "room"     "love"       "dancing"    "japanese"   "new"     "female"     "low"       
    # [15,] "need"      "brilliant" "lot"       "minutes"  "away"       "convincing" "far"        "mostly"  "girl"       "killing"   

    terms(ldatopicmodels, 15)
    #      Topic 1     Topic 2   Topic 3       Topic 4   Topic 5    Topic 6       Topic 7     Topic 8      Topic 9    Topic 10
    # [1,] "show"     "where"   "horror"       "did"     "life"    "such"      "m"         "films"       "man"      "seen"       
    # [2,] "years"    "minutes" "pretty"       "10"      "young"   "character" "something" "music"       "new"      "movies"     
    # [3,] "old"      "gets"    "best"         "now"     "through" "while"     "re"        "actors"      "two"      "plot"       
    # [4,] "every"    "guy"     "ending"       "why"     "love"    "those"     "going"     "role"        "though"   "better"     
    # [5,] "series"   "another" "bit"          "saw"     "woman"   "does"      "things"    "performance" "big"      "worst"          
    # [6,] "funny"    "around"  "quite"        "didn"    "us"      "seems"     "want"      "between"     "back"     "interesting"
    # [7,] "comedy"   "nothing" "little"       "say"     "real"    "book"      "thing"     "love"        "action"   "your"       
    # [8,] "again"    "down"    "actually"     "thought" "our"     "may"       "know"      "play"        "shot"     "money"      
    # [9,] "tv"       "take"    "house"        "still"   "war"     "work"      "ve"        "line"        "together" "hard"       
    # [10,] "watching" "these"   "however"      "end"     "father"  "far"       "here"      "actor"       "against"  "poor"       
    # [11,] "cast"     "fun"     "cast"         "got"     "find"    "scenes"    "doesn"     "star"        "title"    "least"      
    # [12,] "long"     "night"   "entertaining" "2"       "human"   "both"      "look"      "never"       "go"       "say"        
    # [13,] "through"  "scene"   "must"         "am"      "shows"   "yet"       "isn"       "played"      "city"     "director"   
    # [14,] "once"     "back"    "each"         "done"    "family"  "audience"  "anything"  "hollywood"   "came"     "probably"   
    # [15,] "watched"  "dead"    "makes"        "3"       "mother"  "almost"    "enough"    "always"      "match"    "video" 

#UPDATE

#number of terms in each model is the same
length(ldatopicmodels@terms)
# [1] 2170
nrow(vocab)
# [1] 2170

#number of NA entries for termlist of first topic differs
sum(is.na(
          lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)[,1]
         )
    )
#[1] 1778

sum(is.na(
          terms(ldatopicmodels, length(ldatopicmodels@terms))
         )
   )
#[1] 0


#function to check number of terms that differ between two sets of topic collections (excluding NAs)
lengthsetdiff <- function(x, y) {

  apply(x, 2, function(i) {

    apply(y, 2, function(j) {

      length(setdiff(i[!is.na(i)],j[!is.na(j)]))
    })

  })

}


#apply the check
termstopicmodels <- terms(ldatopicmodels,length(ldatopicmodels@terms))
termstext2vec <- lda_model$get_top_words(n = nrow(vocab), topic_number = c(1:10), lambda = 1)


lengthsetdiff(termstopicmodels,
          termstopicmodels)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# Topic 1        0       0       0       0       0       0       0       0       0        0
# Topic 2        0       0       0       0       0       0       0       0       0        0
# Topic 3        0       0       0       0       0       0       0       0       0        0
# Topic 4        0       0       0       0       0       0       0       0       0        0
# Topic 5        0       0       0       0       0       0       0       0       0        0
# Topic 6        0       0       0       0       0       0       0       0       0        0
# Topic 7        0       0       0       0       0       0       0       0       0        0
# Topic 8        0       0       0       0       0       0       0       0       0        0
# Topic 9        0       0       0       0       0       0       0       0       0        0
# Topic 10       0       0       0       0       0       0       0       0       0        0

lengthsetdiff(termstext2vec,
              termstext2vec)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,]    0  340  318  335  292  309  320  355  294   322
# [2,]  355    0  321  343  292  319  311  346  302   339
# [3,]  350  338    0  316  286  309  311  358  318   322
# [4,]  346  339  295    0  297  310  301  335  309   332
# [5,]  345  330  307  339    0  310  310  354  309   333
# [6,]  350  345  318  340  298    0  311  342  308   325
# [7,]  366  342  325  336  303  316    0  364  311   325
# [8,]  355  331  326  324  301  301  318    0  311   335
# [9,]  336  329  328  340  298  309  307  353    0   314
# [10,]  342  344  310  341  300  304  299  355  292     0

lengthsetdiff(termstopicmodels,
              termstext2vec)
# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,]    1778    1778    1778    1778    1778    1778    1778    1778    1778     1778
# [2,]    1793    1793    1793    1793    1793    1793    1793    1793    1793     1793
# [3,]    1810    1810    1810    1810    1810    1810    1810    1810    1810     1810
# [4,]    1789    1789    1789    1789    1789    1789    1789    1789    1789     1789
# [5,]    1831    1831    1831    1831    1831    1831    1831    1831    1831     1831
# [6,]    1819    1819    1819    1819    1819    1819    1819    1819    1819     1819
# [7,]    1824    1824    1824    1824    1824    1824    1824    1824    1824     1824
# [8,]    1778    1778    1778    1778    1778    1778    1778    1778    1778     1778
# [9,]    1820    1820    1820    1820    1820    1820    1820    1820    1820     1820
# [10,]    1798    1798    1798    1798    1798    1798    1798    1798    1798     1798

lengthsetdiff(termstext2vec,
              termstopicmodels)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# Topic 1     0    0    0    0    0    0    0    0    0     0
# Topic 2     0    0    0    0    0    0    0    0    0     0
# Topic 3     0    0    0    0    0    0    0    0    0     0
# Topic 4     0    0    0    0    0    0    0    0    0     0
# Topic 5     0    0    0    0    0    0    0    0    0     0
# Topic 6     0    0    0    0    0    0    0    0    0     0
# Topic 7     0    0    0    0    0    0    0    0    0     0
# Topic 8     0    0    0    0    0    0    0    0    0     0
# Topic 9     0    0    0    0    0    0    0    0    0     0
# Topic 10    0    0    0    0    0    0    0    0    0     0

#also the intersection can be checked between the two sets
lengthintersect <- function(x, y) {

  apply(x, 2, function(i) {

    apply(y, 2, function(j) {

      length(intersect(i[!is.na(i)], j[!is.na(j)]))
    })

  })

}

lengthintersect(termstopicmodels,
                termstext2vec)

# Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
# [1,]     392     392     392     392     392     392     392     392     392      392
# [2,]     377     377     377     377     377     377     377     377     377      377
# [3,]     360     360     360     360     360     360     360     360     360      360
# [4,]     381     381     381     381     381     381     381     381     381      381
# [5,]     339     339     339     339     339     339     339     339     339      339
# [6,]     351     351     351     351     351     351     351     351     351      351
# [7,]     346     346     346     346     346     346     346     346     346      346
# [8,]     392     392     392     392     392     392     392     392     392      392
# [9,]     350     350     350     350     350     350     350     350     350      350
# [10,]     372     372     372     372     372     372     372     372     372      372
