python - gensim LDA implementation: distance between two different documents

Question

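Edit: I found an interesting issue here. This link shows that gensim uses randomness in both the training and the inference step. The suggestion there is to set a fixed seed so that the results are the same on every run. But why do I get the same probability for every topic?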

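What I want to do is find the topics of every Twitter user and compute the similarity between Twitter users based on the similarity of their topics. Is it possible with gensim to compute the same set of topics for every user, or do I have to compute a dictionary of topics and cluster all the user topics?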
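One way to turn per-user topic distributions into a similarity score is a distance between probability distributions, such as the Hellinger distance. The following is a minimal sketch, not something confirmed by the question: it assumes each user has already been reduced to a gensim-style sparse list of (topic_id, probability) pairs, and the name `hellinger_distance` and the sample vectors are made up for illustration.

```python
import math


def hellinger_distance(dist_a, dist_b):
    """Hellinger distance between two sparse topic distributions.

    Each argument is a list of (topic_id, probability) pairs; topics
    missing from a list are treated as having probability 0.
    """
    a = dict(dist_a)
    b = dict(dist_b)
    topics = set(a) | set(b)
    squared = sum(
        (math.sqrt(a.get(t, 0.0)) - math.sqrt(b.get(t, 0.0))) ** 2
        for t in topics
    )
    return math.sqrt(squared / 2.0)


# Hypothetical topic distributions for two users (not real model output).
user1 = [(0, 0.7), (1, 0.2), (2, 0.1)]
user2 = [(0, 0.1), (1, 0.2), (2, 0.7)]

distance = hellinger_distance(user1, user2)  # 0 = identical, 1 = no overlap
similarity = 1.0 - distance
```

gensim also ships comparable helpers in `gensim.matutils` (for example `hellinger` and `cossim`), which could be used instead of the hand-written function. Note that if both users come out with the same uniform distribution, any such distance is 0, so the distributions have to be informative first.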

In general, what is the best way to compare two Twitter users based on topics extracted with a gensim topic model? My code is as follows:

    from collections import Counter

    from gensim import corpora, models
    from gensim.models import ldamodel


    def preprocess(id):  # Returns the user's word list (or list of the user's tweets)
        # user_corpus() is defined elsewhere (not shown here)
        user_list = user_corpus(id, 'user_' + str(id) + '.txt')
        with open('user_' + str(id) + '.txt') as f:
            documents = [line for line in f]
        # remove stop words
        with open('stoplist.txt') as f:
            stoplist = set(line.rstrip() for line in f)
        texts = [[word for word in document.lower().split() if word not in stoplist]
                 for document in documents]
        # remove words that appear fewer than 3 times
        counts = Counter(word for text in texts for word in text)
        texts = [[word for word in text if counts[word] >= 3]
                 for text in texts]
        # flatten all tweets into a single word list
        return [word for text in texts for word in text]


    words1 = preprocess(14937173)
    words2 = preprocess(15386966)
    # Load the trained model
    lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
    dictionary = corpora.Dictionary.load('tmp/fashion1.dict')  # Load the trained dict

    corpus = [dictionary.doc2bow(words1)]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    corpus_lda = lda[corpus_tfidf]

    list1 = []
    for item in corpus_lda:
        list1.append(item)

    print(lda.show_topic(0))
    corpus2 = [dictionary.doc2bow(words2)]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf2 = tfidf2[corpus2]
    corpus_lda2 = lda[corpus_tfidf2]

    list2 = []
    for it in corpus_lda2:
        list2.append(it)

    print(list1[0])  # topic distribution for the first user
    print(list2[0])  # topic distribution for the second user

Topic probabilities returned for the user corpus (when the list of the user's words is used as the corpus):

 [(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002),
  (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002),
  (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002),
  (9, 0.10000000000000002)]

When I use the list of the user's tweets instead, I get the topics computed for every tweet.

Question 2: Does it make sense to train the LDA model on several Twitter users and then use that previously computed LDA model to compute the topics for every user (every user corpus)?

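In the example provided, list[0] returns a topic distribution in which every topic has the same probability, 0.1. Basically, each line of the text corresponds to a different tweet. If I compute the corpus with corpus = [dictionary.doc2bow(text) for text in texts], I get the probabilities for every tweet separately. If, on the other hand, I use corpus = [dictionary.doc2bow(words)] as in the example, the corpus is just the full list of the user's words. In this second case gensim returns the same probability for every topic, so I get the same topic distribution for both users.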
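A possible explanation for the uniform result in the single-document case, offered as a hedged sketch rather than a confirmed diagnosis: gensim's default TF-IDF weighting uses an inverse document frequency of log2(total_docs / doc_freq). When a TfidfModel is fit on a corpus that contains only one document, every term has doc_freq == total_docs == 1, so every weight becomes zero and the LDA model is handed an effectively empty document. The snippet below reimplements that formula in plain Python. It does not call gensim, it omits the vector normalization gensim applies, and `idf` and `tfidf_weights` are illustrative names.

```python
import math
from collections import Counter


def idf(doc_freq, total_docs):
    # Inverse document frequency, log base 2 (assumed to match gensim's default).
    return math.log(total_docs / doc_freq, 2)


def tfidf_weights(corpus):
    """Return one {term: tf * idf} dict per document; zero weights are dropped."""
    total_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weighted = []
    for doc in corpus:
        tf = Counter(doc)
        weights = {t: tf[t] * idf(doc_freq[t], total_docs) for t in tf}
        weighted.append({t: w for t, w in weights.items() if w != 0.0})
    return weighted


# One document holding all of a user's words: every weight vanishes.
single_doc_corpus = [["dress", "shoes", "dress", "bag"]]
single_result = tfidf_weights(single_doc_corpus)

# Several documents (e.g. one per tweet or per user): terms keep a weight.
multi_doc_corpus = [["dress", "shoes"], ["dress", "bag"], ["shoes", "hat"]]
multi_result = tfidf_weights(multi_doc_corpus)
```

If this is what is happening, fitting the TF-IDF model on a multi-document corpus, or passing the bag-of-words vector to the LDA model directly, would likely avoid the empty input.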

Should the user's text corpus be a list of words or a list of sentences (a list of tweets)?

Regarding the implementation by Qi He and Jianshu Weng in the TwitterRank approach, page 264 states that each document therefore corresponds to a twitterer. If a document is all of a user's tweets, what should the corpus contain?
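Read literally, "each document corresponds to a twitterer" suggests a corpus with one entry per user, where each entry holds all of that user's tweets. A minimal sketch of that reading follows. The tweet texts and the helper name `build_user_documents` are hypothetical, and this is one interpretation, not the paper's confirmed procedure.

```python
from collections import defaultdict

# Hypothetical (user_id, tweet_text) pairs standing in for real data.
tweets = [
    (14937173, "new summer dress collection"),
    (15386966, "running shoes on sale"),
    (14937173, "love this dress"),
]


def build_user_documents(tweets):
    """Concatenate each user's tweets into a single token list."""
    docs = defaultdict(list)
    for user_id, text in tweets:
        docs[user_id].extend(text.lower().split())
    return dict(docs)


user_docs = build_user_documents(tweets)
user_ids = sorted(user_docs)
# The corpus is a list with one entry (document) per user.
texts = [user_docs[uid] for uid in user_ids]
```

Each entry of `texts` could then be converted with dictionary.doc2bow, giving one bag-of-words vector and therefore one topic distribution per user.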
