python - How can I get the topic scores attributed to a document in gensim LSI?

Question

I am new to Python and ML. I found a nice script ( https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/ ) that shows how to get the topic associated with each document in LDA, and I modified it so that it also works with LSI. The original code is:

import pandas as pd

# `corpus` and `data` are globals defined earlier in the script
def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Sort (topic_num, prop_topic) pairs so the dominant topic comes first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        if not row:
            continue
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = row[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
    return sent_topics_df

To use it with LSI, I changed it as follows:

def format_topics_sentences_lsi(LsiModel=None, corpus=corpus, texts=data):
    """
    Extract the dominant topic assigned to each document, its score, and the topic keywords.
    LsiModel = model to be used
    corpus = corpus to be used
    texts = original texts to be classified (for topic assignment)
    """
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for i, row in enumerate(LsiModel[corpus]):
        # Sort (topic_num, score) pairs by signed score, highest first
        row = sorted(row, key=lambda x: x[1], reverse=True)
        if not row:
            continue
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        topic_num, prop_topic = row[0]
        wp = LsiModel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])
    return sent_topics_df

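Is this the correct way to do it?
Because LSI is not based on probabilities, "Perc_Contribution" comes out above 100%. How should I interpret this number?
Apart from the script above, LSI has no get_document_topics, so which function can I use to show the highest-scoring topic?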
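
To make the second and third questions concrete, here is a minimal sketch of one possible way to read such scores. It assumes the per-document result is a list of (topic_id, score) pairs, which is the shape the loop over LsiModel[corpus] above unpacks, and that scores can be negative and unbounded. The helper name dominant_lsi_topic and the example numbers are hypothetical, and the "share" is only an ad hoc normalization of absolute scores, not a probability.

```python
def dominant_lsi_topic(doc_scores):
    """Return (topic_id, signed_score, share) for the strongest topic.

    doc_scores: list of (topic_id, score) pairs for one document
    (assumed shape of one row yielded by LsiModel[corpus]).
    """
    if not doc_scores:
        return None
    # Sum of absolute scores, used only to express a relative share
    total = sum(abs(score) for _, score in doc_scores)
    # Rank by magnitude: a large negative score is still a strong association
    topic_id, score = max(doc_scores, key=lambda pair: abs(pair[1]))
    share = abs(score) / total if total else 0.0
    return int(topic_id), score, share


# Hypothetical scores for a single document (made-up numbers)
example = [(0, 1.25), (1, -2.5), (2, 0.25)]
print(dominant_lsi_topic(example))  # (1, -2.5, 0.625)
```

Ranking by absolute value rather than by signed value is a design choice in this sketch. Sorting by the signed score, as in the function above, would pick topic 0 for this example instead.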
