python - Why did the tf-idf model of `gensim` discard terms and counts after I transformed the corpus?

Question

Why did the tf-idf model discard terms and counts after I transformed a gensim corpus?

My code:

from gensim import models

# Let's say you have a corpus made up of 4 documents.
# Each document is a list of (term_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0, doc1, doc2, doc3]

# Train a tfidf model using the corpus
tfidf = models.TfidfModel(corpus)

# Now if you print the corpus, it still remains as the flat frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus into tfidf, re-initialize the corpus
# according to the model to get the normalized frequencies.
corpus = tfidf[corpus]

for d in corpus:
    print(d)

Output:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

score 6 · Accepted Answer

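IDF is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that quotient. In your case, every document contains term0, so the IDF of term0 is log(1), which is 0. As a result, the term0 column of the document-term matrix is all zeros.

A term that appears in every document gets no weight because it carries no information for telling the documents apart. gensim leaves zero-weight entries out of its sparse output, which is why term0 disappears from every document and doc1 comes out empty. The remaining term1 shows 1.0 because, by default, each document vector is normalized to unit length.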
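To make the arithmetic concrete, here is a minimal sketch in plain Python. It is not gensim's actual code. It assumes what I believe are gensim's defaults (a base-2 logarithm for the IDF and L2 normalization of each document vector), and `tfidf_by_hand` and its `eps` cutoff are names made up for this illustration.

```python
import math

# The four documents from the question, as (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def tfidf_by_hand(corpus, eps=1e-12):
    """Hypothetical helper: recompute tf-idf weights without gensim."""
    n_docs = len(corpus)

    # Document frequency: in how many documents does each term occur?
    df = {}
    for doc in corpus:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1

    # Assumed default weighting: idf = log2(N / df).
    idf = {term_id: math.log2(n_docs / d) for term_id, d in df.items()}

    transformed = []
    for doc in corpus:
        # Weight each count by the term's IDF and drop (near-)zero weights.
        weighted = [(t, tf * idf[t]) for t, tf in doc]
        weighted = [(t, w) for t, w in weighted if abs(w) > eps]

        # Scale the surviving weights to unit length (L2 norm).
        norm = math.sqrt(sum(w * w for _t, w in weighted))
        transformed.append([(t, w / norm) for t, w in weighted])
    return idf, transformed


idf, transformed = tfidf_by_hand(corpus)
print(idf)
for doc in transformed:
    print(doc)
```

With these four documents, term 0 gets an IDF of exactly 0 and is filtered out, while term 1 gets log2(4/3), roughly 0.415. Since term 1 is then the only surviving entry in its documents, normalization scales it to about 1.0, and doc1 ends up empty, which matches the second block of output in the question.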
