The gensim dictionary object has a very nice filtering feature for removing tokens that appear in fewer than a set number of documents. However, I am trying to remove tokens that occur exactly once in the corpus. Does anyone know a quick and easy way to do this?
4 Answers
You should probably include reproducible code in the question, but I'll use the documents from your previous post. You can achieve your goal without using gensim:
from collections import defaultdict

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# count each token's frequency across the whole corpus
d = defaultdict(int)
for text in texts:
    for token in text:
        d[token] += 1

# keep only the words that appear more than once
tokens = set(key for key, value in d.items() if value > 1)
texts = [[word for word in text if word in tokens] for text in texts]
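If you need a gensim corpus afterwards, the filtered texts can go straight into a Dictionary. A minimal sketch, assuming texts is the variable from the snippet above:

from gensim import corpora

# build the dictionary and a bag-of-words corpus from the already-filtered texts
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]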
However, to add some information: beyond the method above, you may find that the gensim tutorial also has a more memory-efficient technique. I've added some print statements so you can see what is happening at each step. Your specific question is answered at the DICTERATOR step. The following answer may be overkill for your question, but if you ever need to do some topic modeling, this information is a step in the right direction.
$ cat mycorpus.txt
Human machine interface for lab abc computer applications
A survey of user opinion of computer system response time
The EPS user interface management system
System and human system engineering testing of EPS
Relation of user perceived response time to error measurement
The generation of random binary unordered trees
The intersection graph of paths in trees
Graph minors IV Widths of trees and well quasi ordering
Graph minors A survey
Run the following create_corpus.py:
#!/usr/bin/env python
from gensim import corpora, models, similarities

stoplist = set('for a of the and to in'.split())

class MyCorpus(object):
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

# TOKENIZERATOR: collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
print(dictionary)
print(dictionary.token2id)

# DICTERATOR: remove stop words and words that appear only once
stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
            if stopword in dictionary.token2id]
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)
print(dictionary)
print(dictionary.token2id)

dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
print(dictionary.token2id)

# VECTORERATOR: map each document's token frequencies to a vector
corpus_memory_friendly = MyCorpus()  # doesn't load the whole corpus into memory!
for item in corpus_memory_friendly:
    print(item)
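As a side note, depending on your gensim version, the DICTERATOR step can be shortened; this is a sketch, assuming filter_extremes and the cfs attribute exist in your version of gensim:

# drops the words that appear in only one document (the once_ids part above),
# while no_above=1.0 and keep_n=None leave everything else untouched
dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=None)

# if your gensim exposes collection frequencies (dictionary.cfs), you can
# target tokens whose total count across the whole corpus is exactly 1,
# which is what the question literally asks for:
once_ids = [tokenid for tokenid, freq in dictionary.cfs.items() if freq == 1]
dictionary.filter_tokens(once_ids)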
Good luck!
def get_term_frequency(dictionary, cutoff_freq):
    """Return a list of (term, frequency) tuples, dropping every term
    whose frequency is smaller than cutoff_freq.

    dictionary (gensim.corpora.Dictionary): corpus dictionary
    cutoff_freq (int): terms whose frequency is smaller than this will be dropped
    """
    tf = []
    for k, v in dictionary.dfs.items():
        tf.append((dictionary.get(k), v))
    return [t for t in tf if t[1] >= cutoff_freq]
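A quick usage sketch for this helper, assuming a dictionary built from the toy texts used elsewhere in this thread; with a cutoff of 2, only terms appearing in at least two documents survive:

from gensim import corpora

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time']]
dictionary = corpora.Dictionary(texts)
print(get_term_frequency(dictionary, 2))  # [('computer', 2)]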
I found this in the gensim tutorial:
from gensim import corpora, models, similarities

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

print(texts)
[['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
Basically, it iterates over a list containing the entire corpus and, whenever a word occurs only once, adds it to a list of tokens. It then iterates over every word in every document and, if that word is in the list of tokens that occur once in the corpus, removes it.
I assume this is the best way to do it; otherwise the tutorial would have mentioned something else. But I could be wrong.
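One last note: all_tokens.count(word) in the tutorial snippet makes that step quadratic in the corpus size. A one-pass frequency map gives the same result and scales much better; a minimal sketch using collections.Counter in place of the "remove words that appear only once" block:

from collections import Counter

# count every token in a single pass over the tokenized corpus
frequency = Counter(token for text in texts for token in text)
texts = [[token for token in text if frequency[token] > 1] for text in texts]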