python - Python list of n-grams with frequencies

Question

I need to get the most popular n-grams from a text. The n-grams must be from 1 to 5 words long.

I know how to get bigrams and trigrams. For example:

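import nltk

# `words` (the token list) and `filter_stops` (a predicate returning True
# for words to drop) are assumed to be defined elsewhere.
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(3)
finder.apply_word_filter(filter_stops)
matches1 = finder.nbest(bigram_measures.pmi, 20)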

However, I found out that scikit-learn can get n-grams of various lengths. For example, I can get n-grams with lengths from 1 to 5:

v = CountVectorizer(analyzer=WordNGramAnalyzer(min_n=1, max_n=5))

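But WordNGramAnalyzer is now deprecated. My question is: how can I get the N best word collocations from my text, with collocation lengths from 1 to 5? I also need a FreqList of these collocations/n-grams.

Can I do that with nltk / scikit? Do I need to combine n-grams of various lengths from one text?

For example, when using NLTK bigrams and trigrams, there are many cases where a trigram includes one of my bigrams, or where a trigram is part of a bigger 4-gram. For example:

bigram: hello my
trigram: hello my name

I know how to exclude bigrams from trigrams, but I need a better solution.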

score 20 · Accepted Answer

Update

Since scikit-learn 0.14 the format has changed to:

n_grams = CountVectorizer(ngram_range=(1, 5))

Full example:

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

from sklearn.feature_extraction.text import CountVectorizer

c_vec = CountVectorizer(ngram_range=(1, 5))

# input to fit_transform() should be an iterable with strings
ngrams = c_vec.fit_transform([test_str1, test_str2])

# needs to happen after fit_transform()
vocab = c_vec.vocabulary_

count_values = ngrams.toarray().sum(axis=0)

# output n-grams
for ng_count, ng_text in sorted([(count_values[i],k) for k,i in vocab.items()], reverse=True):
    print(ng_count, ng_text)

This outputs the following (note that the word I is removed not because it is a stopword (it isn't) but because of its length: https://stackoverflow.com/a/20743758/):

> (3, u'to')
> (3, u'from')
> (2, u'ngrams')
> (2, u'need')
> (1, u'words')
> (1, u'trigrams but need better solutions')
> (1, u'trigrams but need better')
...
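The length-based removal noted above comes from the tokenizer rather than from any stopword list. The sketch below uses only the standard library `re` module to compare `CountVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, with a looser pattern that keeps one-character tokens. Lowercasing the text by hand is an assumption made here to mimic the vectorizer's default `lowercase=True`.

```python
import re

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."

# Default token_pattern of CountVectorizer: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Looser pattern that also keeps one-character tokens such as "i", "1" and "5".
KEEP_SHORT_PATTERN = r"(?u)\b\w+\b"

text = test_str1.lower()  # mimic the vectorizer's default lowercasing
default_tokens = re.findall(DEFAULT_PATTERN, text)
all_tokens = re.findall(KEEP_SHORT_PATTERN, text)

# Tokens that only the looser pattern keeps.
dropped = [tok for tok in all_tokens if tok not in default_tokens]
print(dropped)
```

If the one-character tokens matter, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` alongside `ngram_range=(1, 5)` should keep them.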

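This should be much simpler these days, imo. You can try something like textacy, but it can come with its own complications, such as initializing a Doc, which currently does not work in v.0.6.2 as shown in their docs. If Doc initialization worked as promised, the following would work in theory (but it doesn't):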

test_str1 = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
test_str2 = "I know how to exclude bigrams from trigrams, but i need better solutions."

import textacy

# some version of the following line
doc = textacy.Doc([test_str1, test_str2])

ngrams = doc.to_bag_of_terms(ngrams={1, 5}, as_strings=True)
print(ngrams)

Old answer

WordNGramAnalyzer is indeed deprecated since scikit-learn 0.11. Creating n-grams and getting term frequencies is now combined in sklearn.feature_extraction.text.CountVectorizer. You can create all n-grams ranging from 1 to 5 as follows:

n_grams = CountVectorizer(min_n=1, max_n=5)

You can find more examples and information in scikit-learn's documentation on text feature extraction.

score 8 · Accepted Answer

If you want to generate the raw n-grams (and perhaps count them yourself), there is also nltk.util.ngrams(sequence, n). It generates a sequence of n-grams for any value of n and has options for padding; see the documentation.
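For the "count them yourself" route, here is a minimal standard-library sketch that builds every n-gram of length 1 to 5 and tallies it with `collections.Counter`. The helper names `ngrams` and `count_ngrams` and the simple `\w+` tokenizer are assumptions for illustration; the local `ngrams` helper stands in for `nltk.util.ngrams` without padding, and n-grams are allowed to cross sentence boundaries.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (no padding)."""
    return zip(*(tokens[i:] for i in range(n)))

def count_ngrams(text, min_n=1, max_n=5):
    """Count all n-grams with min_n <= n <= max_n in lowercased text."""
    tokens = re.findall(r"\w+", text.lower())  # hypothetical simple tokenizer
    counts = Counter()
    for n in range(min_n, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

text = "I need to get most popular ngrams from text. Ngrams length must be from 1 to 5 words."
counts = count_ngrams(text)

# The N most frequent n-grams together with their frequencies.
for gram, freq in counts.most_common(5):
    print(freq, " ".join(gram))
```

`Counter.most_common(N)` covers both needs from the question: the N best n-grams and a frequency list for them.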

score 4 · Accepted Answer

Looking at http://nltk.org/_modules/nltk/util.html, I think that under the hood nltk.util.bigrams() and nltk.util.trigrams() are implemented using nltk.util.ngrams().
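None of the answers covers the question's last point, dropping a shorter n-gram when a longer one already contains it. One possible sketch, given n-gram counts keyed by token tuples, is below. The rule used here (drop an n-gram only if a longer n-gram contains it and has the same frequency) is an assumption, not something taken from nltk or scikit-learn, and the helper names are hypothetical.

```python
from collections import Counter

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def drop_subsumed(counts):
    """Keep only n-grams not covered by a longer n-gram of equal frequency."""
    kept = {}
    for gram, freq in counts.items():
        subsumed = any(
            len(other) > len(gram) and other_freq == freq and contains(other, gram)
            for other, other_freq in counts.items()
        )
        if not subsumed:
            kept[gram] = freq
    return kept

counts = Counter({
    ("hello", "my"): 3,          # always appears inside "hello my name"
    ("hello", "my", "name"): 3,
    ("my", "name"): 4,           # also appears elsewhere, so it is kept
})
filtered = drop_subsumed(counts)
print(filtered)
```

The comparison is quadratic in the number of distinct n-grams, so for large vocabularies it may be worth filtering by a minimum frequency first.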
