python - NLTK - Counting the frequency of bigrams

Question

This is a Python and NLTK newbie question.

I want to find the frequency of bigrams that occur together more than 10 times and have the highest PMI.

For this, I am working with this code:

def get_list_phrases(text):

    tweet_phrases = []

    for tweet in text:
        tweet_words = tweet.split()
        tweet_phrases.extend(tweet_words)


    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweet_phrases,window_size = 13)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.pmi,20)  

    for k,v in finder.ngram_fd.items():
      print(k,v)

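However, this does not restrict the results to the top 20. I see results with a frequency of less than 10. I am new to the world of Python.

Can someone please point out how to modify this to get only the top 20?

Thank you.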

score 24 · Accepted Answer

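The problem is with the way you are trying to use apply_freq_filter. We are talking about word collocations and, as you know, a word collocation is about dependency between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to delete some word collocations entirely, but to provide a filtered list of collocations when other functions try to access the list.

Why is that? Imagine that filtering collocations simply deleted them. Many probability measures, such as the likelihood ratio or PMI itself (which compute the probability of a word relative to the other words in a corpus), would no longer work properly after words had been deleted from arbitrary positions in the given corpus. Deleting some collocations from the given list of words would disable many potential functionalities and computations. Also, computing all of these measures before the deletion would bring a massive computational overhead that the user might not need after all.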
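
To make the idea concrete, here is a minimal pure-Python sketch of applying a frequency filter at scoring time. It is not NLTK's actual implementation: the names pmi_scores and min_freq are made up for illustration, and only adjacent word pairs are counted. The word and bigram counts over the whole corpus stay intact, so the PMI of every surviving bigram is still computed from the full counts:

```python
import math
from collections import Counter


def pmi_scores(words, min_freq=1):
    """Score adjacent bigrams by PMI, skipping those seen fewer than min_freq times."""
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    total = len(words)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # left out of the result; the counts themselves are untouched
        # PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), estimated from raw counts
        scores[(w1, w2)] = math.log2(count * total / (word_counts[w1] * word_counts[w2]))
    # Highest PMI first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


words = "a b a b c a b d".split()
all_scores = dict(pmi_scores(words))
filtered = dict(pmi_scores(words, min_freq=2))
```

In this sketch the score of a surviving bigram is the same with or without the filter, because the filter only decides which entries appear in the result.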

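Now, the question is how to use the apply_freq_filter function correctly. There are a few ways. Below I show the problem and its solution.

Let's define a sample corpus and split it into a list of words, similar to what you have done: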

import nltk
from nltk.collocations import BigramCollocationFinder

tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"

For the purpose of experimenting, I set the window size to 3:

finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)

Notice that, for the sake of comparison, I apply the filter only to finder1:

finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()

Now if I write:

for k,v in finder.ngram_fd.items():
  print(k,v)

The output is:

(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)

If I write the same loop for finder1, I get the same result. So at first glance the filter doesn't work. However, see how it has worked: the trick is to use score_ngrams.

If I use score_ngrams on finder, it would be:

finder.score_ngrams(bigram_measures.pmi)

and the output is:

[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 
'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]

Now notice what happens when I compute the same for finder1, which was filtered to a frequency of 2:

finder1.score_ngrams(bigram_measures.pmi)

and the output:

[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]

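Notice that none of the collocations with a frequency of less than 2 appear in this list, which is exactly the result you were looking for. So the filter has worked. The documentation also gives a minimal hint about this issue.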
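
Applied to the code in the question, this suggests a sketch along the following lines. It assumes NLTK is installed and that tweets is a list of strings; the function name top_pmi_bigrams and its parameters are my own, not part of NLTK. The key points are to keep the return value of nbest (the code in the question discarded it) and to read the results from there rather than from finder.ngram_fd. For simplicity it uses the default window size of 2 and a small threshold on toy data:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder


def top_pmi_bigrams(tweets, min_freq=10, top_n=20):
    """Return up to top_n adjacent bigrams with frequency >= min_freq, best PMI first."""
    words = []
    for tweet in tweets:
        words.extend(tweet.split())

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)  # default window_size=2
    finder.apply_freq_filter(min_freq)
    # nbest returns the list; it does not modify the finder in place
    return finder.nbest(bigram_measures.pmi, top_n)


tweets = [
    "I love iphone . I am so in love with iphone .",
    "iphone is great . samsung is great .",
    "samsung is better than apple",
]
top = top_pmi_bigrams(tweets, min_freq=2, top_n=3)
```

With real data you would call top_pmi_bigrams(text) and keep the defaults of 10 and 20 from the question.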

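I hope this has answered your question. Otherwise, please let me know.

Disclaimer: if you are primarily dealing with tweets, a window size of 13 is far too big. As you may have noticed, the sample tweets in my corpus are so short that applying a window size of 13 can find collocations that are irrelevant.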
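
To see why a large window inflates the candidate list, the following sketch counts how many word pairs a window produces. It is a plain-Python illustration of the idea, assuming each word is paired with the words that follow it inside the window; windowed_pairs is a hypothetical helper, not an NLTK function:

```python
def windowed_pairs(words, window_size=2):
    """Pair each word with the (window_size - 1) words that follow it."""
    pairs = []
    for i, left in enumerate(words):
        for right in words[i + 1 : i + window_size]:
            pairs.append((left, right))
    return pairs


tweet = "samsung can never beat iphone".split()
adjacent = windowed_pairs(tweet, window_size=2)
wide = windowed_pairs(tweet, window_size=13)
```

With a window of 13, every word in this five-word tweet is paired with every later word, so wide contains pairs such as ('samsung', 'iphone') that say little about which words actually belong together.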

score -2 · Accepted Answer

Go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html for more usage of the collocation functions in NLTK, and see https://en.wikipedia.org/wiki/Pointwise_mutual_information for the math. Since the code in your question didn't specify what the input is, I hope the following script helps.

# This is just a fancy way to create a document.
# I assume you have your texts in a continuous string format
# where each sentence ends with a full stop.
>>> from itertools import chain
>>> docs = ["this is a sentence", "this is a foo bar", "you are a foo bar", "yes , i am"]
>>> texts = list(chain.from_iterable((doc + " .").split() for doc in docs))

# This is the NLTK part
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(texts)
# This gets the top 20 bigrams according to PMI
>>> finder.nbest(bigram_measures.pmi,20)
[(',', 'i'), ('i', 'am'), ('yes', ','), ('you', 'are'), ('foo', 'bar'), ('this', 'is'), ('a', 'foo'), ('is', 'a'), ('a', 'sentence'), ('are', 'a'), ('bar', '.'), ('.', 'yes'), ('.', 'you'), ('am', '.'), ('sentence', '.'), ('.', 'this')]

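PMI measures the association of two words by calculating log ( p(x|y) / p(x) ), so it is not only about how frequently a word occurs or how often a set of words occurs together. To achieve a high PMI, you need both:

high p(x|y)
low p(x)

Here are some extreme PMI examples.

Suppose you have 100 words in the corpus, and a certain word X has frequency 1 and occurs with another word Y only that one time. Then: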

p(x|y) = 1
p(x) = 1/100
PMI = log(1 / (1/100)) = log 100 = 2   (base-10 log)

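Suppose you have 100 words in the corpus, and a certain word X has frequency 90 but never occurs with another word Y. Then the PMI is:

p(x|y) = 0
p(x) = 90/100
PMI = log(0 / (90/100)) = log 0 = -infinity

In that sense, the PMI between X and Y is far higher in the first scenario than in the second, even though the word's frequency in the second scenario is very high.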
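
The two scenarios can be checked with a few lines of Python. This is only a sketch of the formula log ( p(x|y) / p(x) ) given above: the helper name pmi is mine, and a base-10 logarithm is assumed so that the first scenario evaluates to log10(100) = 2.

```python
import math


def pmi(p_x_given_y, p_x):
    """PMI = log10(p(x|y) / p(x)); returns -inf when x never occurs with y."""
    if p_x_given_y == 0:
        return -math.inf
    return math.log10(p_x_given_y / p_x)


rare_but_exclusive = pmi(1, 1 / 100)    # X occurs once, and only together with Y
frequent_but_apart = pmi(0, 90 / 100)   # X occurs 90 times, never with Y
```

The rare word that always appears with Y scores far higher than the frequent word that never does, which is the point of the comparison.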
