python - Cumulative frequency, Ngram

Question

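Quick question here: when I run the code below, I get a list of bigram frequencies for each list (sentence) in the corpus.

I'd like to display and keep track of a running total, so that the frequencies are counted over the whole corpus. Right now every frequency comes out as 1 or 2, because each index (a single sentence) is so small.

After that, I essentially need to generate text from those frequencies that models the original corpus.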

   #---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project

#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown

#---------------------------------------------------------

#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'

#---------------------------------------------------------


#This function will take in the corpus one line at a time
#After searching through and adding a <s> to the beginning of each list item, it also replaces a final period with </s>
def alter_list(corpus_list):
    #Simply check for an instance of a period, and if so, replace with '</s>'
    if corpus_list[-1] == '.':
        corpus_list[-1] = '</s>'
        #strip() returns a copy without leading/trailing whitespace such as '\n' (the result is not assigned here)
        corpus_list[-1].strip()
    #Else add to the end of the list item
    else:
        corpus_list.append('</s>')
    return ['<s>'] + corpus_list

#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if(user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [",user,"] rows of the corpus : ", '\n' 
    for corpus_list in news[:user]:
       print(alter_list(corpus_list),'\n')
#Non positive number catch
else:
    print "Fine I Won't Show You Any... ",'\n'

#---------------------------------------------------------

print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0

#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information
#Displayed to the user
while(count < user2):
    passer = news[count]

    def ngrams(passer, n = 2, padding = True):
        #Padding refers to the same idea demonstrated above, that is bump the first word to the second, making
        #'None' the first item in each list so that calculations of frequencies can be made 
        pad = [] if not padding else [None]*(n-1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

    #In this case, arguments are first: n-gram type (bi, tri, quad)
    #Followed by in our case the addition of 'padding'
    #Padding is used in every case here because we need it for calculations
    #This function structure allows us to pull in corpus parts without the added annotations if need be
    for size, padding in ((1,1), (2,1), (3, 1), (4, 1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))

    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1

    print ("======================================================================================")
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram

    print '\nFrequencies Of Trigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram

    count = count + 1

 #---------------------------------------------------------

score 1 · Accepted Answer

I'm not sure I understand the question. nltk has a generate function. The book that nltk grew out of is available online:

http://nltk.org/book/ch01.html

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)

>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she

score 1 · Accepted Answer

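The problem is that you define the `counts` dict anew for each sentence, so the ngram counts get reset to zero every time. Define it above the while loop and the counts will accumulate over the entire Brown corpus.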
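To illustrate the point about where the counter is defined, here is a minimal sketch. It is written for Python 3 (the question's code is Python 2) and uses `collections.Counter` in place of the `defaultdict`. The three toy sentences are an assumption standing in for `brown.sents(categories='editorial')` so the sketch runs without the corpus, and `running_totals` is a hypothetical name for the running tally the question asks about.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Yield successive n-tuples from a list of tokens (no padding)."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy stand-in for the annotated Brown sentences (an assumption,
# so the sketch runs without downloading the corpus).
sentences = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
    ['<s>', 'the', 'dog', 'sat', '</s>'],
]

# Defined once, above the loop, so counts survive from sentence to sentence.
bigram_counts = Counter()
running_totals = []  # total number of bigrams seen after each sentence

for sentence in sentences:
    bigram_counts.update(ngrams(sentence, 2))
    running_totals.append(sum(bigram_counts.values()))

# Show the most frequent bigrams over the whole (toy) corpus.
for gram, c in bigram_counts.most_common(3):
    print(c, gram)
print("Running totals:", running_totals)
```

Because `bigram_counts` lives outside the loop, `('<s>', 'the')` ends up with a count of 3 rather than being reset to 1 for each sentence.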

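Bonus advice: you should also move the definition of `ngrams` outside the loop. Defining the same function over and over is pointless (though it does no harm beyond the performance cost). Better yet, use nltk's `ngrams` function and read about `FreqDist`, which is like a dict counter on steroids. It will come in handy when you start working on generating statistical text.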
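As a rough idea of how cumulative bigram counts can drive text generation, here is a standard-library sketch in Python 3. It is not nltk's own `generate`. `build_model` and `generate` are hypothetical helpers, the toy sentences again stand in for the annotated corpus, and the fixed seed is only there to make runs repeatable. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so it could take the place of the `Counter` used here.

```python
import random
from collections import Counter, defaultdict

def build_model(sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        for prev, nxt in zip(sentence, sentence[1:]):
            model[prev][nxt] += 1
    return model

def generate(model, rng, max_words=20):
    """Walk the bigram model from <s> until </s> or max_words is reached."""
    word, out = '<s>', []
    while len(out) < max_words:
        followers = model.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        # Pick the next word with probability proportional to its count.
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == '</s>':
            break
        out.append(word)
    return out

# Toy stand-in for the <s> ... </s> annotated corpus (an assumption).
sentences = [
    ['<s>', 'the', 'cat', 'sat', '</s>'],
    ['<s>', 'the', 'cat', 'ran', '</s>'],
    ['<s>', 'the', 'dog', 'sat', '</s>'],
]

model = build_model(sentences)
text = generate(model, random.Random(0))  # seed chosen arbitrarily
print(' '.join(text))
```

Every adjacent pair of words in the output is a bigram that occurred in the input, and more frequent bigrams are chosen more often, which is the sense in which the generated text models the original corpus.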
