I'm attempting to write a script that counts the number of occurrences of a given list of tags in a collection of files. So far I have the following:
    import nltk
    from collections import defaultdict
    from nltk.tokenize import wordpunct_tokenize

    tags3 = []
    for text in posts:
        words = wordpunct_tokenize(text)
        tags = nltk.pos_tag(words)
        list_tags = defaultdict(int)
        for a, b in tags:
            tags3.append(b)
        for t in tags3:
            if t in tags_list:
                list_tags[t] += 1
        print list_tags
The problem is that the program does not discard the tags found in the previous posts, so the counts just keep accumulating from post to post. By the last post it claims to have found 70,000 occurrences of a given tag in a post of only 500 words.
Does anyone have an idea what I am doing wrong?
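For comparison, here is a minimal sketch of the per-post counting I'm aiming for, with the tag list built fresh inside the loop and `collections.Counter` doing the tallying. The `tag_post` function and the sample `posts` are stand-ins for illustration only; in the real script the tags come from `nltk.pos_tag(wordpunct_tokenize(text))`:

```python
from collections import Counter

def tag_post(text):
    # Hypothetical stand-in for nltk.pos_tag(wordpunct_tokenize(text)):
    # tags capitalized words "NN" and everything else "VB".
    return [(w, "NN" if w.istitle() else "VB") for w in text.split()]

tags_list = {"NN", "VB"}          # the tags we want to count
posts = ["Alice runs", "Bob walks quickly"]

for text in posts:
    tagged = tag_post(text)
    # Build the list of tags *inside* the loop, so it starts
    # empty for every post instead of carrying over old tags.
    post_tags = [tag for _word, tag in tagged]
    counts = Counter(t for t in post_tags if t in tags_list)
    print(dict(counts))
```

Each post now gets its own `Counter`, so the totals reflect only that post.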