python - Counting (and writing) word frequencies for each line within a text file

Question

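First time posting on Stack - I have always found previous questions that were enough to solve my problem! The main problem I have is the logic... even a pseudocode answer would be great.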

I'm using Python to read in data from each line of a text file, in the following format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using nltk, I can tokenize by line and then iterate over the lines with reader.sents():

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

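However, I would like to count the frequency of certain "hot words" (stored in an array or similar) per line, and then write the counts back to a text file. If I used reader.words(), I could count the frequency of the "hot words" across the whole text, but I'm looking for the count per line (a "sentence" in this case).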

Ideally, something like:

hotwords = (['tweet'], ['twitter'])

for each line
     tokenize into words.
     for each word in line 
         if word is equal to hotword[1], hotword1 count ++
         if word is equal to hotword[2], hotword2 count ++
     at end of line, for each hotword[index]
         filewrite count,

Also, I'm not too worried about the URL getting broken up (using WordPunctTokenizer would strip out the punctuation - that's not an issue).

Any useful pointers (including pseudocode or links to other similar code) would be great.

- - EDIT - - - - - - - - -

I ended up doing something like this:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all txt files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print("Reader accessible.")
print(filereader.fileids())

#define hotwords
hotwords = ('cool','foo','bar')

tweetdict = []

for line in filereader.sents():
    wordcounts = defaultdict(int)  # fresh counts for each line
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output looks like this:

print(tweetdict)

[defaultdict(<type 'int'>, {}),
 defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
 defaultdict(<type 'int'>, {'cool': 1})]

score 4 · Accepted Answer

from collections import Counter

hotwords = ('tweet', 'twitter')

lines = "a b c tweet d e f\ng h i j k   twitter\n\na"

c = Counter(lines.split())

for hotword in hotwords:
    print('%s %s' % (hotword, c[hotword]))

This script works on Python 2.7 and later (collections.Counter was added in 2.7).
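
Note that the snippet above counts over the whole text, while the question asks for counts per line. A minimal per-line variant might look like the sketch below. It assumes plain whitespace tokenization via str.split(), and count_hotwords_per_line is a hypothetical helper name used only for illustration, not something from the original answer.

```python
from collections import Counter

hotwords = ('tweet', 'twitter')
lines = "a b c tweet d e f\ng h i j k   twitter\n\na"


def count_hotwords_per_line(text, hotwords):
    """Return one {hotword: count} dict per line of text (hypothetical helper)."""
    results = []
    for line in text.splitlines():
        counts = Counter(line.split())  # assumes whitespace tokenization
        # Counter returns 0 for missing keys, so every hotword gets an entry.
        results.append({word: counts[word] for word in hotwords})
    return results


per_line = count_hotwords_per_line(lines, hotwords)
for line_number, counts in enumerate(per_line, start=1):
    print(line_number, counts)
```

Building a fresh Counter for each line keeps the counts independent, so empty lines simply report zero for every hot word.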

score 1 · Accepted Answer

defaultdict is your friend for this sort of thing.

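from collections import defaultdict

# `myfile` is an open file object; `hotwords` is the collection from the question.
for line in myfile:
    words = line.split()  # tokenize (whitespace split; swap in your tokenizer of choice)
    word_counts = defaultdict(int)  # fresh counts for every line
    for word in words:
        if word in hotwords:
            word_counts[word] += 1
    print('\n'.join('%s: %s' % (k, v) for k, v in word_counts.items()))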
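
For the "write back to a text file" part of the question, one option is to emit one CSV row of counts per input line. The sketch below is only an illustration under a few assumptions: whitespace tokenization, a hypothetical helper named write_hotword_counts, made-up sample lines, and an in-memory buffer standing in for a real output file (counts.csv in the comment is a placeholder name).

```python
import csv
import io
from collections import defaultdict


def write_hotword_counts(lines, hotwords, out_file):
    """Write one CSV row of hotword counts per input line (hypothetical helper)."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(hotwords)  # header row: one column per hotword
    for line in lines:
        word_counts = defaultdict(int)
        for word in line.split():  # assumes whitespace tokenization
            if word in hotwords:
                word_counts[word] += 1
        # .get() reads the count without adding missing keys to the defaultdict.
        writer.writerow([word_counts.get(word, 0) for word in hotwords])


# In-memory demo; for a real file, pass open('counts.csv', 'w', newline='') instead.
sample = ["this tweet is a tweet from twitter", "nothing here", "twitter twitter"]
buffer = io.StringIO()
write_hotword_counts(sample, ('tweet', 'twitter'), buffer)
output = buffer.getvalue()
print(output)
```

Writing a fixed column per hot word, including zeros, keeps every row the same width, which makes the file easier to load elsewhere.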

score 0 · Accepted Answer

Do you need to tokenize at all? You can use count() on each line for each of your words.

hotwords = {'tweet':[], 'twitter':[]}
for line in file_obj:
    for word in hotwords.keys():
        hotwords[word].append(line.count(word))
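
One caveat with this approach: str.count() matches substrings, so 'tweet' is also counted inside 'tweets' or 'retweet'. If whole-word matches are what you want, a regular expression with word boundaries is one possible workaround. The sketch below uses made-up sample lines and keeps both tallies side by side for comparison.

```python
import re

hotwords = ('tweet', 'twitter')
# Made-up sample lines for illustration only.
lines = [
    "a tweet about retweets and tweets",
    "twitter loves twitter",
]

# Pre-compile one whole-word pattern per hotword; re.escape guards special characters.
patterns = {word: re.compile(r'\b' + re.escape(word) + r'\b') for word in hotwords}

counts = {word: [] for word in hotwords}
substring_counts = {word: [] for word in hotwords}
for line in lines:
    for word in hotwords:
        counts[word].append(len(patterns[word].findall(line)))
        substring_counts[word].append(line.count(word))  # plain count(), for comparison

print(counts)
print(substring_counts)
```

On the first sample line the whole-word pattern finds 'tweet' once, whereas count() reports three hits because it also matches inside 'retweets' and 'tweets'.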
