python - Word frequency calculation across multiple files

Question

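I am writing code to count how often words occur in a document made up of about 20,000 files. I can already get the overall frequency of each word across the whole document. My code so far is as follows: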

import os
import re
import sys
sys.stdout=open('f2.txt','w')
from collections import Counter
from glob import iglob

def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    return text

folderpath='d:/articles-words'
counter=Counter()

for filepath in iglob(os.path.join(folderpath,'*.txt')):
    with open(filepath,'r') as filehandle:
        counter.update(removegarbage(filehandle.read()).split())

for word,count in counter.most_common():
    print('{}  {}'.format(word,count))

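However, I want to change my counter so that it is updated only once per file. That is, each file should contribute either 0 or 1 to a word's count, depending on whether the word occurs in that file. For example, the word "little" occurs 3 times in file1 and 8 times in file45, so its count should be 2, not 11, but my current code gives 11.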

score 4 · Accepted Answer

Use sets:

for filepath in iglob(os.path.join(folderpath,'*.txt')):
    with open(filepath,'r') as filehandle:
        words = set(removegarbage(filehandle.read()).split()) 
        counter.update(words)

A set contains only unique values, so each word is passed to the counter at most once per file:

>>> strs = "foo bat foo"
>>> set(strs.split())
set(['bat', 'foo'])

Example using collections.Counter:

>>> from collections import Counter
>>> c = Counter()
>>> strs = "foo bat foo"
>>> c.update(set(strs.split()))
>>> strs = "foo spam foo"
>>> c.update(set(strs.split()))
>>> c
Counter({'foo': 2, 'bat': 1, 'spam': 1})
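
If you want both numbers at once, the idea above can be wrapped in a small function. This is only a sketch, not the asker's exact setup: the names `normalize` and `count_words` are made up for illustration, and it assumes the files are UTF-8 text (the original code relied on the platform default encoding).

```python
import re
from collections import Counter
from pathlib import Path


def normalize(text):
    """Replace runs of non-word characters with spaces and lowercase the text."""
    return re.sub(r'\W+', ' ', text).lower()


def count_words(folder, pattern='*.txt'):
    """Return (total_counts, file_counts) for the files in folder matching pattern.

    total_counts: how many times each word occurs overall.
    file_counts:  in how many files each word occurs at least once.
    """
    total_counts = Counter()
    file_counts = Counter()
    for path in Path(folder).glob(pattern):
        # Assumption: files are UTF-8; change the encoding if yours differ.
        words = normalize(path.read_text(encoding='utf-8')).split()
        total_counts.update(words)       # every occurrence counts
        file_counts.update(set(words))   # each word counts once per file
    return total_counts, file_counts
```

Calling `total, per_file = count_words('d:/articles-words')` and then looping over `per_file.most_common()` should give the per-file counts the question asks for, while `total` keeps the original behaviour.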
