python - How do I find word frequencies across multiple files without counting duplicates?

Question

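I am trying to find the frequency of words across multiple files in a folder. If a word is found in a file, its count should go up by 1, no matter how many times it occurs in that file. For example, if the line "all's well that ends well" is read from file 1, the count of "well" should be incremented by 1, not by 2. If "well" then also appears in a second file, its total becomes 2.

I need to increment the counter without including duplicates, but my program does not take that into account. Please help!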

import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob

def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    sorted(text)
    return text

def removeduplicates(l):
    return list(set(l))


folderpath='d:/articles-words'
counter=Counter()


filepaths = glob(os.path.join(folderpath,'*.txt'))

num_files = len(filepaths)

# Add all words to counter
for filepath in filepaths:
    with open(filepath,'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        cwords=removeduplicates(words)
        counter.update(cwords)

# Display most common
for word, count in counter.most_common():

    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1*num_files:
        break
    print('{}  {}'.format(word,count))

I tried sorting and removing duplicates, but it still doesn't work!

score 0 · Accepted Answer

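If I understand your problem correctly, you basically want to know, for each word, in how many of the files it appears (regardless of whether the same word occurs more than once in the same file). To do this, I wrote the following scheme, which simulates a list of many files. I only cared about the process, not the files themselves, so you may need to replace "files" with the actual list you want to process.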

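# Simulated input (an assumption): each "file" is a list of lines.
# Replace this with the real file objects you want to process.
files = [["all's well that ends well"], ["she is not well"]]

d = {}  # word -> {file index: 1}
for i, f in enumerate(files, start=1):
    for line in f:
        words = line.split()
        for word in words:
            if word not in d:
                d[word] = {}
            # A repeated word in the same file just rewrites the same key.
            d[word][i] = 1

# Number of files each word appears in (items() replaces Python 2's iteritems()).
d2 = {}
for word, occurrences in d.items():
    d2[word] = sum(occurrences.values())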

The result looks like this: {'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
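
For reading real files from a folder, the same per-file counting can be sketched more compactly with a set and collections.Counter. This is only a minimal sketch under a few assumptions: the function name document_frequency, the "*.txt" pattern, the UTF-8 encoding, and the tokenizing regex are my own choices, not something given in the question.

```python
import re
from collections import Counter
from pathlib import Path


def document_frequency(folder, pattern="*.txt"):
    """Return (counts, num_files): for each word, the number of files containing it."""
    counts = Counter()
    num_files = 0
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: the files are UTF-8 text; adjust the encoding if needed.
        text = path.read_text(encoding="utf-8").lower()
        # Assumption: apostrophes are kept so that "all's" stays one token
        # (the question's \W+ substitution would split it into "all" and "s").
        words = set(re.findall(r"[a-z0-9']+", text))
        # Updating with a set adds exactly 1 per distinct word for this file.
        counts.update(words)
        num_files += 1
    return counts, num_files
```

Calling document_frequency('d:/articles-words') would return the counter together with the number of files read, so the 0.1 * num_files cutoff from the question can still be applied while looping over counts.most_common().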
