python - Find and count the frequency of known word pairs in multiple files

Question

Basically, I need to count occurrences of word pairs across multiple files. I have a list of word pairs in a file called result.txt, which looks like this:

of the
by the
they are
grouping the

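I want to check the frequency of these pairs across many text files in a given directory, and print each pair with its corresponding frequency in descending order. The output should be in the following format: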

grouping the 205
they are 180
of the 56

I have already tried the following:

import os
import re
from collections import Counter
from glob import iglob
from collections import defaultdict
import itertools as it

folderpath = 'path/to/directory'
pairs=defaultdict(int)

logfile = open('result.txt', 'r')
loglist = logfile.readlines()
logfile.close()
found = False
for line in loglist:
    for filepath in iglob(os.path.join(folderpath,'*.txt')):
        with open(filepath,'r') as filehandle:
            for pair in it.combinations(re.findall('\w+',line),2):
                pairs[tuple(pair)]+=1
    found=True                    
resultList=[pair+(occurences, ) for pair, occurences in pairs.iterkeys()]

But I am not getting the expected result. Any help would be appreciated!

score 1 · Accepted Answer

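Using combinations() gives you every pair of words on a line, including pairs that are not adjacent. You can instead write a function that returns only adjacent pairs. I tried the following code and it worked for me; perhaps it will give you some insight.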

import os
import re
from collections import Counter

def pairs(text):
    # Return each pair of adjacent words in the text.
    words = re.findall(r'[A-Za-z]+', text)
    return zip(words, words[1:])

# Known pairs: the last two words on each line of result.txt.
with open('result.txt') as pairfile:
    mypairs = {tuple(line.split()[-2:]) for line in pairfile}

c = Counter()
folderpath = 'path/to/directory'
for dirpath, dnames, fnames in os.walk(folderpath):
    for f in fnames:
        if not f.endswith('.txt'):
            continue
        with open(os.path.join(dirpath, f)) as textfile:
            for line in textfile:
                c.update(p for p in pairs(line) if p in mypairs)

# most_common() lists the pairs in descending order of frequency.
for item in c.most_common():
    print(item)
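
To see the difference on a small input, here is a minimal sketch for Python 3. The sample lines and the `known` set are made up for illustration, and `adjacent_pairs` is just a hypothetical name for the same idea as `pairs()` above. It also builds the output in the `word word count` format the question asks for.

```python
import itertools
import re
from collections import Counter

def adjacent_pairs(text):
    """Return the consecutive word pairs in text, in order."""
    words = re.findall(r'[A-Za-z]+', text)
    return list(zip(words, words[1:]))

# Made-up sample input standing in for the lines of the text files.
lines = ['they are grouping the results', 'they are here']

# combinations() pairs every word with every later word, adjacent or not.
all_pairs = list(itertools.combinations(lines[0].split(), 2))
# adjacent_pairs() keeps only neighbouring words.
neighbours = adjacent_pairs(lines[0])

# Hypothetical known pairs, as they would be read from result.txt.
known = {('they', 'are'), ('grouping', 'the')}

counts = Counter()
for line in lines:
    counts.update(p for p in adjacent_pairs(line) if p in known)

# Format as "word1 word2 count", most frequent first.
report = ['%s %s %d' % (a, b, n) for (a, b), n in counts.most_common()]
for row in report:
    print(row)
```

For a five-word line, combinations() produces ten pairs while only four are adjacent, which is why the counts in the question come out inflated.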
