python - Python: Counting unique instances of words across several lines

Question

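I have a text file containing several observations. Each observation is on one line. I want to detect the unique occurrences of each word within a line; in other words, if the same word occurs twice or more on the same line, it is still counted once. However, I want to count how often each word occurs across all observations. That is, if a word occurs on two or more lines, I want to count the number of lines it occurred on. Here is the program I wrote, and it is really slow when processing a large number of files. I also remove certain words from the file by referencing another file. Please offer suggestions on how to improve the speed. Thank you.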

import re, string
from itertools import chain, tee, izip
from collections import defaultdict

def count_words(in_file="",del_file="",out_file=""):

    d_list = re.split('\n', file(del_file).read().lower())
    d_list = [x.strip(' ') for x in d_list] 

    dict2={}
    f1 = open(in_file,'r')
    lines = map(string.strip,map(str.lower,f1.readlines()))

    for line in lines:
        dict1={}
        new_list = []
        for char in line:
            new_list.append(re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", char))
        s=''.join(new_list)
        for word in d_list:
            s = s.replace(word,"")
        for word in s.split():
            try:
                dict1[word]=1
            except:
                dict1[word]=1
        for word in dict1.keys():
            try:
                dict2[word] += 1
            except:
                dict2[word] = 1
    freq_list = dict2.items()
    freq_list.sort()
    f1.close()

    word_count_handle = open(out_file,'w+')
    for word, freq  in freq_list:
        print>>word_count_handle,word, freq
    word_count_handle.close()
    return dict2

dict = count_words("in_file.txt","delete_words.txt","out_file.txt")

score 1 · Accepted Answer

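Without having done any performance testing, the following come to mind:

1) You are using regular expressions - why? Are you just trying to get rid of certain characters?

2) You are using exceptions for flow control - although it can be Pythonic (better to ask forgiveness than permission), throwing exceptions can often be slow. As seen here: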

    for word in dict1.keys():
        try:
            dict2[word] += 1
        except:
            dict2[word] = 1

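3) Turn d_list into a set, and use Python's in to test for membership, and simultaneously ...

4) Avoid heavy use of the replace method on strings - I believe you are using it to filter out the words that appear in d_list. You can avoid replace altogether and filter the words in the line instead, either with a list comprehension: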

[word for word in words if word not in del_words]

or with a filter (not very Pythonic):

filter(lambda word: not word in del_words, words)
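
Points 2) to 4) could be combined roughly as in the sketch below. It is written for Python 3 (the question's code is Python 2), and the names count_lines_per_word, lines and del_words are made up for illustration. One assumption to be aware of: it drops whole words only, whereas s.replace(word, "") in the question also removes matches inside longer words.

```python
from collections import defaultdict


def count_lines_per_word(lines, del_words):
    """Count, for each word, the number of lines it appears in.

    `lines` is any iterable of strings and `del_words` any iterable of
    words to ignore (both names are illustrative, not from the question).
    """
    del_words = set(del_words)   # set membership tests are O(1) on average
    counts = defaultdict(int)    # missing keys start at 0, so no try/except
    for line in lines:
        # The set collapses repeats within a line; the generator
        # expression filters out the unwanted words.
        words = set(word for word in line.lower().split()
                    if word not in del_words)
        for word in words:
            counts[word] += 1
    return dict(counts)


result = count_lines_per_word(
    ["the cat saw the dog", "The dog barked", "a bird"],
    ["the", "a"],
)
print(sorted(result.items()))
```

Because each line is reduced to a set before counting, "the cat saw the dog" contributes at most one count per word, which is the per-line uniqueness the question asks for.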

score 1 · Accepted Answer

You are running re.sub on each character of the line, one at a time. That is slow. Do it on the whole line:

s = re.sub(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]', "_", line)
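
To illustrate, here is a small Python 3 sketch that compiles the question's character class once and compares the two approaches. PUNCT, clean_per_char and clean_whole_line are illustrative names, not part of the original program. Both functions should return the same string; the second simply makes one regex call per line instead of one per character.

```python
import re

# Same character class as in the question, compiled once up front.
PUNCT = re.compile(r'[0-9#$?*_><@\(\)&;:,.!-+%=\[\]\-\/\^]')


def clean_per_char(line):
    # What the question does: one substitution call per character.
    return ''.join(PUNCT.sub('_', char) for char in line)


def clean_whole_line(line):
    # One substitution call for the whole line.
    return PUNCT.sub('_', line)


sample = "price: 10$ (approx.)"
print(clean_whole_line(sample))
print(clean_per_char(sample) == clean_whole_line(sample))
```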

Also, have a look at sets and the Counter class in the collections module. It may be faster to count everything first and discard the words you don't want afterwards.
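
A minimal sketch of that idea, assuming Python 3 and assuming the words to delete are whole words rather than substrings. The function name line_counts and its parameters are hypothetical.

```python
from collections import Counter


def line_counts(lines, del_words):
    """Count the number of lines each word appears in, then drop del_words."""
    counts = Counter()
    for line in lines:
        # A set makes each word count at most once per line.
        counts.update(set(line.lower().split()))
    for word in del_words:
        # Discard unwanted words after counting; pop() ignores missing keys.
        counts.pop(word, None)
    return counts


counts = line_counts(["a b a", "b c", "c c d"], ["d"])
print(sorted(counts.items()))
```

Deleting afterwards costs one dictionary operation per unwanted word, instead of one membership test per word per line.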

score 0 · Accepted Answer

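u_words        = set()
u_words_in_lns = []
wordcount      = {}
words          = []

# buff is assumed to hold the full text of the input file

# get unique words per line (split() with no argument also drops empty strings)
for line in buff.split('\n'):
    u_words_in_lns.append(set(line.split()))

# create a set of all unique words
# (an explicit loop, because map() is lazy in Python 3)
for line_words in u_words_in_lns:
    u_words.update(line_words)

# flatten the sets into a single list of words again
for line_words in u_words_in_lns:
    words.extend(line_words)

# count everything up; list.count matches whole words only, whereas
# re.findall(word, str(words)) would also match inside longer words
for word in u_words:
    wordcount[word] = words.count(word)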
