python - Unique word frequency across multiple files

Question

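I am new to Python. I am given a folder containing around 2000 text files. I am supposed to output each word and the number of times it occurs (without repetition within a file). For example, the sentence "i am what i am" must count "i" only once for that file.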

I am able to do this for a single file, but how do I do it for multiple files?

from collections import Counter
import re

def openfile(filename):
    fh = open(filename, "r+")
    str = fh.read()
    fh.close()
    return str

def removegarbage(str):
    # Replace one or more non-word (non-alphanumeric) chars with a space
    str = re.sub(r'\W+', ' ', str)
    str = str.lower()
    return str

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split(' ')
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print key,value

main('speech.txt', 500)

score 5 · Accepted Answer

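You can get a list of files by using the glob() or iglob() function in the glob module. I also noticed that you aren't using the Counter object efficiently. It would be much better to call its update() method and pass it the list of words. Here is a streamlined version of your code that processes all the *.txt files found in the specified folder: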

from collections import Counter
from glob import iglob
import re
import os

def remove_garbage(text):
    """Replace non-word (non-alphanumeric) chars in text with spaces,
       then convert and return a lowercase version of the result.
    """
    text = re.sub(r'\W+', ' ', text)
    text = text.lower()
    return text

topwords = 100
folderpath = 'path/to/directory'
counter = Counter()
for filepath in iglob(os.path.join(folderpath, '*.txt')):
    with open(filepath) as file:
        counter.update(remove_garbage(file.read()).split())

for word, count in counter.most_common(topwords):
    print('{}: {}'.format(count, word))
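
Note that the loop above counts every occurrence of a word. If each word should be counted at most once per file, as the question describes, one possible adjustment is to pass a set of the file's words to update() instead of a list. The following is only a minimal sketch under that assumption; the helper names unique_words and count_files_containing are hypothetical and not part of the original answer.

```python
from collections import Counter
from glob import iglob
import os
import re


def unique_words(text):
    """Return the set of distinct lowercase words in text."""
    # Same cleanup as remove_garbage(): runs of non-word chars become spaces.
    return set(re.sub(r'\W+', ' ', text).lower().split())


def count_files_containing(folderpath, pattern='*.txt'):
    """Count, for each word, how many matching files it appears in."""
    counter = Counter()
    for filepath in iglob(os.path.join(folderpath, pattern)):
        with open(filepath) as file:
            # A set holds each word once, so one file adds at most 1 per word.
            counter.update(unique_words(file.read()))
    return counter
```

With this version, a file containing "i am what i am" contributes 1 to the count for "i", and counter.most_common(topwords) can be used exactly as before.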

score 2 · Accepted Answer

See os.listdir(). It gives you a list of all the entries in a directory.

http://docs.python.org/2/library/os.html#os.listdir
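
As a rough sketch of how that might be used: os.listdir() returns bare entry names rather than full paths, so each name has to be joined with the directory before it can be opened. The helper name list_text_files and the '.txt' suffix filter are assumptions made for illustration.

```python
import os


def list_text_files(folderpath, suffix='.txt'):
    """Return sorted full paths of regular files in folderpath ending in suffix."""
    paths = []
    for name in os.listdir(folderpath):
        fullpath = os.path.join(folderpath, name)
        # Skip subdirectories and entries with other extensions.
        if os.path.isfile(fullpath) and name.endswith(suffix):
            paths.append(fullpath)
    return sorted(paths)
```

Each returned path could then be passed to a function such as the question's main(filename, topwords).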

score 0 · Accepted Answer

So you could do something like this:

#!python
from __future__ import print_function
# Your code here
# ...
#

if __name__ == '__main__':
    import sys

    top=500

    if len(sys.argv) < 2:
        print ("Must supply a list of files to operate on", file=sys.stderr)
        ## For Python versions older than 2.7 use print >> sys.stderr, "..."
        sys.exit(1)

    for each in sys.argv[1:]:
        main(each, top)

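Note that this shows a simple way to wrap your existing function so that you can call the program with any number of filename arguments. Alternatively, you could have it take a single argument pointing to a directory from which you want to process all of the files: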

#!python
from __future__ import print_function
# Your code here
# ...
# 
if __name__ == '__main__':
    import os
    import sys

    top = 500

    if len(sys.argv) != 2:
        ## Require exactly one argument: sys.argv[0] is our own executable filename
        ## (Similarly to how class methods are called with "self" as their first
        ## argument, but on the OS layer)
        print("Must supply directory name full of files to be processed", file=sys.stderr)
        sys.exit(1)
    for each in os.listdir(sys.argv[1]):
        ## os.listdir() returns bare names, so join each with the directory
        main(os.path.join(sys.argv[1], each), top)

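There are many other ways you could choose to handle arguments, hard-code default arguments, and so on. I'll leave it to your imagination how to change "top" from a hard-coded value into a command line argument. For extra credit, use an option/argument parsing module (argparse or optparse) to make it a command line switch with a default.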
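
One way the argparse suggestion might look is sketched below. The function name parse_args and the switch name --top are illustrative choices, not something the answer specifies; the default of 500 mirrors the hard-coded value above.

```python
import argparse


def parse_args(argv=None):
    """Parse a directory argument and an optional --top switch."""
    parser = argparse.ArgumentParser(
        description='Print the most common words in a directory of files.')
    # Positional argument: the directory whose files should be processed.
    parser.add_argument('directory', help='directory of files to process')
    # Optional switch replacing the hard-coded "top" value.
    parser.add_argument('--top', type=int, default=500,
                        help='number of most common words to print')
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Calling parse_args(['some/dir', '--top', '100']) yields a namespace with directory set to 'some/dir' and top set to 100; when --top is omitted, top falls back to 500.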

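Note that this if __name__ == .... business is a Python convention that promotes good programming practice by encouraging you to separate your functionality from your actions. You define all of your functionality above the if __name__ == '__main__': line, and invoke all of the actions performed by the script (using that functionality) after that line. This allows your file to be used as a module by other programs while still being useful as a program in its own right. (This is an almost unique feature of Python, though Ruby implements a similar set of semantics with slightly different syntax.)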

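The from __future__ import lets you write programs in Python 2.7 that use print semantics compatible with Python 3.x. This can be helpful if you eventually move to Python 3, and it helps discussions of the language, so that we can phase out the old "print" statement and promote use of the "print()" function. Don't sweat the details if you don't care about them. Just be aware that the difference is widespread: the large majority of the examples you'll see use the old print semantics, and going forward you should use the new, slightly incompatible semantics.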

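(Note: in my initial post I had the from __future__ import inside the __main__ section; __future__ imports must appear at the top of the file. It is easy to get tripped up by the differences between Python 2.x and Python 3 print semantics.)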
