python - Word frequency calculation for a 1 GB text file in Python

Question

I am trying to calculate word frequencies for a 1.2 GB text file containing about 203 million words. I am using the following Python code, but it fails with a MemoryError. Is there a solution for this?

Here is my code:

import re
# this one in honor of 4th July, or pick text file you have!!!!!!!
filename = 'inputfile.txt'
# create list of lower case words, \s+ --> match any whitespace(s)
# you can replace file(filename).read() with given string
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation marks to be removed
punctuation = re.compile(r'[.?!,":;]') 
for word in word_list:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] += 1
    except: 
        freq_dic[word] = 1

print 'Unique words:', len(freq_dic)
# create list of (key, val) tuple pairs
freq_list = freq_dic.items()
# sort by key or word
freq_list.sort()
# display result
for word, freq in freq_list:
    print word, freq

And this is the error I get:

Traceback (most recent call last):
  File "count.py", line 6, in <module>
    word_list = re.split('\s+', file(filename).read().lower())
  File "/usr/lib/python2.7/re.py", line 167, in split
    return _compile(pattern, flags).split(string, maxsplit)
MemoryError

score 16 · Accepted Answer

The problem starts right here:

file(filename).read()

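This reads the entire file into a single string, and re.split then builds a list of every word on top of that. If you instead process the file line by line or in chunks, only a small part of the file is held in memory at any time and you will not run into the memory problem.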

with open(filename) as f:
    for line in f:
        pass  # handle one line at a time here
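
Iterating line by line only helps if the file actually contains line breaks. For the chunk-based alternative mentioned above, the following is a minimal sketch in Python 3 (the question's code is Python 2). The helper name iter_words_in_chunks and the 1 MiB default chunk size are assumptions made for illustration, not part of the original answer. The one subtlety is that a chunk boundary can fall in the middle of a word, so the trailing fragment is carried over to the next chunk.

```python
import io


def iter_words_in_chunks(stream, chunk_size=1024 * 1024):
    """Yield whitespace-separated words from a text stream,
    reading at most chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        chunk = leftover + chunk
        words = chunk.split()
        # If the chunk does not end in whitespace, its last word may be
        # cut off, so hold it back and prepend it to the next chunk.
        if words and not chunk[-1].isspace():
            leftover = words.pop()
        else:
            leftover = ""
        for word in words:
            yield word
    # Whatever is still held back at end of input is a complete word.
    if leftover:
        yield leftover


# Small in-memory demonstration with a deliberately tiny chunk size
demo = io.StringIO("the quick  brown\nfox the end")
print(list(iter_words_in_chunks(demo, chunk_size=4)))
```

With a real file, the stream would be the object returned by open(filename), and the yielded words can be fed to any counting structure.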

You can also use collections.Counter to count the word frequencies.

In [1]: import collections

In [2]: freq = collections.Counter()

In [3]: line = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod'

In [4]: freq.update(line.split())

In [5]: freq
Out[5]: Counter({'ipsum': 1, 'amet,': 1, 'do': 1, 'sit': 1, 'eiusmod': 1, 'consectetur': 1, 'sed': 1, 'elit,': 1, 'dolor': 1, 'Lorem': 1, 'adipisicing': 1})

And to count some more words,

In [6]: freq.update(line.split())

In [7]: freq
Out[7]: Counter({'ipsum': 2, 'amet,': 2, 'do': 2, 'sit': 2, 'eiusmod': 2, 'consectetur': 2, 'sed': 2, 'elit,': 2, 'dolor': 2, 'Lorem': 2, 'adipisicing': 2})

A collections.Counter is a subclass of dict, so you can use it in the ways you are already familiar with. In addition, it has some useful methods for counting, such as most_common.
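
As a short illustration of the dict-like behaviour and of most_common, here is a sketch in Python 3 that reuses the sample line from the session above. The extra update with a hand-written list of words is an assumption added only so that the counts differ.

```python
import collections

line = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod'

freq = collections.Counter()
freq.update(line.split())
# Extra words, added only so that some counts differ
freq.update(['ipsum', 'ipsum', 'dolor'])

# Ordinary dict-style lookup works; a missing key gives 0 instead of KeyError
print(freq['ipsum'])     # 3
print(freq['missing'])   # 0

# most_common(n) returns the n highest counts as (word, count) pairs
print(freq.most_common(2))  # [('ipsum', 3), ('dolor', 2)]
```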

score 5 · Accepted Answer

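The problem is that you are trying to read the entire file into memory. The solution is to read the file line by line, count the words in each line, and add up the results.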
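
Put together, that approach might look like the sketch below. It is written for Python 3, whereas the question's code is Python 2. The function name count_words and the utf-8 default encoding are assumptions. The punctuation pattern is copied from the question. Unlike the question's code, this version skips tokens that become empty once punctuation is stripped.

```python
import collections
import re

# Same punctuation set as in the question's code
PUNCTUATION = re.compile(r'[.?!,":;]')


def count_words(path, encoding='utf-8'):
    """Count lower-cased words in a text file, one line at a time."""
    freq = collections.Counter()
    with open(path, encoding=encoding) as f:
        for line in f:
            # Only the current line is held in memory, never the whole file
            words = (PUNCTUATION.sub('', w) for w in line.lower().split())
            # Skip tokens that consisted only of punctuation
            freq.update(w for w in words if w)
    return freq
```

To reproduce the question's output, sorted by word, one could loop over sorted(count_words('inputfile.txt').items()) and print each pair. Memory use then depends on the number of distinct words rather than on the size of the file.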
