python - MemoryError with defaultdict(int)

Question

I am using a defaultdict(int) to record the number of occurrences of each word in a set of books.

Python is consuming 1.5 GB of RAM when I get the memory exception:

  File "C:\Python32\lib\collections.py", line 540, in update
    _count_elements(self, iterable)
MemoryError

My counter already holds more than 8,000,000 entries.

There are at least 20,000,000 unique words to count. What can I do to avoid the memory exception?

score 1 · Accepted Answer

Even on a 64-bit system with a large amount of memory, I don't think keeping track of this many words in a dict is viable. You should use a database.

/* If we added a key, we can safely resize.  Otherwise just return!
 * If fill >= 2/3 size, adjust size.  Normally, this doubles or
 * quaduples the size, but it's also possible for the dict to shrink
 * (if ma_fill is much larger than ma_used, meaning a lot of dict
 * keys have been * deleted).
 *
 * Quadrupling the size improves average dictionary sparseness
 * (reducing collisions) at the cost of some memory and iteration
 * speed (which loops over every possible entry).  It also halves
 * the number of expensive resize operations in a growing dictionary.
 *
 * Very large dictionaries (over 50K items) use doubling instead.
 * This may help applications with severe memory constraints.
 */
if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
    return 0;
return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);

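The code shows that when you insert too many items the dict has to grow, and it allocates not only space for the items it already contains but also free slots for new ones. Once more than 2/3 of the table is filled, the dict is resized to four times the number of used entries, or to twice that number once it holds more than 50,000 items. Personally, I only use a dict for fewer than a few hundred thousand items. In my experience, even with fewer than a million items it can consume several gigabytes and almost freezes my 8 GB Win7 machine.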
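To see this growth behaviour on your own interpreter, here is a minimal sketch that records the item count each time the dict's allocation changes. The helper name `resize_points` is made up for this illustration, and the exact thresholds and growth factors depend on the CPython version, so the numbers will not necessarily match the Python 3.2 source quoted above. Note that `sys.getsizeof` reports only the dict's own table, not the keys and values it refers to.

```python
import sys

def resize_points(n_items):
    """Return (item_count, size_in_bytes) each time the dict's allocation changes.

    Hypothetical helper for illustration only.
    """
    d = {}
    last = sys.getsizeof(d)
    points = []
    for i in range(n_items):
        d[i] = 0
        size = sys.getsizeof(d)
        if size != last:
            # The table was reallocated while inserting this item.
            points.append((len(d), size))
            last = size
    return points

for count, size in resize_points(100_000):
    print(f"{count:>7} items -> {size:>9} bytes")
```

Each printed line marks a resize, which makes it easy to see how much memory is reserved for slots that are not yet in use.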

If you are simply counting the items, you could:

split the words into chunks
count the words in each chunk
update the database

With a reasonable chunk size, this takes only a few database queries per chunk, which (assuming database access is the bottleneck) should be much better, IMHO.
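The three steps above could be sketched as follows. This is only one possible approach, not something the answer prescribes: it assumes SQLite through the standard `sqlite3` module, and the table name `word_counts`, the helpers `chunks` and `count_words`, and the default chunk size are all invented for this example.

```python
import sqlite3
from collections import Counter
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def count_words(words, conn, chunk_size=100_000):
    """Count words chunk by chunk, keeping the running totals in the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for chunk in chunks(words, chunk_size):
        # Only one chunk's worth of distinct words is held in memory at a time.
        partial = Counter(chunk)
        with conn:  # one transaction per chunk
            # Make sure every word in this chunk has a row, then add its count.
            conn.executemany(
                "INSERT OR IGNORE INTO word_counts (word, n) VALUES (?, 0)",
                ((w,) for w in partial),
            )
            conn.executemany(
                "UPDATE word_counts SET n = n + ? WHERE word = ?",
                ((n, w) for w, n in partial.items()),
            )

# Usage with an in-memory database; pass a file path to keep the counts on disk.
conn = sqlite3.connect(":memory:")
count_words("a b a c b a".split(), conn, chunk_size=2)
print(conn.execute("SELECT word, n FROM word_counts ORDER BY word").fetchall())
```

Because `words` can be any iterable, it can be a generator that reads the books line by line, so the full word list never has to be in memory at once.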
