
I have a table with about 30k records that I'm iterating over and processing with Django's ORM. Each record stores several binary blobs, each of which can be several MB in size, which I need to process and write out to a file.

However, I'm running into memory constraints. I have 8GB of memory on my system, but after processing about 5k records, the Python process is consuming all 8GB and gets killed by the Linux kernel. I've tried various tricks for clearing Django's query cache (see the sketch after this list), like:

  1. periodically calling MyModel.objects.update()
  2. setting settings.DEBUG=False
  3. periodically invoking Python's garbage collector via gc.collect()
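For context, my loop looks roughly like the following. MyModel, the blob field, and the output path are simplified placeholders:

    import gc

    from django.db import reset_queries

    from myapp.models import MyModel  # hypothetical import path


    def export_all():
        # this is the loop that eventually exhausts memory
        for i, record in enumerate(MyModel.objects.all()):
            with open('/tmp/blob_%d.bin' % record.pk, 'wb') as f:
                f.write(record.blob)  # hypothetical blob field
            if i % 1000 == 0:
                reset_queries()           # clear the connection's query log
                MyModel.objects.update()  # the no-op update() trick
                gc.collect()              # force a garbage-collection pass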

However, none of these has any noticeable effect, and the process keeps leaking memory until it crashes.

Is there anything else I can do?

Since I only need to process each record once, and I never access the same record again afterwards, I have no need to save any model instances or to load more than one instance at a time. How can I ensure that only one record is loaded at a time, that Django caches nothing, and that all memory is freed immediately after use?


1 Answer


Try using iterator().

A QuerySet typically caches its results internally so that repeated evaluations do not result in additional queries. In contrast, iterator() will read results directly, without doing any caching at the QuerySet level (internally, the default iterator calls iterator() and caches the return value). For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.

That's a quote from the Django documentation: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator
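Applied to your case, it would look something like this (the model import path and blob field are assumptions, mirroring the question):

    from myapp.models import MyModel  # hypothetical import path

    # iterator() streams results from the database instead of caching
    # them on the QuerySet, so memory use stays roughly constant.
    for record in MyModel.objects.iterator():
        # hypothetical blob field and output path from the question
        with open('/tmp/blob_%d.bin' % record.pk, 'wb') as f:
            f.write(record.blob)

As long as you don't keep references to the records yourself, each one becomes garbage-collectable as soon as the loop moves past it. (Newer Django versions also accept a chunk_size argument to iterator() to control how many rows are fetched per round trip.)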

answered 2012-11-09T16:36:16.620