i have a VERY large text file to parse (~2GB). for various reasons i have to process the file line-wise. i do this by loading the whole text file into memory (the server I'm running the parser on has more than enough memory):

```csharp
var records = Regex.Split(File.ReadAllText(dumpPath, Encoding.Default), @"my regex here")
                   .Where(s => !string.IsNullOrEmpty(s));
```

this consumes RAM roughly equivalent to the size of the text file, plus a few MB of overhead for the IEnumerable. so far so good.
then i go over the collection with foreach (var record in records) { ... }.
here comes the interesting part. i do a lot of string manipulation and regex-ing in the foreach loop, and the program quickly bombs with a System.OutOfMemoryException, even though i never use more than a few kB at a time inside the loop. i took a few memory snapshots with my profiler of choice (ANTS Memory Profiler) and saw millions and millions of Generation 2 string objects on the heap, consuming all available memory.
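to give an idea of the per-record work, the loop body does roughly this (a simplified sketch; the pattern, group, and variable names are placeholders for the real parsing logic):

```csharp
using System.Text.RegularExpressions;

// placeholder pattern standing in for the real parsing regex
var fieldPattern = new Regex(@"key=(\w+)");

foreach (var record in records)
{
    // every Trim/Match/Replace call here allocates new temporary strings
    var trimmed = record.Trim();
    var match = fieldPattern.Match(trimmed);
    if (match.Success)
    {
        var value = match.Groups[1].Value.Replace("_", " ");
        // ... only ever a few kB of live data per iteration
    }
}
```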
seeing those millions of Gen 2 strings, i included a GC.Collect() at the end of each foreach iteration, just as a test, and voila: problem solved, no more out of memory exceptions (though, sure enough, the constant garbage collections now make the program run painstakingly slow). the only memory consumed is the size of the actual file.
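in code, the workaround amounts to this (again, only a diagnostic, not something i want to keep):

```csharp
foreach (var record in records)
{
    // ... same string manipulation and regex work as before ...

    // forcing a full blocking collection after every single record
    // keeps memory flat but makes the run painfully slow
    GC.Collect();
}
```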
now i can't explain why this happens or how to prevent it. to my understanding, the moment a variable goes out of scope and has no more (active) references, the object should be eligible for garbage collection, right?
on another side note: i tried running the program on a really massive machine (64GB RAM). the program finished successfully, but it never released a single byte of memory until it was closed. why? if an object has gone out of scope and there are no more references to it, why is the memory never released?