Problem
Pi = 3.14159 26 5358979323846 26 433... so the first 2-digit substring to repeat is 26.
What is an efficient way of finding the first 20-digit substring to repeat?
Constraints
I have about 500 gigabytes of digits of Pi (1 byte per digit), and about 500 gigabytes of disk space free.
I have about 5 gigabytes of RAM free.
I am interested in an efficient algorithm that would work for arbitrary sequences rather than the specific answer for Pi itself. In other words, I am not interested in a solution of the form "print 123....456" even if the number it prints is correct.
What I've tried
I place each substring into a hash table and report the first collision.
(The hash table is constructed as an array of sorted linked lists. The index into the array is given by the bottom digits of the string (converted to an integer), and the value stored in each node is the location in the expansion of Pi where the substring first appeared.)
This worked fine until I ran out of RAM.
To scale to longer sequences I have considered:
Generating the hash for all substrings starting in a certain range, and then continuing the search over the remainder of the digits. This needs to rescan the entire sequence of Pi for each range, so becomes order N^2
Bucket sorting the set of 20-digit substrings to multiple files, and then using a hash table to find the first repeat in each file separately. Unfortunately, with this method I run out of disk space so would need 20 passes through the data. (If I start with 1000 digits, then I will end up with 1000 20-digit substrings.)
Storing 2 digits of Pi per byte to free up more memory.
Adding a disk-based backing store to my hash table. I worry that this would behave very poorly as there is no obvious locality of reference.
Is there a better approach?
Update
I tried Adrian McCarthy's qsort method but this seemed a little slower than hashing for finding duplicates
I looked at btilly's MapReduce suggestion for parallelising the algorithm but it was heavily IO bound on my single computer so not appropriate for me (with my single disk drive)
I implemented supercat's method to spend last night splitting the files and searching for 19 digit substrings in the first 18 billion digits.
This found 16 matches so I used Jarred's suggestion to re-check the 19 digit matches to find the first 20 digit match
To search 18 billion digits took 3 hours to split the files, and 40 minutes to then rescan the files looking for matches.
Answer
The 20-digit substring 84756845106452435773 is found at positions 1,549,4062,637 and 17,601,613,330 within the decimal expansion of Pi.
Many thanks everyone!