algorithm - Efficiently computing the first 20-digit substring to repeat in the decimal expansion of Pi

Question

Problem

Pi = 3.14159 26 5358979323846 26 433... so the first 2-digit substring to repeat is 26.

What is an efficient way of finding the first 20-digit substring to repeat?

Constraints

I have about 500 gigabytes of digits of Pi (1 byte per digit), and about 500 gigabytes of disk space free.
I have about 5 gigabytes of RAM free.
I am interested in an efficient algorithm that would work for arbitrary sequences rather than the specific answer for Pi itself. In other words, I am not interested in a solution of the form "print 123....456" even if the number it prints is correct.

What I've tried

I place each substring into a hash table and report the first collision.

(The hash table is constructed as an array of sorted linked lists. The index into the array is given by the bottom digits of the string (converted to an integer), and the value stored in each node is the location in the expansion of Pi where the substring first appeared.)

This worked fine until I ran out of RAM.

To scale to longer sequences I have considered:

Generating the hash for all substrings starting in a certain range, and then continuing the search over the remainder of the digits. This needs to rescan the entire sequence of Pi for each range, so becomes order N^2
Bucket sorting the set of 20-digit substrings to multiple files, and then using a hash table to find the first repeat in each file separately. Unfortunately, with this method I run out of disk space so would need 20 passes through the data. (If I start with 1000 digits, then I will end up with 1000 20-digit substrings.)
Storing 2 digits of Pi per byte to free up more memory.
Adding a disk-based backing store to my hash table. I worry that this would behave very poorly as there is no obvious locality of reference.

Is there a better approach?

Update

I tried Adrian McCarthy's qsort method but this seemed a little slower than hashing for finding duplicates
I looked at btilly's MapReduce suggestion for parallelising the algorithm but it was heavily IO bound on my single computer so not appropriate for me (with my single disk drive)
I implemented supercat's method to spend last night splitting the files and searching for 19 digit substrings in the first 18 billion digits.
This found 16 matches so I used Jarred's suggestion to re-check the 19 digit matches to find the first 20 digit match

To search 18 billion digits took 3 hours to split the files, and 40 minutes to then rescan the files looking for matches.

Answer

The 20-digit substring 84756845106452435773 is found at positions 1,549,4062,637 and 17,601,613,330 within the decimal expansion of Pi.

Many thanks everyone!

score 6 · Accepted Answer

This is an interesting problem.

First let's do some back of the envelope numbers. Any particular sequence of 20 digits, will match one time in 10²⁰. If we go out to the n'th digit we have roughly n²/2 pairs of 20 digit sequences. So to have good odds of finding a match we're going to probably need to have n a bit above 10¹⁰. Assuming that we're taking 40 bytes per record, we're going to need something on the order of 400 GB of data. (We actually need more data than this, so we should be prepared for something over a terabyte of data.)

That gives us an idea of the needed data volume. Tens of billions of digits. Hundreds of GB of data.

Now here is the problem. If we use any data structure that requires random access, random access time is set by the disk speed. Suppose that your disk goes at 6000 rpm. That's 100 times per second. On average the data you want is halfway around the disk. So you get 200 random accesses per second on average. (This can vary by hardware.) Accessing it 10 billion times is going to take 50 million seconds, which is over a year. If you read, then write, and wind up needing 20 billion data points - you're exceeding the projected lifetime of your hard drive.

The alternative is to process a batch of data in a way where you do not access randomly. The classic is to do a good external sort such a merge-sort. Suppose that we have 1 terabyte of data, which we read 30 times, write 30 times, during sorting. (Both estimates are higher than needed, but I'm painting a worst case here.) Suppose our harddrive has a sustained throughput of 100 MB/s. Then each pass takes 10,000 seconds, for 600,000 seconds, which is slightly under a week. This is very doable! (In practice it should be faster than this.)

So here is the algorithm:

Start with a long list of digits, 3141...
Turn this into a much bigger file where each line is 20 digits, followed by the location where this appears in pi.
Sort this bigger file.
Search the sorted file for any duplications.
1. If any are found, return the first.
2. If none are found, repeat steps 1-3 with another big chunk of digits.
3. Merge this into the previous sorted file.
4. Repeat this search.

Now this is great, but what if we don't want to take a week? What if we want to throw multiple machines at it? This turns out to be surprisingly easy. There are well-known distributed sorting algorithms. If we split the initial file into chunks, we can parallelize both steps 1 and 4. And if after step 4 we don't find a match, then we can just repeat from the start with a bigger input chunk.

In fact this pattern is very common. All that really varies is turning the initial data into stuff to be sorted, and then looking at matching groups. This is the http://en.wikipedia.org/wiki/MapReduce algorithm. And this will work just fine for this problem.

score 4 · Accepted Answer

Trie

RBarryYoung has pointed out that this will exceed the memory limits.

A trie data structure might be appropriate. In a single pass you can build up a trie with every prefix you've seen up to length n (e.g., n = 20). As you continue to process, if you ever reach a node at level n that already exists, you've just found a duplicate substring.

Suffix Matching

Another approach involves treating the expansion as a character string. This approach finds common suffixes, but you want common prefixes, so start by reversing the string. Create an array of pointers, with each pointer pointing to the next digit in the string. Then sort the pointers using a lexicographic sort. In C, this would be something like:

qsort(array, number_of_digits, sizeof(array[0]), strcmp);

When the qsort finishes, similar substrings will be adjacent in the pointer array. So for every pointer, you can do a limited string comparison with that string and the one pointed to by the next pointer. Again, in C:

for (int i = 1; i < number_of_digits; ++i) {
  if (strncmp(array[i - 1], array[i], 20) == 0) {
    // found two substrings that match for at least 20 digits
    // the pointers point to the last digits in the common substrings
  }
}

The sort is (typically) O(n log_2 n), and the search afterwards is O(n).

This approach was inspired by this article.

score 2 · Accepted Answer

データセットはかなり大きいので、ある種の「分割統治」が必要になります。最初のステップとして、問題をいくつかの部分（たとえば、100）に分割することをお勧めします。ファイルに00で始まる重複する20桁のシーケンスがあるかどうかを確認することから始め、次に01で始まるものがあるかどうかを確認します。99までなど。これらの「メインパス」のそれぞれを開始するには、すべてのファイルに書き込みます。正しい番号で始まる20桁のシーケンス。最初の2桁が一定の場合、最後の18桁を書き出すだけで済みます。18桁の10進数は8バイトの「長さ」に収まるため、出力ファイルはおそらく約5,000,000,000の数値を保持し、40GBのディスク容量を占有します。一度に複数の出力ファイルを作成する価値がある場合があることに注意してください。

特定の「メインパス」のデータファイルを生成したら、その中に重複があるかどうかを判断する必要があります。そこに格納されている数のビットに基づいて、それをいくつかの小さなセクションに分割することは、良い次のステップかもしれません。それを256の小さなセクションに分割すると、各セクションには約1,600万から3,200万の数字が含まれます。5ギガバイトのRAMを使用して、256個のバケットごとに100万個の数値をバッファリングできます。百万の数字の各チャンクを書き出すにはランダムなディスクシークが必要ですが、そのような書き込みの数はかなり合理的です（おそらく約10,000のディスクシーク）。

データがそれぞれ1600万から3200万の番号のファイルに分割されたら、そのような各ファイルをメモリに読み込んで、重複を探します。

説明されているアルゴリズムはおそらく最適ではありませんが、かなり近いはずです。最も興味深いのは、メインパスの数を半分に減らすと、ソースデータファイルを読み取る必要がある回数が半分になりますが、データが取得された後、各パスの処理に必要な時間は2倍以上になるという事実です。コピーされました。ソースファイルを100回通過するのはおそらく最適ではないと思いますが、その分割係数を使用するプロセス全体に必要な時間は、最適な分割係数を使用する時間にかなり近いでしょう。

score 2 · Accepted Answer

Perhaps something like this will work:

Search for repeated substrings of length 2 (or some small base case), record starting indicies S={s_i}
For n=3..N, look for substrings of length n from the indices in S
Each iteration, update S with substrings of length n
at n=20, the first two indicies will be your answer

you might want to adjust the initial size and step size (it might not be necessary to step by 1 each time)

algorithm - Efficiently computing the first 20-digit substring to repeat in the decimal expansion of Pi

Problem

Constraints

What I've tried

Update

Answer

4 に答える 4

Related

Reference