高レベルのソリューション:
- 入力を解析し、「[token] + X + Y」行を1-of-N出力ファイルに出力します(これらの「シャーディング」出力ファイルはそれぞれ、メモリ内で処理できるほど小さいです)。
- [ファイルごとに]メモリに読み込み、「[token] [count1][count2]...」行のソートされたファイルを出力します
- クエリ時に、正しいファイルでバイナリ検索を実行します
詳細:ステップ1のPython擬似コードは次のとおりです)
NUM_SHARDS = 1000 # big enough to make each file fit in memory
output_files = [open("file" + str(n), "w") for n in xrange(NUM_SHARDS)]
for token in input_stream:
shard_id = hash(token) % NUM_SHARDS
output_files[shard_id].write(token + " +0 +1\n")
# TODO: output the correct +X and +Y as needed
これがステップ2のPython擬似コードです)
input_files = [open("file" + str(n)) for n in xrange(NUM_SHARDS)]
for file in input_files:
counts = {} # Key: token Value: { "count1": 0, "count2": 1 }
# read the file, and populate 'counts'
for line in file:
(token, count1, count2) = line.split(" ")
# make sure we have a value for this token
counts.setdefault(token, { "count1": 0, "count2": 0 })
counts[token]["count1"] += int(count1)
counts[token]["count2"] += int(count2)
# TODO: compute those floats, and stuff those inside 'counts' also
# now write 'counts' out to a file (in sorted order)
output_file = open(file.name + ".index", "w")
for token, token_counts in sorted(counts.items()):
output_file.write(token + " " + token_counts["counts1"] + " " + token_counts["counts2"] + "\n")
# TODO: also write out those floats in the same line
ステップ3)のPythonコードは次のとおりです。
# assume 'token' contains the token you want to find
shard_id = hash(token) % NUM_SHARDS
filename = "file" + str(shard_id) + ".index"
binary_search(token, open(filename), 0, os.path.getsize(filename))
# print out the line in 'file' whose first token is 'token'
# begin/end always point to the start of a line
def binary_search(token, file, begin, end):
# If we're close, just do brute force
if end - begin < 10000:
file.seek(begin)
while file.tell() < end:
line = file.readline()
cur_token = line.strip().split(" ")[0]
if cur_token == token:
print line
return True
return False # not found
# If we're not close, pivot based on a line near the middle
file.seek((begin + end) / 2)
partial_line = file.readline() # ignore the first fractional line
line = file.readline()
cur_token = line.strip().split(" ")[0]
if cur_token == token:
print line
return True
elif cur_token < token:
return binary_search(token, file, file.tell(), end)
else: # cur_token > token
return binary_search(token, file, begin, file.tell() - len(line))