python - Data model and datastore technology for fast multi-dimensional data lookup

Question

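I have a parent hash map data structure that has strings as keys and hash map data structures as children (say, child1, child2, ..., childN). Each child is a simple key-value map, with numbers as keys and strings as values. In pseudocode: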

parent['key1'] = child1;    // child1 is a hash map data structure
child1[0] = 'foo';
child1[1] = 'bar';
...

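I need to implement this data structure as a fast lookup table in a database system. Let's take Python as the reference language.

Solution requirements:

Retrieve the child hash maps as quickly as possible.
The parent hash has an estimated total size of at most 500 MB.

The use case is as follows:

A client Python program queries the datastore for a specific child hash.
The datastore returns the child hash.
The Python program passes the whole hash to a specific function, extracts a specific value from the hash (it already knows which key to use), and passes that value to a second function.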
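
To make the use case concrete, here is a minimal in-process sketch of the three steps above. Everything in it is hypothetical: the plain dict stands in for the datastore, and the names fetch_child, extract_value and second_function are placeholders that do not come from any library.

```python
# Hypothetical in-process model of the lookup table described above.
# A plain dict stands in for the datastore.
parent = {
    "key1": {0: "foo", 1: "bar"},
    "key2": {0: "baz", 1: "qux"},
}


def fetch_child(parent_key):
    """Stand-in for the datastore query: return the whole child hash."""
    return parent[parent_key]


def extract_value(child, field):
    """First function: pick one value out of the child using a known key."""
    return child[field]


def second_function(value):
    """Placeholder for whatever consumes the extracted value."""
    return value.upper()


child = fetch_child("key1")
result = second_function(extract_value(child, 1))
print(result)
```

Whatever datastore is chosen only has to replace fetch_child; the rest of the pipeline stays the same.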

Would you recommend an in-memory key-value datastore (such as Redis) or a more classical "relational" database solution? Which data model would you suggest I use?

score 3 · Accepted Answer

Definitely use Redis. Not only is it really fast, it also handles exactly the structure you need: http://redis.io/commands#hash

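In your case, since the client "extracts a specific value from the hash (it already knows which key to use)", you can avoid reading the whole "child hash" and fetch just that field.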

redis> HMSET myhash field1 "Hello" field2 "World"
OK
redis> HGET myhash field1
"Hello"
redis> HGET myhash field2
"World"

Or, if you do want the whole hash:

redis> HGETALL myhash
1) "field1"
2) "Hello"
3) "field2"
4) "World"
redis>

Of course, with a client library you get the result as a workable object (in your case, a Python dictionary).
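
As a rough sketch of how that might look from Python, the helpers below store one child as a native hash and read back either a single field or the whole child. The helper names and the InMemoryHashClient class are hypothetical; the class only mimics the hset, hget and hgetall calls of redis-py so the example runs without a server.

```python
class InMemoryHashClient:
    """Hypothetical stand-in mimicking the three redis-py calls used below."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, mapping):
        self._hashes.setdefault(name, {}).update(mapping)

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)

    def hgetall(self, name):
        return dict(self._hashes.get(name, {}))


def save_child(client, parent_key, child):
    """Store one child as a hash; int keys are stringified for storage."""
    client.hset(parent_key, mapping={str(k): v for k, v in child.items()})


def load_value(client, parent_key, field):
    """Fetch a single value without transferring the whole child hash."""
    return client.hget(parent_key, str(field))


def load_child(client, parent_key):
    """Fetch the whole child hash and restore the integer keys."""
    return {int(k): v for k, v in client.hgetall(parent_key).items()}


client = InMemoryHashClient()
save_child(client, "key1", {0: "foo", 1: "bar"})
print(load_value(client, "key1", 1))
print(load_child(client, "key1"))
```

With a real server, the assumption is that a client created as redis.Redis(decode_responses=True) can be passed in place of the stand-in, so that replies come back as str rather than bytes.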

score 2 · Accepted Answer

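Here is some sample code using redis-py. It assumes Redis (and ideally hiredis) is already installed, saves each parent as a Redis hash with each child stored in a field as a serialized string, and handles serialization and deserialization on the client side.

JSON version: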

## JSON version
import json
# you could use pickle instead,
# just replace json.dumps/json.loads with pickle.dumps/pickle.loads

import redis

# set up the redis client (assumes a Redis server on localhost)
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# sample parent dicts
parent0 = {'child0': {0:'a', 1:'b', 2:'c',}, 'child1':{5:'e', 6:'f', 7:'g'}}
parent1 = {'child0': {0:'h', 1:'i', 2:'j',}, 'child1':{5:'k', 6:'l', 7:'m'}}

# save the parents as hashfields, with the children as serialized strings
# bear in mind that JSON will convert the int keys to strings in the dumps() process
r.hmset('parent0', {key: json.dumps(parent0[key]) for key in parent0})
r.hmset('parent1', {key: json.dumps(parent1[key]) for key in parent1})


# Get a child dict from a parent
# say child1 of parent0
childstring = r.hget('parent0', 'child1') 
childdict = json.loads(childstring) 
# this could have been done in a single line... 

# if you want to convert the keys back to ints,
# build a new dict rather than mutating the one being iterated over:
childdict = {int(key): value for key, value in childdict.items()}

print(childdict)
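
The int-to-string key conversion mentioned in the comments can also be undone while decoding. A minimal sketch, assuming every child is a flat dict with integer keys; dumps_child and loads_child are hypothetical helper names, while object_pairs_hook is a standard json.loads parameter.

```python
import json


def dumps_child(child):
    """Serialize a child dict; JSON turns the int keys into strings."""
    return json.dumps(child)


def loads_child(payload):
    """Deserialize a child dict and convert the keys back to ints."""
    return json.loads(
        payload,
        object_pairs_hook=lambda pairs: {int(k): v for k, v in pairs},
    )


child = {5: "e", 6: "f", 7: "g"}
payload = dumps_child(child)
print(json.loads(payload))   # keys come back as strings
print(loads_child(payload))  # keys restored to ints
```

The hook runs for every JSON object in the payload, so this only fits data where all keys are numeric.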

Pickle version:

## pickle version
# For pickle.dump/pickle.load, you need a file-like object.
# On Python 2, StringIO is the native python one, while cStringIO
# is the C implementation of the same and is faster.
# On Python 3, pickle needs a bytes buffer, so io.BytesIO is used instead.
# see http://docs.python.org/library/stringio.html and
# http://www.doughellmann.com/PyMOTW/StringIO/ for more information
import pickle
# Find the best implementation available on this platform
try:
    from cStringIO import StringIO  # Python 2, C implementation
except ImportError:
    from io import BytesIO as StringIO  # Python 3 (also available on 2.6+)

import redis

# set up the redis client (assumes a Redis server on localhost)
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# sample parent dicts
parent0 = {'child0': {0:'a', 1:'b', 2:'c',}, 'child1':{5:'e', 6:'f', 7:'g'}}
parent1 = {'child0': {0:'h', 1:'i', 2:'j',}, 'child1':{5:'k', 6:'l', 7:'m'}}

# define a class with a reusable StringIO object
class Pickler(object):
    """Simple helper class to use pickle with a reusable string buffer object"""
    def __init__(self):
        self.tmpstr = StringIO()

    def __del__(self):
        # close the StringIO buffer and delete it
        self.tmpstr.close()
        del self.tmpstr

    def dump(self, obj):
        """Pickle an object and return the pickled string"""
        # empty current buffer
        self.tmpstr.seek(0,0)
        self.tmpstr.truncate(0)
        # pickle obj into the buffer
        pickle.dump(obj, self.tmpstr)
        # move the buffer pointer to the start
        self.tmpstr.seek(0,0)
        # return the pickled buffer as a string
        return self.tmpstr.read()

    def load(self, obj):
        """load a pickled object string and return the object"""
        # empty the current buffer
        self.tmpstr.seek(0,0)
        self.tmpstr.truncate(0)
        # load the pickled obj string into the buffer
        self.tmpstr.write(obj)
        # move the buffer pointer to start
        self.tmpstr.seek(0,0)
        # load the pickled buffer into an object
        return pickle.load(self.tmpstr)


pickler = Pickler()

# save the parents as hashfields, with the children as pickled strings, 
# pickled using our helper class
r.hmset('parent0', {key: pickler.dump(parent0[key]) for key in parent0})
r.hmset('parent1', {key: pickler.dump(parent1[key]) for key in parent1})


# Get a child dict from a parent
# say child1 of parent0
childstring = r.hget('parent0', 'child1') 
# this could be done in a single line... 
childdict = pickler.load(childstring) 

# we don't need to do any str to int conversion on the keys.

print(childdict)
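
As a side note, pickle also provides dumps and loads, which work on byte strings directly, so the buffer helper may not be needed at all. A minimal sketch of the same round trip without the helper class (the sample dict is just illustrative data):

```python
import pickle

child = {5: "e", 6: "f", 7: "g"}

# Serialize straight to a byte string; no file-like buffer is involved.
payload = pickle.dumps(child, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialize; the int keys survive the round trip unchanged.
restored = pickle.loads(payload)
print(restored)
```

The resulting payload could be stored in a hash field in place of the output of pickler.dump. Keep in mind that unpickling executes whatever the payload describes, so it is only appropriate for data you wrote yourself.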

score 0 · Accepted Answer

After a quick search based on Javier's hint, I came up with this solution: I could store each parent key as a hash in Redis, where the field value is the string representation of the child hash. This way I can quickly read the strings and evaluate them from the Python program.

As an example, my Redis data structure will look similar to:

//write a hash with N key-value pairs: each value is an M key-value pairs hash
redis> HMSET parent_key1 child_hash "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"
  OK
redis> HMSET parent_key2 child_hash "c2k1:c2v1, c2k2:c2v2, [...], c2kM:c2vM"
  OK
[...]
redis> HMSET parent_keyN child_hash "cNk1:cNv1, cNk2:cNv2, [...], cNkM:cNvM"
  OK

//read data
redis> HGET parent_key1 child_hash
  "c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM"

Then my Python code just needs to use the Redis bindings to query for the desired child hashes and get back their string representations. What is left to do is turn those string representations into the corresponding dictionaries, which can then be looked up as needed.

Example code (as suggested in this answer):

>>> import ast
>>> # Redis query:
>>> #   1. Setup Redis bindings
>>> #   2. Ask for value at key: parent_key1
>>> #   3. Store the value to 's' string
>>> d = ast.literal_eval('{' + s + '}')
>>> d
{c1k1:c1v1, c1k2:c1v2, [...], c1kM:c1vM}
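
A runnable version of the same idea, as a sketch with made-up data: the string s below is a hypothetical value standing in for what HGET would return, written so that keys and values are valid Python literals (ast.literal_eval cannot parse bare names such as c1k1).

```python
import ast

# Hypothetical stored value: comma-separated "key: value" pairs
# where every key and value is a valid Python literal.
s = "0: 'foo', 1: 'bar'"

# Wrap the pairs in braces and parse them as a dict literal.
d = ast.literal_eval("{" + s + "}")
print(d[1])
```

If the client returns bytes rather than str, the value would need to be decoded (for example with s.decode('utf-8')) before concatenating the braces.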

Hope I'm not missing anything!
