python - Pythonで大きなリストを維持する

Question

Pythonのピクルス可能なオブジェクトの大規模なリストを維持する必要があります。リストが大きすぎてすべてをRAMに保存できないため、データベース/ページングメカニズムが必要です。このメカニズムが、リスト内の近くの（近くの）領域への高速アクセスをサポートする必要があります。

リストはすべてのpython-list機能を実装する必要がありますが、ほとんどの場合、順番に作業します。リスト内のある範囲をスキャンし、スキャン中にスキャンポイントにノードを挿入/ポップするかどうかを決定します。

リストは非常に大きくなる可能性があり（2〜3 GB）、一度にすべてをRAMに含めるべきではありません。ノードは小さい（100〜200バイト）が、さまざまなタイプのデータを含めることができます。

これに対する良い解決策は、最後にアクセスされたバケットのみがRAMにロードされるBTreeを使用することです。

複雑なインデックスキーメカニズムを実装する必要があるため、SQLテーブルの使用は適切ではありません。私のデータはテーブルではなく、特定のインデックスに要素を追加したり、特定の位置から要素をポップしたりする機能を備えた単純なPythonリストです。

ZODBデータベースファイルに保存できるBTreeベースのリストを実装するZODBとzc.blistを試しましたが、上記の機能が妥当な時間で実行されるように構成する方法がわかりません。すべてのマルチスレッド/トランザクション機能は必要ありません。私のシングルスレッドプログラムを除いて、他の誰もデータベースファイルに触れません。

上記の機能が高速に実行されるようにZODB\zc.blistを構成する方法を誰かに説明してもらえますか、それとも別の大規模なリストの実装を見せてもらえますか？

私が試したいくつかの簡単で汚いコード：

import time
import random

NODE_JUMP = 50000
NODE_ACCESS = 10000

print 'STARTING'


random_bytes = open('/dev/urandom', 'rb')

my_list = list()

nodes_no = 0

while True:
    nodes_no += NODE_JUMP
    start = time.time()
    my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
    print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)

    section_start = random.randint(0, nodes_no -NODE_ACCESS -1)
    start = time.time()
    for index in xrange(section_start, section_start + NODE_ACCESS):
        # rotate the string
        my_list[index] = my_list[index][1:] + my_list[index][0]

    print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start,)

印刷は次で終了しました：

5000000ノードへの拡張には3.49秒かかりました
10000ノードへのアクセスには0.02秒かかりました
5050000ノードへの拡張には3.98秒かかりました
10000ノードへのアクセスには0.01秒かかりました
5100000ノードへの拡張には2.54秒かかりました
10000ノードへのアクセスには0.01秒かかりました
5150000ノードへの拡張には2.19秒かかりました
10000ノードへのアクセスには0.11秒かかりました
5200000ノードへの拡張には2.49秒かかりました
10000ノードへのアクセスには0.01秒かかりました
5250000ノードへの拡張には3.13秒かかりました
10000ノードへのアクセスには0.05秒かかりました
殺された（私ではない）

score 2 · Accepted Answer

結局、zc.blist を使用すると良い結果が得られます。DB を作成するときに「cache_size」オプションを設定すると、RAM に残るデータのサイズが制御されます。「transaction.commit」を頻繁に行わないと、使用される RAM のサイズが大きくなる可能性があります。大きな cache_size を定義し、transaction.commit を頻繁に実行することで、blist の最後にアクセスされたバケットが RAM に残り、それらへの高速アクセスが可能になり、使用される RAM の量が大きくなりすぎなくなります。

ただし、パッキングには非常にコストがかかりますが、大容量のハードディスクを使用している場合は、それほど頻繁にパッキングする必要はありません。

自分で試してみるコードを次に示します。バックグラウンドで「top」を実行し、cache_size を変更して、使用される RAM の量にどのように影響するかを確認します。

import time
import os
import glob
from ZODB import DB
from ZODB.FileStorage import FileStorage
import transaction
from zc.blist import BList

print('STARTING')

random = open('/dev/urandom', 'rb')


def test_list(my_list, loops = 1000, element_size = 100):
    print('testing list')
    start = time.time()
    for loop in xrange(loops):
        my_list.append(random.read(element_size))
    print('appending %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    length = len(my_list)
    print('length calculated in %.4f seconds' % (time.time() - start,))

    start = time.time()
    for loop in xrange(loops):
        my_list.insert(length / 2, random.read(element_size))
    print('inserting %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        my_list[loop] = my_list[loop][1:] + my_list[loop][0]
    print('modifying %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        del my_list[0]
    print('removing %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    transaction.commit()
    print('committing all above took %.4f seconds' % (time.time() - start,))

    del my_list[:loops]
    transaction.commit()

    start = time.time()
    pack()
    print('packing after removing %s elements took %.4f seconds' % (loops, time.time() - start))

for filename in glob.glob('database.db*'):    
    try:
        os.unlink(filename)
    except OSError:
        pass

db = DB(FileStorage('database.db'),
        cache_size = 2000)

def pack():
    db.pack()

root = db.open().root()

root['my_list'] = BList()

print('inserting initial data to blist')

for loop in xrange(10):
    root['my_list'].extend(random.read(100) for x in xrange(100000))
    transaction.commit()

transaction.commit()

test_list(root['my_list'])

score 0 · Accepted Answer

ZODBは使用するツールだと思います。多くの任意のアイテムを保存し、メモリの問題を処理します。

これが実用的な例です。この場合、相互に参照するオブジェクトと、番号によって BTree に格納されるオブジェクトを含めました。

import random
from collections import deque

import ZODB
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import BTrees

def random_string(n=100):
    return ''.join([chr(random.randint(0,95)+32) for i in xrange(n)]) 


class Node(persistent.Persistent):
   def __init__(self, value=None):
       if not value:
           self.value =  random_string()

   def setNeighbors(self, refs):
       self.p1 = refs[0]
       self.p2 = refs[1]
       self.p3 = refs[2]
       self.p4 = refs[3]


def getTree():
    storage = FileStorage('c:\\test.zdb')
    db = DB(storage)
    connection = db.open()
    root = connection.root()
    if root.has_key('tree'):
        tree = root['tree']
    else:
        tree = BTrees.OOBTree.OOBTree()
        root['tree'] = tree
        transaction.commit()
    return tree


def fillDB():
    tree = getTree()

    # start with some initial objects.
    nodes = deque([Node(), Node(), Node(), Node()])
    node = Node()

    for n in xrange(20000):
        tree[n] = node           # store the node based on a numeric ID
        node.setNeighbors(nodes) # Make the node refer to more nodes.
        node = nodes.popleft()   # maintain out list of 4 upcoming nodes.
        nodes.append(Node())
        if n % 1000 == 0:
            transaction.commit() # Must commit for data to make it to disk.
            print n
    transaction.commit()
    return tree

この時点で、tree変数は基本的に辞書のように機能し、キーでアクセスできます。ZODB BTrees API ドキュメントtree.keys(min, max)で説明されているように、を使用して範囲内のキーを取得することもできます。

rootZODB によって返されるオブジェクト内の異なるキーの下に各リストを配置することで、10 個のリストを保存できます。オブジェクトは、ZODB オブジェクトストレージへのroot「ゲートウェイ」として機能します。

ZODB のおかげで、オブジェクト間参照と Btree インデックスも使用できます。例えば：

tree = getTree()

node1 = tree[1]
print node1.p1.p1.p1.p1.p1.p1.p1.p1.p1.value

score 0 · Accepted Answer

東京キャビネットを使ってみませんか？リストのように非常に高速でシンプルですが、必要なもののために構築されています。 http://1978th.net/tokyocabinet/

python - Pythonで大きなリストを維持する

3 に答える 3

Related

Reference