python - Saving dictionaries to file (numpy and Python 2/3 friendly)

Question

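Basically, I need to save dictionaries to files. By that I mean any kind of dictionary structure, which may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want it to store numpy arrays in a space-optimized way and to play nicely between Python 2 and 3.

Below are the methods I know of. My question is what is missing from this list. Is there an alternative that dodges all of my deal-breakers?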

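Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
Numpy's save/savez/load (deal-breaker: format is incompatible between Python 2 and 3)
PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
Using PyTables manually (deal-breaker: I need this for constantly changing research code, so it is really convenient to be able to dump dictionaries to files by calling a single function)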

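The PyTables replacement for numpy.savez is promising, since I like the idea of using hdf5 and it compresses numpy arrays really efficiently, which is a big plus. However, it does not take any kind of dictionary structure.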

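Lately, what I have been doing is using something similar to the PyTables replacement, but extending it so that it can store any type of entry. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, where the chunksize setting determines how much space each of them takes up.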
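The length-1 array workaround described above can be avoided in plain numpy by storing scalars as 0-d arrays, which stay distinguishable from real length-1 arrays. The following is only a minimal sketch under stated assumptions, not the PyTables approach from the question: it flattens a nested dict into path-like keys and hands them to numpy.savez_compressed. The helper names (flatten, save_dict, load_dict) and the "/" separator are made up for illustration. All keys are assumed to be strings that do not contain the separator, and leaves are assumed to be numbers, strings, or arrays (None and arbitrary objects would need pickling, which this sketch does not attempt).

```python
import io

import numpy as np

SEP = "/"  # hypothetical separator; dict keys must not contain it


def flatten(d, prefix=""):
    """Yield (path, array) pairs for every leaf of a nested dict."""
    for key, value in d.items():
        path = prefix + SEP + key if prefix else key
        if isinstance(value, dict):
            for pair in flatten(value, path):
                yield pair
        else:
            # Scalars become 0-d arrays, so they differ from length-1 arrays.
            yield path, np.asarray(value)


def save_dict(file, d):
    """Write a nested dict to one compressed .npz archive."""
    # Limitation: a flattened key named "file" would clash with the
    # first parameter of savez_compressed.
    np.savez_compressed(file, **dict(flatten(d)))


def load_dict(file):
    """Rebuild the nested dict written by save_dict."""
    out = {}
    with np.load(file) as npz:
        for path in npz.files:
            parts = path.split(SEP)
            node = out
            for part in parts[:-1]:
                node = node.setdefault(part, {})
            value = npz[path]
            # 0-d arrays are turned back into plain Python scalars.
            node[parts[-1]] = value.item() if value.ndim == 0 else value
    return out


buf = io.BytesIO()
save_dict(buf, {"a": {"b": np.arange(5), "c": 3.5}, "one": np.array([7])})
buf.seek(0)
restored = load_dict(buf)
```

Empty sub-dictionaries are lost in this scheme, because only leaves are written.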

Is there something like that already out there?

Thanks!

score 4 · Accepted Answer

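After asking this two years ago, I started coding my own HDF5-based replacement for pickle/np.save. It has since matured into a stable package, so I thought I would finally answer my own question and accept the answer, because by design it is exactly what I was looking for: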

https://github.com/uchicago-cs/deepdish

score 2 · Answer

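I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.

They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There is also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.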

import warnings

import tables

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError:
        if not force:
            raise
        # A node with this name already exists: remove it, then recreate it.
        f.remove_node(parent, groupname, recursive=True)
        g = f.create_group(parent, groupname)
    # items() works on both Python 2 and 3 (iteritems() is Python 2 only).
    for key, item in dictin.items():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                # None cannot be stored as an array, so use a sentinel string.
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """

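    try:
        ask = raw_input  # Python 2
    except NameError:
        ask = input  # Python 3

    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        """Return True if the node is small enough or the user confirms."""
        mem = child.size_in_memory
        if mem <= threshold:
            return True
        print('[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6))
        confirm = ask('Load it anyway? [y/N] >>')
        if confirm.lower() == 'y':
            return True
        print('Skipping item "%s"...' % child._v_pathname)
        return False

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if not recursive:
                    continue
                # Pass the options down so nested groups behave the same way.
                item = group2dict(
                    f, child, recursive=True, warn=warn,
                    warn_if_bigger_than_nbytes=warn_if_bigger_than_nbytes)
            else:
                if warn and not memtest(child):
                    continue
                item = child.read()
                # None was stored as the sentinel '_None'; depending on the
                # Python and PyTables versions it may come back as bytes.
                if (isinstance(item, (bytes, str))
                        and item in (b'_None', '_None')):
                    item = None
            outdict[child._v_name] = item
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict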

It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes except Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.

score 0 · Answer

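This is not a direct answer. Anyway, you may also be interested in JSON. Have a look at section 13.10, "Serializing Datatypes Unsupported by JSON". It shows how to extend the format for unsupported types.

The whole chapter from "Dive into Python 3" by Mark Pilgrim is definitely a good read, at least to know about...

Update: Possibly an unrelated idea... I have read somewhere that one of the reasons XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly not so space-efficient solution and compress it with zip or another well-known algorithm. Using a known algorithm helps when you need to debug (you can unzip the file and then inspect the text by eye).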
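To make the two ideas above concrete, here is a minimal sketch, assuming numpy arrays are the unsupported type of interest. It extends JSON through the standard json module's default and object_hook parameters and then compresses the text with gzip. The function names and the "__ndarray__" marker key are invented for this illustration and are not taken from the book.

```python
import gzip
import json

import numpy as np


def encode_extra(obj):
    """'default' hook: called only for objects json cannot encode itself."""
    if isinstance(obj, np.ndarray):
        return {"__ndarray__": obj.tolist(), "dtype": str(obj.dtype)}
    if isinstance(obj, np.generic):
        return obj.item()  # numpy scalar -> plain Python scalar
    raise TypeError("%r is not JSON serializable" % (obj,))


def decode_extra(d):
    """'object_hook': called for every JSON object while decoding."""
    if "__ndarray__" in d:
        return np.array(d["__ndarray__"], dtype=d["dtype"])
    return d


def dump_json_gz(path, data):
    """Serialize to JSON text, then gzip it."""
    text = json.dumps(data, default=encode_extra)
    with gzip.open(path, "wb") as fh:
        fh.write(text.encode("utf-8"))


def load_json_gz(path):
    """Inverse of dump_json_gz."""
    with gzip.open(path, "rb") as fh:
        text = fh.read().decode("utf-8")
    return json.loads(text, object_hook=decode_extra)
```

The file can be inspected with any gunzip tool, which is the debugging benefit mentioned above. The price is that arrays travel as decimal text, so this is likely to be slower and larger than a binary format for big arrays.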

score 0 · Answer

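I would absolutely suggest a Python object database like ZODB. It seems pretty well suited to your situation, considering that you store objects (literally whatever you like) in a dictionary, which means you can store dictionaries inside dictionaries. I have used it for a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With this, they will be able to read it in, perform any queries they wish, and modify their own local copies. If you want multiple programs to access the same database simultaneously, be sure to look at ZEO.

Just a silly example of how to get started: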

from ZODB import DB
from ZODB.FileStorage import FileStorage
import transaction
from persistent.dict import PersistentDict

# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs') 
db = DB(storage)
connection = db.open()
root = connection.root()

# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here

root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here

# add as many objects with as many characteristics as you like.

# Committing the changes; up until this point things can be rolled back
# by calling transaction.get().abort() instead.
transaction.get().commit()
connection.close()
db.close()
storage.close()

Once the database is created, it is very easy to use. Since it is an object database (a dictionary), you can access objects very easily:

# after it's opened (the lines from the very beginning, up to and including root = connection.root())
>>> root['Vehicle']['Tesla Model S']['range']
'208 miles'

You can also display all of the keys (and do all the other standard dictionary things you might want to do):

>>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']

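The last thing I want to mention is that keys can be changed: see "Changing the key value in a python dictionary". Values can also be changed, so if your research results change because you changed your method or something, you don't have to start the entire database from scratch (especially if everything else is still okay). Be careful when doing both of these. I put safety measures in my database code to make sure I am aware of any attempt to overwrite keys or values.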
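The answer does not show those safety measures, so the following is only a guess at what such guards might look like, sketched for any dict-like mapping. The helper names rename_key and set_value are hypothetical. Renaming relies on the usual idiom of popping the old key and assigning the value to the new one.

```python
def set_value(mapping, key, value, overwrite=False):
    """Assign mapping[key], refusing to replace a value unless asked."""
    if key in mapping and not overwrite:
        raise KeyError("refusing to overwrite existing key %r" % (key,))
    mapping[key] = value


def rename_key(mapping, old, new, overwrite=False):
    """Move mapping[old] to mapping[new] without silently clobbering."""
    if old not in mapping:
        raise KeyError(old)
    if old == new:
        return
    if new in mapping and not overwrite:
        raise KeyError("refusing to overwrite existing key %r" % (new,))
    mapping[new] = mapping.pop(old)


car = {"range": "208 miles"}
rename_key(car, "range", "range_miles")
set_value(car, "acceleration", 5.9)
```

Passing overwrite=True makes the intent explicit at the call site, which is one simple way to stay aware of overwrites.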

** Added **

# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()

# insert into definition/population section
np.save(outfile,np.linspace(-1,1,10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile

# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>

outfile.seek(0)  # rewind to the start; simulates closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])

>>> A
array([-1.        , -0.99979998, -0.99959996, ...,  0.99959996,
        0.99979998,  1.        ])

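In exactly the same way, you could use numpy.savez() to store multiple numpy arrays together (numpy.savez_compressed() is the variant that actually compresses them).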
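As a rough illustration of the multi-array case, the sketch below packs several named arrays into one compressed archive held in memory and returns it as a bytes value. It assumes you would rather store plain bytes than an open file handle, since bytes can be pickled; that is a design choice of this sketch and not something the answer prescribes. The helper names are hypothetical.

```python
import io

import numpy as np


def arrays_to_bytes(**arrays):
    """Pack named arrays into one compressed .npz archive in memory."""
    buf = io.BytesIO()
    np.savez_compressed(buf, **arrays)
    return buf.getvalue()


def bytes_to_arrays(blob):
    """Unpack the archive back into a plain dict of arrays."""
    with np.load(io.BytesIO(blob)) as npz:
        return {name: npz[name] for name in npz.files}


blob = arrays_to_bytes(x=np.linspace(-1, 1, 10000),
                       mask=np.zeros(10000, dtype=bool))
restored = bytes_to_arrays(blob)
```

The resulting blob can be assigned as an ordinary dictionary value, for example in place of the outfile handle used above.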
