c++ - 効率的な方法で boost::bimap を作成、アクセス、保存、ロードする

Question

以前の質問の続きとして、ビットセットをシリアル化して、同じデータに対して bimap を繰り返し作成しないようにするため、必要に応じて bimap を保存してロードします。

ハッシュ技術を使用し、検索に O(1) 操作が必要なため、データ (ビットセット) をペアでboost::bimap格納することにしました。<key,value>にはbimap4,000 万のビットセットエントリがあり、次の操作を実行する場合があります。

最小限の時間でビットセットを挿入しbimapます。私の以前の質問への回答には、より多くの時間がかかります( 2で指定されたハッシュ関数と比較すると、5 倍の 2500 万ビットセットエントリで約 5 秒)。同じ理由unordered_set_ofで andunordered_multiset_ofが使用されます。

bimap以下のハッシュ関数とは異なり、できるだけメモリの消費を抑えてコピーを避けたい。

namespace std {
    template <typename Block, typename Alloc>
    struct hash<boost::dynamic_bitset<Block, Alloc> > {

        using bitset_type = boost::dynamic_bitset<Block, Alloc>;
        using block_type = typename bitset_type::block_type ;

        size_t operator()(boost::dynamic_bitset<Block, Alloc> const& bs) const
        {
            thread_local static std::vector<block_type> block_data;
            auto blocks = bs.num_blocks();
            block_data.assign(blocks, 0);
            to_block_range(bs, block_data.begin());
            return boost::hash<std::vector<block_type>>()(block_data);
        }
    };
}

O(1) キー/値を検索します。
短時間でバイマップを読み込みます。繰り返しになりますが、bimap のロードにはかなりの時間がかかります(25 万エントリの bimap、サイズ 12 MB の場合、約 20 秒)。

したがって、回答コード@seheが以下に示されている、すでに尋ねられた質問に対して1、2、3、および4を達成したいと考えています。

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/unordered_multiset_of.hpp>
#include <boost/bimap/unordered_set_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <iostream>
#include <string>

#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/stream_buffer.hpp>
#include <boost/iostreams/stream.hpp>

#include <boost/functional/hash.hpp>

namespace serial_hashing { // see https://stackoverflow.com/questions/30097385/hash-an-arbitrary-precision-value-boostmultiprecisioncpp-int
    namespace io = boost::iostreams;

    struct hash_sink {
        hash_sink(size_t& seed_ref) : _ptr(&seed_ref) {}

        typedef char         char_type;
        typedef io::sink_tag category;

        std::streamsize write(const char* s, std::streamsize n) {
            boost::hash_combine(*_ptr, boost::hash_range(s, s+n));
            return n;
        }
      private:
        size_t* _ptr;
    };

    template <typename T> struct hash_impl {
        size_t operator()(T const& v) const {
            using namespace boost;
            size_t seed = 0;
            {
                iostreams::stream<hash_sink> os(seed);
                archive::binary_oarchive oa(os, archive::no_header | archive::no_codecvt);
                oa << v;
            }
            return seed;
        }
    };
}

namespace std {
    template <typename Block, typename Alloc> struct hash<boost::dynamic_bitset<Block, Alloc> >
        : serial_hashing::hash_impl<boost::dynamic_bitset<Block, Alloc> > 
    {};
} // namespace std

namespace bimaps = boost::bimaps;
using Bitset = boost::dynamic_bitset<>;

typedef boost::bimap<
    bimaps::unordered_set_of<Bitset, std::hash<Bitset> >,
     bimaps::unordered_multiset_of<Bitset, std::hash<Bitset> > > Index;

int main() {
    using namespace std::string_literals;

    {
        std::cout << "# Writing binary file ... " << std::endl;
        Index index;
        index.insert({Bitset("10010"s), Bitset("1010110110101010101"s)});

        std::ofstream ofs("binaryfile", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);
        oa << index;
    }

    {
        std::cout << "# Loading binary file ... " << std::endl;
        std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file

        boost::archive::binary_iarchive ia(ifs);

        Index index;
        ia >> index;
    }
}

編集

AIM 私は、たとえば2000万文字以上の大きな文字列と、長さ200文字以上の4000万から1億文字の短い文字列がある実際の例を持っています。私の目的は、これらの短い文字列を大きな文字列で検索することです。大きな文字列のビットセットを作成してからbimap、バイマップで短い文字列を検索することを考えました。unorderedとは異なり、挿入と検索を非常に高速に行うために使用することも考えましたordered。

キービットセットの長さは 3 ～ 40 程度です (一度にすべての組み合わせ)。

値のビットセットの長さは約 100 ～ 2000 です (たとえば、100 の場合は一度に 1 つだけです。すべての値のエントリは約 90 ～ 110 程度になります)。

score 1 · Accepted Answer

ビットセットを使用した順序付けられていないマップの観点から、質問全体を組み立てました。目標は何ですか？この設計でモデル化している実際の問題は何ですか?

あなたのビットセットの大きさはどれくらいですか？サイズのばらつきは？ビットの特定のサブセットの分布は? この汎用的なアプローチよりも、いくつかのことを想定して、迅速かつダーティなアドホックハッシュを使用する方がはるかに優れている可能性があります (以下の ¹ および ² を参照)。

割り当てを減らし、「パックされた」データを非ノードベースのコンテナーにロードし、要素がいつソートされるかを制御する必要があります (その不変条件を常に保持するのではなく)。

このようなコンテナを Boost Interprocess 共有メモリセグメント / メモリマップファイルに配置すると、優れた結果が得られました。

ベンチマーク

次のコードを使用して、データの生成/保存/読み込みのベンチマークを行いました。

これは、ハッシュテーブルの選択をオプトアウトすることを除いて、上記の提案のいずれも実装しないことに注意してください。キーを挿入または検索するたびにアーカイブをインスタンス化する必要がないことは、非常に役立ちます。また、負荷係数に達するとハッシュテーブルが再ハッシュされることに注意してください。チューニングは、それらが実際にスムーズに機能するための本質です。

Live On Wandbox

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/bimap.hpp>
#include <boost/bimap/multiset_of.hpp>
#include <boost/dynamic_bitset/serialization.hpp>
#include <fstream>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>

namespace bimaps = boost::bimaps;
using Block = uint32_t;
using Bitset = boost::dynamic_bitset<Block>;

typedef boost::bimap<bimaps::set_of<Bitset>, bimaps::multiset_of<Bitset>> Index;

template <typename Caption, typename F>
auto timed(Caption const& task, F&& f) {
    using namespace std::chrono;
    using namespace std::chrono_literals;
    struct _ {
        high_resolution_clock::time_point s;
        Caption const& task;
        ~_() { std::cout << " -- (" << task << " completed in " << (high_resolution_clock::now() - s) / 1.0s << "s)\n"; }
    } timing { high_resolution_clock::now(), task };

    return f();
}

int main(int argc, char**) {
    using namespace std::string_literals;

    auto gen_bitset = [
        data=std::vector<Block>(64), // max 2048 bits
        prng=std::mt19937{42} // { std::random_device{}() }
    ]() mutable {
        auto length_gen = std::uniform_int_distribution<size_t>(data.size()/2, data.size());
        auto block_gen = std::uniform_int_distribution<Block>{};

        size_t n = length_gen(prng);
        std::generate_n(data.begin(), n, [&]{ return block_gen(prng); });

        return Bitset(data.begin(), data.begin()+n);
    };

    if (argc>1) {
        std::cout << "# Creating ... " << std::endl;
        Index index;

        timed("Generating data set", [&] {
            for (size_t i = 0; i < 52<<19; ++i) {
                index.insert({gen_bitset(), gen_bitset()});
            }
        });

        timed("Writing binary file", [&] {
            std::ofstream ofs("binaryfile", std::ios::binary);
            boost::archive::binary_oarchive oa(ofs);
            oa << index;
        });
        std::cout << "Written " << index.size() << " key/value pairs\n";
    } else {
        std::cout << "# Loading ... " << std::endl;
        Index index;

        timed("Loading binary file", [&] {
            std::ifstream ifs("binaryfile", std::ios::binary); // name of loading file
            boost::archive::binary_iarchive ia(ifs);
            ia >> index;
        });

        std::cout << "Roundtripped " << index.size() << " key/value pairs\n";
    }
}

これにより、27262976 個のキーと値のペアの 11G ファイルが作成されます。すべてのキーと値は、長さが 1024 ～ 2048 ビットに均一に分散された、均一にランダムなビットセットです。

 rm binaryfile                                                                    
 time ./sotest 1                                                                  
     -- (Generating data set completed in 228.499s)
     -- (Writing binary file completed in 106.083s)
    Written 27262976 key/value pairs

    real    5m48.362s
    user    5m32.876s
    sys     0m14.704s
 ls -ltrah binaryfile 
    -rw-rw-r-- 1 sehe sehe 11G dec 14 01:16 binaryfile
 time ./sotest
    # Loading binary file ... 
     -- (Loading binary file completed in 135.167s)
    Roundtripped 27262976 key/value pairs

    real    2m19.397s
    user    2m11.624s
    sys     0m7.764s

データセットを 25 万エントリに減らすと、106MiB¹ のファイルと次のタイミングが得られます。

 rm binaryfile 
 time ./sotest 1
    # Creating ... 
     -- (Generating data set completed in 1.13267s)
     -- (Writing binary file completed in 0.586325s)
    Written 262144 key/value pairs

    real    0m1.825s
    user    0m1.676s
    sys     0m0.140s
 ls -ltrah binaryfile 
    -rw-rw-r-- 1 sehe sehe 106M dec 14 01:44 binaryfile
 time ./sotest
    # Loading ... 
     -- (Loading binary file completed in 0.776928s)
    Roundtripped 262144 key/value pairs

    real    0m0.823s
    user    0m0.764s
    sys     0m0.056s

¹これは基本的に、ビットセットがはるかに小さいことを示しています-これは、他のデータ構造の選択肢を強く支持する可能性があると思います

² 古い非スケーリングハッシュの実装が Richard Hodges によって古い回答で書かれていることに気付きました。何が起こっているか分かりますか？X を要求すると、人々があなたの問題を本当に知るには情報が少なすぎて、X に対する答えが得られます。しかし、それは最適ではありません。あなたの本当の問題は別のものです。

StackOverflow には最高のプログラマーが配置されている可能性がありますが、X/Y の問題を魔法のように見抜くことはできません。結局のところ、私たちは超能力者ではありません。あらゆる段階であなたを導くことができる先輩メンター/同僚と一緒に働くことに代わるものはありません.

c++ - 効率的な方法で boost::bimap を作成、アクセス、保存、ロードする

1 に答える 1

ベンチマーク

Related

Reference