c++ - C++ マップベースのコレクションを効率的に並べ替える方法

Question

ユーザー識別子（int）を購入したさまざまな製品の数（int）にマッピングする大規模な（ish - > 100K）コレクションがあります。見つけるために、データをできるだけ効率的に再編成する必要があります異なる数の製品を持っているユーザーの数。たとえば、1 つの製品を持っているユーザーの数、2 つの製品を持っているユーザーの数などです。

元のデータを a から a に逆にすることでこれを達成しstd::mapました(キーと値が単純に逆になっています)。次に、 N 個の製品を使用しているstd::multimapユーザーの数を抽出できます(ただし、値をセットに一意に保存したので、繰り返していた値の正確な数とその順序を確認できます)count(N)

コードは次のようになります。

// uc is a std::map<int, int> containing the  original
// mapping of user identifier to the count of different
// products that they've bought.
std::set<int> uniqueCounts;
std::multimap<int, int> cu; // This maps count to user.

for ( map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    cu.insert( std::pair<int, int>( it->second, it->first ) );
    uniqueCounts.insert( it->second );
}

// Now write this out
for ( std::set<int>::const_iterator it = uniqueCounts.begin();
        it != uniqueCounts.end();  ++it )
{
    std::cout << "==> There are "
            << cu.count( *it ) << " users that have bought "
            << *it << " products(s)" << std::endl;
}

これが最も効率的な方法ではないと感じずにはいられません。これを行う賢い方法を知っている人はいますか？

Boost または C++11 を使用してこれを行うことができないという制限があります。

ああ、また、誰かが疑問に思っているかもしれませんが、これは宿題でも面接の質問でもありません.

score 4 · Accepted Answer

1 人のユーザーが購入できる製品の最大数を知っていると仮定すると、ベクトルを使用して操作の結果を格納するだけでパフォーマンスが向上する可能性があります。元のマップのほぼすべてのエントリに割り当てが必要になるため、これはおそらく最速のオプションではありません。

また、マップのルックアップオーバーヘッドを削減し、メモリの局所性を活用し、マルチマップに対するカウントの呼び出し (これは一定時間の操作ではありません) をベクトルの一定時間のルックアップに置き換えます。

したがって、次のようなことができます。

std::vector< int > uniqueCounts( MAX_PRODUCTS_PER_USER );

for ( map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    uniqueCounts[ uc.second ]++;
}

// Now write this out
for ( int i = 0, std::vector< int >::const_iterator it = uniqueCounts.begin();
        it != uniqueCounts.end();  ++it, ++i )
{
    std::cout << "==> There are "
            << *it << " users that have bought "
            << i << " products(s)" << std::endl;
}

製品の最大数がわからない場合でも、必要に応じて最大数を推測し、このコードを適応させてベクトルのサイズを大きくすることができるようです。とにかく、元の例よりも割り当てが少なくなることは間違いありません。

これはすべて、もちろんこのデータを処理した後は実際にはユーザー ID を必要としないことを前提としています (また、以下のコメントで指摘されているように、各ユーザーが購入した製品の数は比較的少なく、連続したセットです。それ以外の場合は、ベクトルの代わりにマップを使用する方がよい場合があります - multimap::count 関数を呼び出すことは避けられますが、他の利点の一部が失われる可能性があります)

score 2 · Accepted Answer

「より効率的」とはどういう意味かによって異なります。まず、これは本当にボトルネックですか？確かに、100k エントリは多いですが、数分ごとにこれを実行する必要がある場合は、アルゴリズムに数秒かかっても問題ありません。

私が見る唯一の改善点は、メモリ使用量です。これが懸念される場合は、マルチマップの生成をスキップして、次のようなカウンターマップを保持することができます (私の C++ は少し錆びていることに注意してください)。

std::map<int, int> countFrequency; // count => how many customers with that count

for ( std::map<int, int>::const_iterator it = uc.begin();
        it != uc.end();  ++it )
{
    // If it->second is not yet in countFrequency, 
    // the default constructor initializes it to 0.
    countFrequency[it->second] += 1;
}

// Now write this out
for ( std::map<int, int>::const_iterator it = countFrequency.begin();
        it != countFrequency.end();  ++it )
{
    std::cout << "==> There are "
            << it->second << " users that have bought "
            << it->first << " products(s)" << std::endl;
}

countユーザーが追加されてアイテムを購入した場合は、次のように更新countFrequencyできます

countFrequency[count] += 1;

既存のユーザーがアイテムに移動した場合はoldCount、newCount次のように更新できcountFrequencyます

countFrequency[oldCount] -= 1;
countFrequency[newCount] += 1;

余談ですが、読みやすくするために、unsigned intfor カウント (負のカウントの正当な理由がない限り) を使用し、userID型を typedef することをお勧めします。

score 1 · Accepted Answer

可能であれば、両方のデータを常に最新の状態に保つことをお勧めします。つまり、購入した製品の数をその数の製品を購入した顧客の数にマッピングする 2 番目のマップを維持します。あなたがそれを維持していれば、このマップにはあなたの質問に対する正確な答えが含まれています。顧客が製品を購入するたびに、この顧客が現在購入した製品の数を n とします。キー n-1 の値から 1 を引きます。キー n の値に 1 を加算します。キーの範囲が十分に小さい場合、これはマップではなく配列になる可能性があります。1 人の顧客が何百もの製品を購入することを期待したことがありますか?

score 1 · Accepted Answer

vectorここでは、データが小さい場合とmap、1 人のユーザーが本当にばかげた数の製品を購入した場合をカバーするために使用する混合アプローチを示します。ストアアプリで後者が本当に必要になるとは思えませんが、問題のより一般的なバージョンではそれが役立つ可能性があります。

typedef std::map<int, int> Map;
typedef Map::const_iterator It;

template <typename Container>
void get_counts(const Map &source, Container &dest) {
    for (It it = source.begin(); it != source.end(); ++it) {
        ++dest[it->second];
    }
}

template <typename Container>
void print_counts(Container &people, int max_count) {
    for (int i = 0; i <= max_count; ++i) {
        if contains(people, i) {
            std::cout << "==> There are "
                << people[i] << " users that have bought "
                << i << " products(s)" << std::endl;
        }
    }
}


// As an alternative to this overloaded contains(), you could write
// an overloaded print_counts -- after all the one above is not an 
// efficient way to iterate a sparsely-populated map. 
// Or you might prefer a template function that visits
// each entry in the container, calling a specified functor to
// will print the output, and passing it the key and value.
// This is just the smallest point of customization I thought of.
bool contains(const Map &c, int key) {
    return c.count(key);
}
bool contains(const std::vector<int, int> &c, int key) {
    // also check 0 < key < c.size() for a more general-purpose function
    return c[key]; 
}

void do_everything(const Map &uc) {
    // first get the max product count
    int max_count = 0;
    for (It it = uc.begin(); it != uc.end(); ++it) {
        max_count = max(max_count, it->second);
    }

    if (max_count > uc.size()) { // or some other threshold
        Map counts;
        get_counts(uc, counts);
        print_counts(counts, max_count);
    } else {
        std::vector<int> counts(max_count+1);
        get_counts(uc, counts);
        print_counts(counts, max_count);
    }
}

ここからリファクタリングして、カウントにaまたは aCountReOrdererのどちらを使用するかを示すテンプレートパラメーターを受け取るclass template を作成できます。vectormap

c++ - C++ マップベースのコレクションを効率的に並べ替える方法

4 に答える 4

Related

Reference