mongodb - MongoDB Aggregation Framework performance slows down over millions of documents

Question

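Background

Our system is carrier grade and extremely robust. It has been load tested to handle 5000 transactions per second, and for each transaction a document is inserted into a single MongoDB collection (the application is write-only: it performs no updates or queries). This amounts to roughly 700 million documents per day, which is our benchmark.

The MongoDB deployment is not yet sharded. We have one replica set with 1 master and 2 slaves, all of which are m2.2xlarge instances on EC2. Each instance is backed by a 1TB RAID0 stripe consisting of 8 volumes (no PIOPS). To optimize write performance we use the node-mongodb-native driver with the C++ native BSON parser, and we have tried to model the document structure accordingly.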

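Notes

Documents are small (120 bytes)
Documents include "time buckets" (h[our], d[ay], m[onth], y[ear]) along with the t[ime] field
The collection has an index for querying by c[ustomer] and "a", which is a highly random but non-unique tag
We have looked into partitioning the data into separate collections, although in this example all data is hot.
We are also looking into pre-aggregation, but this cannot be done in real time.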
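
As an illustration of the bucket fields described in the notes, here is a minimal sketch of how they could be derived from the timestamp at insert time. The helper name `withTimeBuckets` is hypothetical, and two things are assumed rather than stated in the question: that buckets are computed in UTC, and that months are 1-based (so 11 means November).

```javascript
// Hypothetical helper: copy a document and add the t/h/d/m/y fields.
// Assumptions: UTC buckets, 1-based months (11 = November).
function withTimeBuckets(doc, t) {
  return Object.assign({}, doc, {
    t: t,
    h: t.getUTCHours(),
    d: t.getUTCDate(),
    m: t.getUTCMonth() + 1, // getUTCMonth() is 0-based
    y: t.getUTCFullYear()
  });
}

const example = withTimeBuckets(
  { a: 'tag_42', b: 'x', c: 'customer_1' },
  new Date('2013-11-05T14:30:00Z')
);
// example.h === 14, example.d === 5, example.m === 11, example.y === 2013
```

Storing the buckets as plain integers is what lets the `$match` and `$group` stages work on `y` and `m` directly instead of extracting them from `t` at query time.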

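Requirement

For reporting we need to calculate the number of unique "a" tags per month, along with their totals by customer, over any given period
A report takes approximately 60 seconds to run over a sample (the full collection) of 9.5 million documents collected over 2 hours. Details below: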

Document

{
  _id: ObjectID(),
  a: 'string',
  b: 'string',
  c: 'string' or <int>,
  g: 'string' or <not_exist>,
  t: ISODate(),
  h: <int>,
  d: <int>,
  m: <int>,
  y: <int>
}

Index

col.ensureIndex({ c: 1, a: 1, y: 1, m: 1, d: 1, h: 1 });

Aggregation Query

col.aggregate([
    { $match: { c: 'customer_1', y: 2013, m: 11 } },
    { $group: { _id: { c: '$c', y: '$y', m: '$m' }, a: { $addToSet: '$a' }, t: { $sum: 1 } } },
    { $unwind: '$a' },
    { $group: { _id: { c: '$_id.c', y: '$_id.y', m: '$_id.m', t: '$t' }, a: { $sum: 1 } } },
    { $sort: { '_id.m': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            y: '$_id.y', 
            m: '$_id.m',
            a: 1,
            t: '$_id.t'
        }
    },
    { $group: { _id: { c: '$c', y: '$y' }, monthly: { $push: { m: '$m', a: '$a', t: '$t' } } } },
    { $sort: { '_id.y': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            y: '$_id.y', 
            monthly: 1
        }
    },
    { $group: { _id: { c: '$c' }, yearly: { $push: { y: '$y', monthly: '$monthly' } } } },
    { $sort: { '_id.c': 1 } },
    {
        $project: {
            _id: 0,
            c: '$_id.c',
            yearly: 1
        }
    }    
]);

Aggregation Result

[
    {
        "yearly": [
            {
                "y": 2013,
                "monthly": [
                    {
                        "m": 11,
                        "a": 3465652,
                        "t": 9844935
                    }
                ]
            }
        ],
        "c": "customer_1"
    }
]

63181ms

Aggregation Explain

{
        "cursor" : "BtreeCursor c_1_a_1_y_1_m_1_d_1_h_1",
        "isMultiKey" : false,
        "n" : 9844935,
        "nscannedObjects" : 0,
        "nscanned" : 9844935,
        "nscannedObjectsAllPlans" : 101,
        "nscannedAllPlans" : 9845036,
        "scanAndOrder" : false,
        "indexOnly" : true,
        "nYields" : 27,
        "nChunkSkips" : 0,
        "millis" : 32039,
        "indexBounds" : {
                "c" : [ [ "customer_1", "customer_1" ] ],
                "a" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
                "y" : [ [ 2013, 2013 ] ],
                "m" : [ [ 11, 11 ] ],
                "d" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ],
                "h" : [ [ { "$minElement" : 1 }, { "$maxElement" : 1 } ] ]
        }
}

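Questions

Given the high frequency of inserts and our need to run ranged aggregation queries over time, are time buckets good practice, considering that the application can insert 30 million documents in a single hour?
We were of the understanding that MongoDB can query billions of documents in seconds:
- Surely our aggregation query over 9.5 million documents could return in 1 second or so?
- Are we using the right technique to achieve this? If not, where should we focus our efforts to get report results almost instantly?
- Is this possible without sharding at this stage?
Would MapReduce (parallel) be a better alternative?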

score 0 · Accepted Answer

I would suggest trying an index on y, m, and d (year, month, and date?) in that order. These are known to be ints, whereas c can currently be either an int or a string. Since your data is time-based, this could make sense as well.
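
A minimal sketch of that suggestion, reusing the `ensureIndex` call from the question. The wrapper name `ensureTimeIndex` is hypothetical, and `col` is assumed to be the same collection handle the question uses. Whether this index actually beats the existing one has not been measured here, so compare the explain output before and after.

```javascript
// Hypothetical wrapper around the ensureIndex call shown in the question.
// Assumption: `col` is the same collection handle used in the question.
function ensureTimeIndex(col) {
  // Key order matters in a compound index: year, then month, then day.
  const keys = { y: 1, m: 1, d: 1 };
  return col.ensureIndex(keys);
}
```

Usage would be `ensureTimeIndex(col);` once, before re-running the report.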

score 0 · Accepted Answer

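I don't see why the $unwind is needed, or why you need to group by the total. It also looks buggy to me, since for every unwound `a` value it outputs the same `t` value that was computed for the whole time bucket.

As far as I understand it, your query should look like this: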

col.aggregate([
  // Pre-filter
  { $match: { /* ... */ } },

  // Pre-sort to aid in grouping
  { $sort: { c: 1, y: 1, m: 1, a: 1 } },

  // Group by month, customer and `a` to find unique `a` values and their totals
  { $group: {
      _id: { c: '$c', y: '$y', m: '$m', a: '$a' },
      t: { $sum: 1 }
    }
  },

  // Not sure if another sort is required at this point, I'd assume MongoDB
  // is smart enough to understand we're grouping by a subset of the original
  // grouping key

  // Group by month and customer to count unique `a` values and grand total
  { $group: {
      _id: { c: '$_id.c', y: '$_id.y', m: '$_id.m' },
      a: { $sum: 1 },   // number of unique `a` values in group
      t: { $sum: '$t' } // rolled-up total of all `a`-totals in group
    }
  }

  // You can tack on further groupings by year and customer here,
  // although I believe these would be better done in the UI layer
]);

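So, basically, starting your pipeline with the unwind and regroup, along with the intermediate sorts, could be what slows it down. See whether this version performs better, and try adding a sort between the groups if it helps.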
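
To sanity-check the logic of the two grouping stages without a database, they can be mimicked in plain JavaScript over an in-memory array. This is only a sketch: `monthlyReport` and the sample documents are made up for illustration, and it says nothing about how MongoDB itself will perform.

```javascript
// Hypothetical in-memory stand-in for the two $group stages above.
function monthlyReport(docs) {
  // Stage 1: one entry per (c, y, m, a), holding its document count t.
  const perTag = new Map();
  for (const { c, y, m, a } of docs) {
    const key = JSON.stringify([c, y, m, a]);
    perTag.set(key, (perTag.get(key) || 0) + 1);
  }

  // Stage 2: per (c, y, m), count the stage-1 entries (unique `a` values)
  // and sum their per-tag totals.
  const perMonth = new Map();
  for (const [key, t] of perTag) {
    const [c, y, m] = JSON.parse(key);
    const monthKey = JSON.stringify([c, y, m]);
    const row = perMonth.get(monthKey) || { c, y, m, a: 0, t: 0 };
    row.a += 1;
    row.t += t;
    perMonth.set(monthKey, row);
  }
  return Array.from(perMonth.values());
}

const sample = [
  { c: 'customer_1', y: 2013, m: 11, a: 'x' },
  { c: 'customer_1', y: 2013, m: 11, a: 'x' },
  { c: 'customer_1', y: 2013, m: 11, a: 'y' },
  { c: 'customer_1', y: 2013, m: 12, a: 'x' }
];
const report = monthlyReport(sample);
// report[0] is { c: 'customer_1', y: 2013, m: 11, a: 2, t: 3 }
```

The first stage already collapses duplicates of `a`, so counting its output rows gives the number of unique tags, and no `$addToSet` or `$unwind` is needed.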
