database - Aggregating 100 million rows into a new set

Question

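My application has grown too large, and performance is starting to degrade rapidly.

I have a database table with 100 million rows.
I need to find the set of that data that falls between two dates.
I apply some algorithm to each row of that set.
I insert the result set (about 16 million rows) into a new table.

If you have solved this problem, could you explain how you did it?

Any technology is fine, NoSQL or SQL. I'm not asking which technology is better; I know this can be done in many different ways.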

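I'm just looking for answers along the lines of:

"I solved this on a similar dataset using 6 mongo shards and map reduce. Each machine has 32 GB of RAM." Or: "I used distributed partitions in SQL."

I have tried to optimize this as far as I can on a single machine with 128 GB of RAM and very high I/O, but it still takes hours to complete.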

score 0 · Accepted Answer

From your description it sounds like your data already fits onto a single machine, so sharding might not even be necessary. You can create a clustered index on your date-time column. Building that index could itself take a long time on a table of this size. Once you have it, selecting the 16M rows you need to process should be fairly quick, because rows in the date range are stored next to each other.
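A minimal sketch of the date-range lookup, assuming SQLite through Python's `sqlite3` module rather than whatever engine you actually run. SQLite has no `CREATE CLUSTERED INDEX` statement (SQL Server does), so the sketch approximates one with a `WITHOUT ROWID` table whose primary key starts with the date column, which stores rows in key order. The `events` table, its columns, and the sample data are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        created_at TEXT NOT NULL,     -- ISO-8601 date, so text order matches date order
        id         INTEGER NOT NULL,
        value      REAL NOT NULL,
        PRIMARY KEY (created_at, id)  -- rows are stored in this key order
    ) WITHOUT ROWID
    """
)

# Small stand-in data set: one row per day of 2023.
start = date(2023, 1, 1)
rows = [((start + timedelta(days=i)).isoformat(), i, float(i)) for i in range(365)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
conn.commit()

RANGE_SQL = (
    "SELECT created_at, id, value FROM events "
    "WHERE created_at >= ? AND created_at < ? "
    "ORDER BY created_at, id"
)


def select_range(lo, hi):
    """Return all rows with lo <= created_at < hi (half-open date range)."""
    return conn.execute(RANGE_SQL, (lo, hi)).fetchall()


def query_plan(lo, hi):
    """Return SQLite's plan description for the range query."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + RANGE_SQL, (lo, hi)).fetchall()
    return [row[3] for row in plan]


march = select_range("2023-03-01", "2023-04-01")
plan_details = query_plan("2023-03-01", "2023-04-01")
```

Printing `plan_details` is a quick way to check that the engine seeks on the key instead of scanning the whole table; other engines offer an equivalent `EXPLAIN`.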

Does the processing itself take a long time once you've found the 16M rows you need? If so, you may want to insert the raw 16M rows (without processing) into a staging table, then create additional indexes on that table to speed up the processing. If you can give more detail on this, I can offer some additional suggestions.
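The staging-table idea could look roughly like this, again assuming SQLite via Python's `sqlite3` purely for illustration. The table names, the `customer_id` column, and the `value * 2` step are hypothetical placeholders, since the question does not say what the per-row algorithm is or which columns it needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table standing in for the 100M-row table.
conn.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY,"
    " created_at TEXT NOT NULL,"
    " customer_id INTEGER NOT NULL,"
    " value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO events (created_at, customer_id, value) VALUES (?, ?, ?)",
    [(f"2023-{m:02d}-15", m % 3, float(m)) for m in range(1, 13)],
)

# Step 1: copy the raw rows in the date range into a staging table, unprocessed.
conn.execute(
    "CREATE TABLE staging ("
    " id INTEGER PRIMARY KEY,"
    " created_at TEXT NOT NULL,"
    " customer_id INTEGER NOT NULL,"
    " value REAL NOT NULL)"
)
conn.execute(
    "INSERT INTO staging "
    "SELECT id, created_at, customer_id, value FROM events "
    "WHERE created_at >= ? AND created_at < ?",
    ("2023-03-01", "2023-07-01"),
)

# Step 2: index the staging table for whatever the processing looks up.
# (customer_id is an assumed example of such a column.)
conn.execute("CREATE INDEX idx_staging_customer ON staging (customer_id)")

# Step 3: run the processing against the small staging table only.
# value * 2 is a placeholder for the real per-row algorithm.
conn.execute(
    "CREATE TABLE results ("
    " source_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " processed REAL NOT NULL)"
)
conn.execute(
    "INSERT INTO results "
    "SELECT id, customer_id, value * 2 FROM staging ORDER BY customer_id"
)
conn.commit()

staged_count = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
result_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
processed = sorted(r[0] for r in conn.execute("SELECT processed FROM results"))
```

Building the extra indexes after the bulk copy, not before, avoids maintaining them during the insert.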

If the database keeps growing, traditional time-based sharding may also be effective. You create a new database for every month of data, and your application layer determines which database(s) to query and merges the results. This lets you purge old data by simply dropping databases instead of selectively deleting massive amounts of data from existing tables. The latter can cause performance problems for other queries running at the same time on a live system.
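The application-layer routing for monthly shards might be sketched as below. The `events_YYYY_MM` naming scheme and both function names are assumptions made for illustration, and the merge step assumes each shard returns its rows already sorted by a timestamp in the first field.

```python
import heapq
from datetime import date
from typing import Iterable, List, Sequence, Tuple


def monthly_databases(start: date, end: date, prefix: str = "events_") -> List[str]:
    """Name every monthly database that overlaps the range [start, end]."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}_{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return names


def merge_results(per_shard_rows: Iterable[Sequence[Tuple]]) -> List[Tuple]:
    """Merge per-shard result lists, each already sorted by row[0]."""
    return list(heapq.merge(*per_shard_rows, key=lambda row: row[0]))


shards = monthly_databases(date(2023, 11, 20), date(2024, 2, 3))
merged = merge_results(
    [
        [("2023-11-21", "a"), ("2023-11-30", "b")],
        [("2023-12-05", "c")],
    ]
)
```

With this layout, purging a month amounts to dropping the database whose name `monthly_databases` would return for that month.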
