java - What are my options to store and query huge amounts of data where a lot of it is repeating?

Question

I am evaluating options for efficient data storage in Java. The data set is time stamped data values with a named primary key. e.g.

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223

Could be a stock price at a given point in time, so it is, I suppose, a classic time series data pattern. However, I really need a generic RDBMS solution which will work with any reasonable JDBC compatible database as I would like to use Hibernate. Consequently, time series extensions to databases like Oracle are not really an option as I would like the implementor to be able to use their own JDBC/Hibernate capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations are focused around defining periodical rollup and purge schedules where raw data is aggregated into DAY, WEEK, MONTH etc. tables, but the downside is the early loss of granularity and the slight inconvenience of period mismatches between periods stored in different aggregates.
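As a rough illustration of the rollup step described above, here is a minimal Java sketch that folds raw samples into one summary per day. The class and method names are invented for the example, and using UTC day boundaries is an assumption; a real schedule would also purge the raw rows once the summary is stored.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of any library mentioned in the question.
final class DailyRollupSketch {

    // Roll raw (timestamp, value) samples for one key up into one summary
    // (count, sum, min, max) per UTC calendar day.
    static Map<LocalDate, LongSummaryStatistics> rollUp(long[] epochMillis, long[] values) {
        Map<LocalDate, LongSummaryStatistics> byDay = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            LocalDate day = Instant.ofEpochMilli(epochMillis[i])
                    .atZone(ZoneOffset.UTC)
                    .toLocalDate();
            byDay.computeIfAbsent(day, d -> new LongSummaryStatistics())
                    .accept(values[i]);
        }
        return byDay;
    }
}
```

Once the raw rows are purged, only the per-day count, sum, minimum and maximum remain, which is the loss of granularity mentioned above.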

The challenge has limited options since there is an absolute limit to how much data can be physically compressed while retaining the original granularity of the data, and this limit is exacerbated by the directive of using a relational database, and a generic JDBC capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row while also storing a counter that indicates, effectively, "the next n records have the same value". The implementation of just that seems simple enough, but the trade-off is that the data model is now hideously complicated to query against using standard SQL, especially when using any sort of aggregate SQL functions. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render this data properly.
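To make the idea concrete, here is a minimal run-length sketch in Java. All names are hypothetical, and it assumes the samples for one key arrive in timestamp order. It also shows why aggregates get awkward: a sum over the compressed rows has to weight each value by its counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from an existing library.
final class RunLengthSketch {

    // One stored row: the timestamp of the first sample in a run, the value,
    // and how many consecutive samples (including the first) carried it.
    static final class Run {
        final long startMillis;
        final long value;
        int count;

        Run(long startMillis, long value) {
            this.startMillis = startMillis;
            this.value = value;
            this.count = 1;
        }
    }

    // Collapse consecutive identical values for a single key into runs.
    static List<Run> compress(long[] timestamps, long[] values) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (int i = 0; i < values.length; i++) {
            if (current != null && current.value == values[i]) {
                current.count++;
            } else {
                current = new Run(timestamps[i], values[i]);
                runs.add(current);
            }
        }
        return runs;
    }

    // Expand runs back into one value per original sample. Only the start
    // timestamp of each run is kept, so the timestamps of the repeated
    // samples are not recoverable from this representation.
    static long[] expandValues(List<Run> runs) {
        int total = 0;
        for (Run r : runs) {
            total += r.count;
        }
        long[] out = new long[total];
        int i = 0;
        for (Run r : runs) {
            for (int k = 0; k < r.count; k++) {
                out[i++] = r.value;
            }
        }
        return out;
    }

    // A sum over the compressed form must weight each value by its count.
    static long sum(List<Run> runs) {
        long s = 0;
        for (Run r : runs) {
            s += r.value * r.count;
        }
        return s;
    }
}
```

For the five samples 124, 124, 124, 130, 124 this stores three rows instead of five, and the weighted sum still matches the sum over the raw values.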

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with the dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I had in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports). But this does not really address the aggregate functions issue and probably has a bunch of other issues as well.

So I am part way to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focus on the possibly less complex task of writing a pseudo JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply this and the last thing I want to do (or can do, really....) is write my own database.

Any suggestions or insight?

score 4 · Accepted Answer

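Hibernate (or any JPA solution) is the wrong tool for this job.

JPA/Hibernate is not a lightweight solution. In high-volume applications the overhead is not just significant but prohibitive. You need to look at grid and cluster solutions. I won't repeat an overview of the various technologies here.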

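I have a lot of experience with financial market information systems. A few of the things you said stuck out to me:

You have a lot of raw data;
You want to apply various aggregations to that data (e.g. open/high/low/close daily summaries);
High availability is probably an issue (it always is in these kinds of systems); and
Low latency is probably an issue (ditto).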

Now, for grid/cluster type solutions, I divide them loosely into two categories:

Map-based solutions like Coherence or Terracotta; and
Javaspaces-based solutions like GigaSpaces.

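I have used Coherence a lot, and the Map solution can be nice, but it can be problematic too. Coherence maps can have listeners on them, and you can use this to do things like:

Market price alerts (users may want a notification when a price reaches a certain level);
Derivative pricing (e.g. an exchange-traded option pricing system will want to reprice when the last traded price of the underlying security changes);
A trade-matching/booking system may want to match received trade notifications for reconciliation purposes;
etc.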

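All of these can be done with listeners, but in Coherence, for example, listeners have to be cheap. That leads to things like a Map having a listener that writes something to another Map, and this can chain on for a while. Modifying the cache entry can also be problematic (although there are mechanisms for dealing with that kind of problem too; I am talking about situations like turning off a market price alert so that it does not trigger a second time).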
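The listener pattern can be sketched in plain Java as follows. This is deliberately not the Coherence API: the class, its methods and the `sink` list are stand-ins invented for the example. It shows a cheap listener that raises a market price alert once and then switches itself off.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Plain-Java stand-in for a map with update listeners (hypothetical names).
final class ListenableMapSketch {
    private final Map<String, Long> prices = new ConcurrentHashMap<>();
    private final List<BiConsumer<String, Long>> listeners = new CopyOnWriteArrayList<>();

    void addListener(BiConsumer<String, Long> listener) {
        listeners.add(listener);
    }

    void removeListener(BiConsumer<String, Long> listener) {
        listeners.remove(listener);
    }

    Long get(String key) {
        return prices.get(key);
    }

    // Store the new price, then notify every registered listener.
    void put(String key, long price) {
        prices.put(key, price);
        for (BiConsumer<String, Long> listener : listeners) {
            listener.accept(key, price);
        }
    }

    // One-shot alert: records a message in 'sink' the first time the price
    // for 'key' reaches 'threshold', then unregisters itself.
    void alertOnce(final String key, final long threshold, final List<String> sink) {
        addListener(new BiConsumer<String, Long>() {
            @Override
            public void accept(String updatedKey, Long price) {
                if (updatedKey.equals(key) && price >= threshold) {
                    sink.add(key + " reached " + price);
                    // Switch the alert off so it cannot trigger a second time.
                    removeListener(this);
                }
            }
        });
    }
}
```

The listener only appends to a list, which keeps it cheap; anything heavier would be handed off elsewhere, which is how the chains of maps and listeners described above come about.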

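I have found GigaSpaces-type grid solutions to be far more compelling for this kind of application. The read (or destructive read) operation is a highly elegant and scalable solution, and you can get transactional grid updates with sub-millisecond performance.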

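Consider the two classic queueing architectures:

Request/Response: a bad message can block the queue, and while you can have many senders and receivers (for scalability), scaling up the number of pipes is not always straightforward; and
Publish/Subscribe: this decouples the sender and receiver but lacks scalability, in that if you have multiple subscribers each of them receives the message (not necessarily what you want with, say, a booking system).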

In GigaSpaces, a destructive read is like a scalable publish/subscribe system, and a read operation is like a traditional publish/subscribe model. There are Map and JMS implementations built on top of the grid, and it can do FIFO ordering.
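The difference between a read and a destructive read can be sketched with a toy in-memory "space". This is not the GigaSpaces or JavaSpaces API; the class name is invented, and a `Predicate` stands in for template matching.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Toy single-JVM space (hypothetical); entries are matched in FIFO order.
final class TinySpaceSketch<T> {
    private final List<T> entries = new LinkedList<>();

    synchronized void write(T entry) {
        entries.add(entry);
    }

    // Non-destructive read: the entry stays in the space, so every
    // caller with a matching template sees it.
    synchronized Optional<T> read(Predicate<T> template) {
        for (T entry : entries) {
            if (template.test(entry)) {
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }

    // Destructive read: the first matching entry is removed, so exactly
    // one caller gets to process it.
    synchronized Optional<T> take(Predicate<T> template) {
        Iterator<T> it = entries.iterator();
        while (it.hasNext()) {
            T entry = it.next();
            if (template.test(entry)) {
                it.remove();
                return Optional.of(entry);
            }
        }
        return Optional.empty();
    }
}
```

With `take`, adding more consumers spreads the work without any entry being handled twice, which is the property a booking system would want.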

Now, what about persistence? Persistence is a consequence of deciding all the other stuff. For this kind of application I like the Persistence as a Service model (ironically written about Hibernate, but it applies to anything).

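Basically this means your data store hits are asynchronous, which works nicely with summary data. You can have a service listening for trade notifications that persists only the ones it is interested in (aggregating in memory if required). You can do open/high/low/close prices this way.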
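A minimal sketch of that arrangement, assuming a hypothetical `store` callback in place of the real database write: trades update an in-memory open/high/low/close bar per key, and `flush()` hands the finished bars to a background writer thread. Splitting bars by day is left out to keep the example short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Hypothetical service; the 'store' callback stands in for the persistence layer.
final class OhlcServiceSketch {
    // Per key: {open, high, low, close}. Arrays are replaced, never mutated.
    private final Map<String, long[]> bars = new ConcurrentHashMap<>();
    private final ExecutorService writer = Executors.newSingleThreadExecutor();
    private final BiConsumer<String, long[]> store;

    OhlcServiceSketch(BiConsumer<String, long[]> store) {
        this.store = store;
    }

    // Called for each trade notification; touches memory only.
    void onTrade(String key, long price) {
        bars.merge(key, new long[] {price, price, price, price}, (bar, ignored) ->
                new long[] {bar[0], Math.max(bar[1], price), Math.min(bar[2], price), price});
    }

    // Queue the current bars for persistence and return without waiting.
    void flush() {
        for (String key : bars.keySet()) {
            long[] bar = bars.remove(key);
            if (bar != null) {
                writer.submit(() -> store.accept(key, bar));
            }
        }
    }

    // Stop accepting work and wait briefly for queued writes to complete.
    boolean close() {
        writer.shutdown();
        try {
            return writer.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Only one summary row per key reaches the store on each flush, however many trades arrived in between.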

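For high-volume data you don't really want to write everything to the database. Not synchronously, anyway. A persistent store plus a data warehouse is probably the better route, but again this depends on requirements, volumes and so on.

It is a complicated topic and I have only really touched on it. Hope that helps.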

score 2 · Accepted Answer

I would look at a column-oriented database. It would be a great fit for this sort of application.

score 1 · Accepted Answer

Many JDBC-capable database management systems (e.g. Oracle) provide compression in the physical storage engine. Oracle, for example, has the notion of a "compressed" table without decompression overhead.

http://www.ardentperf.com/wp-content/uploads/2007/07/advanced-compression-datasheet.pdf

score 0 · Accepted Answer

Thanks for the answers.

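Cletus, I appreciate the outline, but one trade-off I cannot make is giving up DB flexibility and compatibility with JDBC/Hibernate, which is what allows the use of all the available tools. Moreover, although I did not state this clearly, I do not want to force my users into adopting a [possibly expensive] commercial solution. If they have Database Brand X, let them use it. If they don't care, we recommend open source Database Brand Y. I also do not want to get into the business of writing report generators.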

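While I have not really load tested it yet, I am very impressed with LucidDB. It is a column-oriented database, and it provides good query performance and seemingly good data compression. It has a JDBC driver, though as far as I can tell no Hibernate dialect exists for it yet. It also supports user-defined transformations, which I think will let me seamlessly implement my idea of compressing repeating and consecutive values into one "row" and then blowing them back out into multiple "synthetic" rows at query time, transparently to the query caller. Lastly, it supports a nifty foreign tables feature, where tables in other JDBC-supporting databases can be fronted by LucidDB. I think this may be invaluable for providing some level of support for other databases.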

Thanks for the pointer, Javaman. It steered me toward LucidDB.
