python - Preferred (or recommended) way to store large amounts of simulation configurations, runs values and final results

Question

I am working with a network simulator. After making some extensions to it, I need to run a lot of different simulations and tests. I need to record:

simulation scenario configurations
values of some parameters (e.g. buffer sizes, signal qualities, position) per devices per time unit t
final results computed from those recorded values

The second kind of data (the per-device, per-time-unit values) is needed for visualization after a simulation has finished (a simple animation, showing some statistics over time).

I am using Python with matplotlib etc. for post-processing the data and for writing a proper app (now considering pyQt or Django, but this is not the topic of the question). Now I am wondering what would be the best way to store this data?

My first guess was to use XML files, but the XML syntax may add too much overhead (the files can grow very large, especially for the second kind of data). So I tried to design a database... But this does not seem like the proper way to me either... Maybe a mix of both?

I have tried to find some clues on Google, but found nothing specific. Have you ever needed to store such data? How did you do it? Is there any "design pattern" for that?

score 5 · Accepted Answer

Separate concerns:

Apart from pondering on the technology to use for storing data (DBMS, CSV, or maybe one of the specific formats for scientific data), note that you have three very different kinds of data to manage:

Simulation scenario configurations: these are (typically) rather small, but they need to be simple to edit, simple to re-use, and should make it possible to reproduce a simulation run. Here, text or code files seem to be a good choice (these should also be version-controlled).
Raw simulation data: this is where you should be really careful if you are concerned with simulation performance, because writing 3 GB of data during a run can take a huge amount of time if implemented badly. One way to proceed would be to use existing file formats for this purpose (see below) and see if they work for you. If not, you can still use a DBMS. Also, it is usually a good idea to include a description of the scenario that generated the data (or at least a reference to it), as this helps you manage the results.
Data for post-processing: how to store this mostly depends on the post-processing tools. For example, if you already have a class structure for your visualization application, you could define a file format that makes it easy to read in the required data.
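To make the separation concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the file naming scheme, the function names (save_config, write_raw_samples, mean_buffer_size), the column names and the sample values are made up, and JSON/CSV are just one possible choice of formats. The point is only that the configuration lives in a small editable text file, the raw data is streamed to its own file with a reference back to the configuration, and the final result is computed from the raw file afterwards.

```python
import csv
import hashlib
import json
import tempfile
from pathlib import Path


def save_config(config: dict, out_dir: Path) -> str:
    """Write the scenario as sorted JSON text and return a short content hash."""
    text = json.dumps(config, sort_keys=True, indent=2)
    config_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    (out_dir / f"config_{config_id}.json").write_text(text, encoding="utf-8")
    return config_id


def write_raw_samples(config_id: str, samples, out_dir: Path) -> Path:
    """Stream per-device, per-time-step samples to one CSV file per run."""
    path = out_dir / f"raw_{config_id}.csv"
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        # First row: reference back to the scenario that produced the data.
        writer.writerow(["config_id", config_id])
        writer.writerow(["t", "device", "buffer_size", "signal_quality"])
        for row in samples:
            writer.writerow(row)
    return path


def mean_buffer_size(raw_path: Path) -> float:
    """Compute one 'final result' from the recorded raw values."""
    with raw_path.open(newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the config reference row
        header = next(reader)
        col = header.index("buffer_size")
        values = [float(row[col]) for row in reader]
    return sum(values) / len(values)


# Hypothetical scenario and samples, for illustration only.
out_dir = Path(tempfile.mkdtemp())
config = {"scenario": "baseline", "devices": 2, "duration": 2}
samples = [
    (0, "dev1", 10, 0.9),
    (0, "dev2", 20, 0.8),
    (1, "dev1", 30, 0.7),
]

config_id = save_config(config, out_dir)
raw_path = write_raw_samples(config_id, samples, out_dir)
final_result = mean_buffer_size(raw_path)
print(config_id, raw_path.name, final_result)
```

Because the ID is derived from the configuration text, re-running the same scenario maps to the same config file, which makes it easier to tell which results belong together. For multi-gigabyte runs you would probably want a binary or database format instead of CSV, but the split between the three kinds of data stays the same.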

Look for existing solutions:

The problem you face (How to manage simulation data?) is fundamental and there are many potential solutions, each coming with certain trade-offs. As you are working in network simulation, check out what capabilities other tools used in your community provide. It could be that their developers ran into problems you are not even anticipating yet (regarding reproducibility etc.), and already found a good solution. For example, you could check out how OMNeT++ is handling simulation output: the simulation configurations are defined in a separate file, results are written to vec and sca files (depending on their nature). As far as I understood your problems with hierarchical data, this is supported as well (vectors get unique IDs and are associated with an attribute of some model entity). Additional tools already work with these file formats, e.g. to convert them to other formats like CSV/MATLAB files, so you could even think of creating files in the same format (documented here) and to use existing tools/converters for post-processing.

Many other simulation tools will have similar features, so take a look at what would work best for you.

score 1 · Accepted Answer

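If your data structure is well known and stable, and you need some of the query and computation features of SQL, a lightweight relational database such as SQLite is a good fit (just make sure it can handle your eventual 3 GB or more of data).

Otherwise (that is, if each simulation scenario may need its own dedicated data structure to store its results) and you do not need SQL features, you may be better off with a more free-form solution (a document-oriented database, an OO database, filesystem + CSV, whatever).

You can still use an SQL database in the second case, but you would have to create a table dynamically for each result set and, of course, build the related SQL queries dynamically as well.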
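A rough sketch of what per-result-set tables could look like with Python's built-in sqlite3 module. The helper names (create_result_table, insert_rows), the table and column names, and the choice to store every value as REAL are assumptions made for this example. Note that table and column names cannot be passed as `?` parameters, so the sketch validates them before building the SQL string.

```python
import re
import sqlite3

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name: str) -> str:
    """Reject anything that is not a plain identifier before using it in SQL."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def create_result_table(conn, table, columns):
    """Create one table per result set: a time column plus the given columns."""
    cols = ", ".join(f"{_check(c)} REAL" for c in columns)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {_check(table)} (t INTEGER, {cols})"
    )


def insert_rows(conn, table, columns, rows):
    """Build the INSERT statement dynamically; the values still use placeholders."""
    names = ", ".join(_check(c) for c in ["t", *columns])
    marks = ", ".join("?" for _ in range(len(columns) + 1))
    conn.executemany(
        f"INSERT INTO {_check(table)} ({names}) VALUES ({marks})", rows
    )


conn = sqlite3.connect(":memory:")

# Two hypothetical scenarios that record different quantities.
create_result_table(conn, "scenario_a", ["buffer_size", "signal_quality"])
insert_rows(conn, "scenario_a", ["buffer_size", "signal_quality"],
            [(0, 10.0, 0.9), (1, 30.0, 0.7)])

create_result_table(conn, "scenario_b", ["position_x", "position_y"])
insert_rows(conn, "scenario_b", ["position_x", "position_y"],
            [(0, 1.0, 2.0)])

avg = conn.execute("SELECT AVG(buffer_size) FROM scenario_a").fetchone()[0]
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(avg, tables)
```

The extra bookkeeping (validating names, generating statements) is the price of keeping SQL in this case. If most scenarios end up with different columns, a free-form store may turn out to be less work.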

score 1 · Accepted Answer

It sounds like you need to record more or less the same kinds of information for each case, so a relational database sounds like a good fit. Why do you think it's "not the proper way"?

If your data fits in a collection of CSV files, you're most of the way to a relational database already! Just store in database tables instead, and you have support for foreign keys and queries. If you go on to implement an object-oriented solution, you can initialize your objects from the database.
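As a small illustration of moving from CSV files to tables, here is a sketch with the standard csv and sqlite3 modules. The two CSV "files" are inlined as strings so the example is self-contained, and the table layout (runs, samples), the column names and the values are all made up for the example.

```python
import csv
import io
import sqlite3

# Stand-ins for two CSV files: one row per run, one row per recorded sample.
RUNS_CSV = "run_id,scenario\n1,baseline\n2,high_load\n"
SAMPLES_CSV = (
    "run_id,t,device,buffer_size\n"
    "1,0,dev1,10\n"
    "1,1,dev1,30\n"
    "2,0,dev1,50\n"
)

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE runs (
    run_id   INTEGER PRIMARY KEY,
    scenario TEXT NOT NULL
);
CREATE TABLE samples (
    run_id      INTEGER NOT NULL REFERENCES runs(run_id),
    t           INTEGER NOT NULL,
    device      TEXT NOT NULL,
    buffer_size REAL NOT NULL
);
""")

runs = [(int(r["run_id"]), r["scenario"])
        for r in csv.DictReader(io.StringIO(RUNS_CSV))]
samples = [(int(r["run_id"]), int(r["t"]), r["device"], float(r["buffer_size"]))
           for r in csv.DictReader(io.StringIO(SAMPLES_CSV))]

conn.executemany("INSERT INTO runs VALUES (?, ?)", runs)
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", samples)
conn.commit()

# A query that would need hand-written joining code with plain CSV files.
averages = conn.execute("""
    SELECT r.scenario, AVG(s.buffer_size)
    FROM samples AS s
    JOIN runs AS r ON r.run_id = s.run_id
    GROUP BY r.scenario
    ORDER BY r.scenario
""").fetchall()
print(averages)
```

With the foreign key in place, inserting a sample for a run that does not exist raises sqlite3.IntegrityError, which catches a class of bookkeeping mistakes that loose CSV files would silently accept.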
