python - Accessing data range with h5py

Question

I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.

To explain more about what I'm doing:

import h5py
the_file = h5py.File("myfile.h5", "r")
data = the_file["data"]
att = data.keys()

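The previous code gives me a list of attributes such as "U", "T", "H", etc.

Let's say I want to know the minimum and maximum value of "U". How can I do that?

This is the output of running "h5dump -H":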

HDF5 "myfile.h5" {
GROUP "/" {
   GROUP "data" {
      ATTRIBUTE "datafield_names" {
         DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_SPACEPAD;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
         DATASPACE  SIMPLE { ( 62 ) / ( 62 ) }
      }
      ATTRIBUTE "dimensions" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      }
      ATTRIBUTE "time_variables" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
      }
      DATASET "Temperature" {
         DATATYPE  H5T_IEEE_F64BE
         DATASPACE  SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
      }

score 10 · Accepted Answer

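It might be a difference in terminology, but hdf5 attributes are accessed via the attrs attribute of a Dataset object. I call what you have variables or datasets. Anyway...

I'm guessing from your description that the attributes are just arrays. You should be able to do the following to get the data for each attribute and then calculate the min and max like you would for any numpy array: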

attr_data = data["U"][:] # gets a copy of the array
# named min_value/max_value so the builtins min() and max() are not shadowed
min_value = attr_data.min()
max_value = attr_data.max()

So if you want the min/max of each attribute, you can run a for loop over the attribute names, or you could use:

for attr_name, attr_value in data.items():
    min_value = attr_value[:].min()
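
To make the loop above concrete, here is a minimal sketch that collects both the min and the max of every dataset in a group. The file name, the dataset contents, and the `ranges` dictionary are made up for illustration; only the "data" group and the "U"/"T" names come from the question.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file with two small datasets standing in for "U" and "T".
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("data")
    grp.create_dataset("U", data=np.array([[1.0, 5.0], [4.0, -2.0]]))
    grp.create_dataset("T", data=np.array([6.0, 2.0, 3.0]))

ranges = {}
with h5py.File(path, "r") as f:
    for name, obj in f["data"].items():
        # Groups hold no array data, so only datasets are reduced.
        if isinstance(obj, h5py.Dataset):
            values = obj[...]  # read the whole dataset into a numpy array
            ranges[name] = (float(values.min()), float(values.max()))

print(ranges)
```

The isinstance check is there because a group may also contain subgroups, which cannot be sliced.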

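Edit to answer your first comment.

h5py's objects can be used like python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the name (or key) of that data. For example, if you run the_file.keys() you will get a list of every hdf5 group or dataset located in the root path of that hdf5 file. If you continue along a path you will end up at the dataset that holds the actual binary data. So, for example, you might start with (in an interpreter at first):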

import h5py

the_file = h5py.File("myfile.h5", "r")
print(list(the_file.keys()))
# this will result in a list of keys maybe ["raw_data","meta_data"] or something
print(list(the_file["raw_data"].keys()))
# this will result in another list of keys maybe ["temperature","humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]

print(list(the_data_var.attrs.keys()))
# this will result in a list of attribute names/keys
# (attrs already returns a numpy array or scalar, so no [:] is needed)
an_attr_of_the_data = the_data_var.attrs["measurement_time"]

# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data by doing like before
print(the_data_array.min())
print(the_data_array.max())
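
If you would rather not walk the path by hand, h5py groups also have a visititems method that calls a function for every object below a group. The following is only a sketch: the file layout ("raw_data", "meta_data", "temperature", "humidity") reuses the made-up names from the comments above, and `record` and `found` are hypothetical helper names.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout mirroring the made-up names used in the comments above.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("raw_data/temperature", data=np.zeros((2, 3)))
    f.create_dataset("raw_data/humidity", data=np.zeros(4))
    f.create_group("meta_data")

found = {}

def record(name, obj):
    # visititems calls this for every group and dataset below the root;
    # returning None tells h5py to keep going.
    if isinstance(obj, h5py.Dataset):
        found[name] = obj.shape

with h5py.File(path, "r") as f:
    f.visititems(record)

print(found)
```

This gives the full path and shape of every dataset in one pass, which is a quick way to see where the actual arrays live.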

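Edit 2 - Why do people format hdf files this way? It defeats the purpose.

I think you may have to talk to the person who made this file, if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U", "T", etc.? Unless h5py is doing something magical, or you didn't provide all of the h5dump output, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I am doing and don't just copy and paste into your terminal.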

# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The names of the attributes are "datafield_names","dimensions","time_variables"
# This should result in a list of those names:
print(list(data.attrs.keys()))

# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(list(data.keys()))
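
To check this reading of the header without the real file, here is a small sketch that writes a file with the same structure and reads it back. Everything about the contents is an assumption: three invented field names instead of 62, a tiny Temperature array, and arbitrary attribute values. Only the group, attribute, and dataset names and their types follow the h5dump output.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "layout.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("data")
    # Three made-up names instead of the real 62; "S8" is an 8-byte string.
    grp.attrs["datafield_names"] = np.array([b"U", b"T", b"H"], dtype="S8")
    grp.attrs["dimensions"] = np.array([3, 2, 4, 8], dtype=">i4")    # big-endian int32
    grp.attrs["time_variables"] = np.array([0.0, 1.5], dtype=">f8")  # big-endian float64
    grp.create_dataset("Temperature", data=np.zeros((2, 4, 8), dtype=">f8"))

with h5py.File(path, "r") as f:
    data = f["data"]
    attr_names = sorted(data.attrs.keys())
    dataset_names = list(data.keys())
    # Fixed-length strings come back as bytes; decode and strip any padding.
    fields = [n.decode("ascii").strip() for n in data.attrs["datafield_names"]]

print(attr_names, dataset_names, fields)
```

With this layout, data.keys() contains only "Temperature", while names such as "U" and "T" show up inside the datafield_names attribute.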

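As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information?

Now comes the hard part: you will need to determine how the datafield_names match up with the Temperature array. This was decided by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe with 2 more for each time? But this doesn't work, since there are too many rows in the array. Maybe the dimensions fit in there somehow?

Lastly, here is how you get access to each of those pieces of information (continuing from before):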

# Get the temperature array (I can't remember if the 3 sets of colons is required, but try it and if not just use one)
temp_array = data["Temperature"][:,:,:]
# Get all of the datafield_names (list of strings of length 62)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (list of integers of length 4)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (list of floats of length 2)
time_variables = data.attrs["time_variables"]

# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
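
One practical note: a 256 x 512 x 1024 array of 64-bit floats is 1 GiB, so reading it all at once may be more memory than you want to spend. A possible alternative is to reduce it slab by slab along the first axis. The sketch below assumes nothing about the real file; `chunked_min_max` and `step` are hypothetical names, and a small numpy array stands in for data["Temperature"], since an h5py dataset supports the same slicing and `.shape`.

```python
import numpy as np

def chunked_min_max(dset, step=64):
    """Return (min, max) of dset, reading `step` slices of axis 0 at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, dset.shape[0], step):
        block = dset[start:start + step]  # only this slab is held in memory
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return float(lo), float(hi)

# A small numpy array stands in for data["Temperature"] here.
stand_in = np.arange(24, dtype="f8").reshape(4, 3, 2)
result = chunked_min_max(stand_in, step=3)
print(result)
```

With the real file you would pass data["Temperature"] itself (not data["Temperature"][:]) so that only one slab is read per iteration.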

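Sorry I couldn't be of more help, but without actually having the file and knowing what each field means, this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (the h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array, you should be able to do what you want. Good luck, and I'll help more if I can.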

score 0 · Accepted Answer

Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:

import numpy

maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'

Note the ':'. It is needed to convert the data to a numpy array.
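
A quick way to see what the ':' does is to compare the types before and after slicing. This is just a sketch on a throwaway file; the path and the values stored in "U" are made up.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway file with a made-up "U" dataset.
path = os.path.join(tempfile.mkdtemp(), "example.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("U", data=np.array([1, 5, 4]))

with h5py.File(path, "r") as f:
    dset = f["U"]   # h5py Dataset, still backed by the file
    arr = dset[:]   # numpy ndarray, copied into memory
    kinds = (type(dset).__name__, type(arr).__name__)

# The copy stays usable after the file is closed; the Dataset handle does not.
print(kinds, np.min(arr), np.max(arr))
```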

score 0 · Accepted Answer

You can call min and max on a DataFrame (along the rows, axis 0, which gives one value per column):

In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))

In [2]: df
Out[2]: 
   U  T
0  1  6
1  5  2
2  4  3

In [3]: df.min(0)
Out[3]: 
U    1
T    2

In [4]: df.max(0)
Out[4]: 
U    5
T    6

score 0 · Accepted Answer

Do you mean data.attrs rather than data itself? If so,

import h5py

with h5py.File("myfile.h5", "w") as the_file:
    dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
    dset.attrs['U'] = (0,1,2,3)
    dset.attrs['T'] = (2,3,4,5)    

with h5py.File("myfile.h5", "r") as the_file:
    data = the_file["MyDataset"]
    print({key:(min(value), max(value)) for key, value in data.attrs.items()})

yields

{u'U': (0, 3), u'T': (2, 5)}
