How to use HDF5 files in Python


HDF5 allows you to efficiently store large amounts of data

When working with large amounts of data, whether experimental or simulated, storing them in several text files is not very efficient. Sometimes you need to access a particular subset of the data, and you want to do it quickly. In these situations, HDF5 solves both problems thanks to a highly optimized underlying library. HDF5 is widely used in scientific environments and has an excellent implementation in Python, designed to work with NumPy out of the box.

The HDF5 format supports files of any size, and each file has an internal structure that allows you to search for specific datasets. This can be thought of as a single file with its own hierarchical structure, much like a set of folders and subfolders. By default, data is stored in binary format, and the library is compatible with different data types. One of the most important features of the HDF5 format is that it allows you to attach metadata to every element of the structure, which makes it ideal for creating self-contained files.


In Python, an interface to the HDF5 format is provided by the h5py package. One of the most interesting features of this package is that data is read from the file only when necessary. Imagine that you have a very large array that does not fit into your available RAM. For example, the array could have been generated on a computer with different specifications than the one you use for data analysis. The HDF5 format lets you choose which elements of the array to read, with a syntax equivalent to NumPy. You can then work with data stored on the hard disk rather than in RAM, without significant changes to the existing code.

In this article we will look at how you can use h5py to store and retrieve data from a hard disk. We will discuss various ways to store data and how to optimize the reading process. All examples that appear in this article are also available in our GitHub repository.

Installation

The HDF5 format is maintained by the HDF Group, and it is based on open-source standards, which means that your data will always be accessible even if the group disappears. Python support is provided through the h5py package, which can be installed via pip. Remember that it is good practice to use a virtual environment for these tests:

pip install h5py 

This command will also install NumPy if it is not already in your environment.

If you are looking for a graphical tool to examine the contents of your HDF5 files, you can install HDF5 Viewer. It is written in Java, so it should work on almost any computer.

Basic data storage and reading

Let's move on to using the HDF5 library. We will create a new file and save a random NumPy array into it.

import h5py
import numpy as np

arr = np.random.randn(1000)

with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", data=arr)

The first few lines are pretty simple: we import the h5py and NumPy packages and create an array with random values. We open the file random.hdf5 in write mode, w, which means that if a file with that name already exists, it will be overwritten. If you want to keep the file's existing contents and still be able to write to it, open it with the a mode instead of w. We then create a dataset named default and set its data to the random array created earlier. Datasets are the holders of our data, essentially the building blocks of the HDF5 format.
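As an aside, here is a minimal sketch of the append mode mentioned above: we reopen the same file with a and add a second dataset next to default (the dataset name "second" is just an illustration):

import h5py
import numpy as np

arr2 = np.random.randn(500)

# Open the existing file in append mode: the "default" dataset is preserved.
with h5py.File('random.hdf5', 'a') as f:
    f.create_dataset("second", data=arr2)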

Note

If you are not familiar with the with statement, I should point out that it is a convenient way to open and close files. Even if an error occurs inside the with block, the file will be closed. If for some reason you do not use with, never forget to call f.close() at the end. The with statement works with any file, not just HDF5 files.
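A rough equivalent of the with block, shown only for illustration:

import h5py

f = h5py.File('random.hdf5', 'r')
try:
    data = f['default'][:]
finally:
    f.close()  # always executed, even if an error occurred above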

We can read the data back in almost the same way as we would read a NumPy file:

with h5py.File('random.hdf5', 'r') as f:
    data = f['default']
    print(min(data))
    print(max(data))
    print(data[:15])

We open the file in read mode, r, and retrieve the data by directly accessing the dataset named default. If you open a file and do not know which datasets are available, you can list them:

for key in f.keys():
    print(key)

Once you have read the dataset you wanted, you can use it as if it were any NumPy array. For example, you can find its maximum and minimum values or select the first 15 values of the array. These simple examples, however, hide many things that happen under the hood, and they need to be discussed in order to understand the full potential of HDF5.
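A quick check of the types involved makes the distinction explicit (a minimal sketch, assuming the file created above):

with h5py.File('random.hdf5', 'r') as f:
    data = f['default']
    print(type(data))       # an h5py Dataset, not a NumPy array
    print(type(data[:10]))  # slicing returns a real numpy.ndarray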

In the example above, you can use data as if it were an array. For example, you can refer to the third element with data[2], or take a range with data[1:3]. Note, however, that data is not an array, it is a dataset. You can see this by typing print(type(data)). Datasets work very differently from arrays, because their information is stored on the hard disk and is not loaded into RAM until we actually use it. The following code, for example, will not work:

f = h5py.File('random.hdf5', 'r')
data = f['default']
f.close()
print(data[1])

The error that appears is a bit cumbersome, but the last line is very useful:

ValueError: Not a dataset

The error means that we are trying to access a dataset to which we no longer have access. This is a bit confusing, but it happens because we closed the file and are therefore no longer allowed to read values from it. When we assigned f['default'] to the variable data, we did not actually read the data from the file; instead, we generated a pointer to where the data lives on the hard disk. On the other hand, this code will work:

f = h5py.File('random.hdf5', 'r')
data = f['default'][:]
f.close()
print(data[10])

Please note that the only difference is that we added [:] after reading the dataset. Many other tutorials stop at examples like these, without ever demonstrating the full potential of the HDF5 format with the h5py package. Based on the examples we have reviewed so far, you might be wondering: why use HDF5 at all, if saving NumPy files gives you the same functionality? Let's dive into the features of the HDF5 format.

Selective reading from HDF5 files

So far, we have seen that when we open a dataset we do not yet read the data from the disk; instead, we create a link to a specific place on the hard disk. We can see what happens if, for example, we explicitly read the first 10 elements of a dataset:

with h5py.File('random.hdf5', 'r') as f:
    data_set = f['default']
    data = data_set[:10]

print(data[1])
print(data_set[1])

We split the code into several lines to make it more explicit, but you can be more compact in your own projects. In the lines above, we first open the file and then open the default dataset. We assign the first 10 items of the dataset to the variable data. After the file is closed (when the with block ends), we can still access the values stored in data, but data_set will raise an error. Note that we only read from disk when we explicitly access the first 10 items of the dataset. If you look at the types of data and data_set, you will see that they really are different: the first is a NumPy array, and the second is an h5py Dataset.

The same behavior applies in more complex scenarios. Let's create a new file, this time with two datasets, and select the elements of one based on the elements of the other. Start by creating a new file and storing the data; this part is the simplest:

import h5py
import numpy as np

arr1 = np.random.randn(10000)
arr2 = np.random.randn(10000)

with h5py.File('complex_read.hdf5', 'w') as f:
    f.create_dataset('array_1', data=arr1)
    f.create_dataset('array_2', data=arr2)

We have two datasets called array_1 and array_2, each containing a random NumPy array. We want to read those values of array_2 that correspond to the elements where the values of array_1 are positive. We can try to do something like this:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1 > 0]

but it will not work. d1 is a dataset and cannot be compared with an integer. The only way is to actually read the data from the disk and then compare it. So we end up with something like this:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = d2[d1[:] > 0]

The first dataset, d1, is fully loaded into memory when we do d1[:], while from the second dataset, d2, we take only some elements. If the d1 dataset were too large to load into memory as a whole, we could work inside a loop instead:

with h5py.File('complex_read.hdf5', 'r') as f:
    d1 = f['array_1']
    d2 = f['array_2']
    data = []
    for i in range(len(d1)):
        if d1[i] > 0:
            data.append(d2[i])
    print('The length of data with a for loop: {}'.format(len(data)))

Of course, reading element by element and appending to a list is not efficient, but this is a very good example of one of the biggest advantages of using HDF5 over text or NumPy files. Inside the loop we load only one element into memory. In our example each element is just a number, but it could be anything from text to an image or a video.

As always, depending on your application, you will have to decide whether to read the entire array into memory or not. Sometimes you run simulations on a particular computer with a large amount of memory, but your laptop does not have the same specifications, and you are forced to read chunks of your data. Remember that reading from a hard disk is relatively slow, especially if you use HDDs instead of SSDs, and even slower if you read from a network drive.

Selective writing to HDF5 files

In the examples above, we added data to the dataset the moment it was created. For many applications, however, you need to save data while it is being generated. HDF5 allows you to save data in almost the same way as you read it. Let's look at how to create an empty dataset and add some data to it.

arr = np.random.randn(100)

with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset[10:20] = arr[50:60]

The first two lines are the same as before, except for create_dataset. We do not add data when the dataset is created; we simply create an empty dataset able to hold up to 1000 elements. Following the same logic as before, when we read only certain elements from a dataset, we actually write to disk only when we assign values to certain elements of the dset variable. In the example above, we assign values only to a subset of the array, with indices from 10 to 19.

Warning

It is not exactly true that data is written to disk the moment you assign values to a dataset. The exact timing depends on several factors, including the state of the operating system. If the program exits too early, it may happen that not everything is written. It is very important to always use the close() method, and if you write in stages, you can also use flush() to force a write. Using with prevents many writing problems.
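A quick way to check what actually ended up on disk is to read the file back; a minimal sketch, assuming the file written above:

with h5py.File('random.hdf5', 'r') as f:
    dset = f['default']
    print(dset[:20])  # zeros everywhere except indices 10 to 19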

If you read the file and print the first 20 values of the dataset, you will see that they are all zeros except for indices 10 to 19. There is also a common mistake that can cause a real headache. The following code will not save anything to disk:

arr = np.random.randn(1000)

with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset = arr

This mistake causes a lot of problems, because you will not realize that nothing was written until you try to read the result. The problem is that you do not specify where you want to store the data; you simply overwrite the dset variable with a NumPy array. Since the dataset and the array have the same length, you should use dset[:] = arr. This error happens more often than you might think, and since it is not technically incorrect, you will not see any errors in the terminal, and your data will be all zeros.
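For reference, the corrected version of the snippet above:

arr = np.random.randn(1000)

with h5py.File('random.hdf5', 'w') as f:
    dset = f.create_dataset("default", (1000,))
    dset[:] = arr  # writes the whole array into the dataset on disk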

Until now we have always worked with one-dimensional arrays, but we are not limited to them. For example, suppose we want to use a 2D array; we can simply do:

 dset = f.create_dataset('default', (500, 1024)) 

which allows us to store data in a 500x1024 array. To write to the dataset, we can use the same syntax as before, but taking the second dimension into account:

dset[1, 2] = 1
dset[200:500, 500:1024] = 123

Specify data types for space optimization

So far we have only scratched the surface of what HDF5 has to offer. In addition to the shape of the data you want to save, you can specify the data type in order to optimize the space. The h5py documentation contains a list of all supported types; here we show only a couple of them. At the same time, we will store several datasets in a single file.

with h5py.File('several_datasets.hdf5', 'w') as f:
    dset_int_1 = f.create_dataset('integers', (10,), dtype='i1')
    dset_int_8 = f.create_dataset('integers8', (10,), dtype='i8')
    dset_complex = f.create_dataset('complex', (10,), dtype='c16')

    dset_int_1[0] = 1200
    dset_int_8[0] = 1200.1
    dset_complex[0] = 3 + 4j

In the example above, we created three different datasets, each with a different type: 1-byte integers, 8-byte integers, and 16-byte complex numbers. We store only one number, even though our datasets can hold up to 10 elements. You can read the values back and see what was actually saved. Note that the 1-byte integer is clipped to 127 (instead of 1200), and the 8-byte integer is truncated to 1200 (instead of 1200.1).
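A minimal sketch of reading these values back, assuming the file from the previous snippet:

with h5py.File('several_datasets.hdf5', 'r') as f:
    print(f['integers'][0])   # 127: clipped to the 1-byte range, as noted above
    print(f['integers8'][0])  # 1200: the fractional part was dropped
    print(f['complex'][0])    # (3+4j)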

If you have ever programmed in languages like C or Fortran, you probably know what the different data types mean. However, if you have always worked with Python, you may never have run into problems from not explicitly declaring the type of data you work with. It is important to remember that the number of bytes tells you how many different numbers you can store. If you use 1 byte, you have 8 bits and can therefore store 2^8 different numbers. In the example above, the integers can be positive, negative, or zero, so with 1-byte integers you can store values from -128 to 127, for a total of 2^8 possible numbers. The same applies to 8-byte integers, but with a much larger range of numbers.
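If you want to check the exact ranges, NumPy can print them for you (a small sketch):

import numpy as np

print(np.iinfo(np.int8))   # min = -128, max = 127
print(np.iinfo(np.int64))  # roughly -9.2e18 to 9.2e18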

The data type you select also affects the file size. Let's see how this works with a simple example: we create three files, each with one dataset for 100,000 elements but with a different data type. We store the same data in each and then compare their sizes. We create a single random array and assign it to each dataset so that the space is actually filled. Remember that the data will be converted to the format specified by the dataset.

arr = np.random.randn(100000)

f = h5py.File('integer_1.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i1')
d[:] = arr
f.close()

f = h5py.File('integer_8.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='i8')
d[:] = arr
f.close()

f = h5py.File('float.hdf5', 'w')
d = f.create_dataset('dataset', (100000,), dtype='f16')
d[:] = arr
f.close()

When you check the size of each file, you will get something like:

File        Size (b)
integer_1    102144
integer_8    802144
float       1602144
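You can reproduce these numbers with os.path.getsize; the exact values may differ slightly between systems (a minimal sketch, assuming the files above are in the working directory):

import os

for name in ['integer_1.hdf5', 'integer_8.hdf5', 'float.hdf5']:
    print(name, os.path.getsize(name), 'bytes')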

The relationship between size and data type is quite clear. When you go from 1-byte to 8-byte integers, the file size increases roughly 8 times; similarly, with 16 bytes it takes about 16 times more space. But space is not the only factor to consider; you must also account for the time required to write the data to disk. The more you have to write, the longer it takes. Depending on your application, optimizing reading and writing can be crucial.

Please note that using the wrong data type can also mean losing information. For example, if you have 8-byte integers and store them as 1-byte integers, their values will be truncated. In the laboratory, it is quite common to have devices that produce different types of data: some DAQ cards have 16 bits, some cameras work with 8 bits, and others may work with 24. It is important to pay attention to data types, but this is also something Python developers may overlook, because you normally do not need to declare types explicitly.

It is also worth remembering that by default a NumPy array is initialized with 8-byte (64-bit) floats per element. This can be a problem if, for example, you initialize an array with zeros to hold data that should only take 2 bytes. The type of the array itself will not change, and if you save the data when creating a dataset (by passing data=my_array), the default format will be "f8", i.e. the type of the array, not the type of the real data.
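In such cases you can force the dtype when creating the dataset; a minimal sketch, where the file name is only an illustration:

arr = np.zeros(10000)  # float64 ('f8') by default

with h5py.File('dtype_example.hdf5', 'w') as f:
    # Store the values as 2-byte integers instead of the array's own 'f8'.
    d = f.create_dataset('default', data=arr, dtype='i2')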

Thinking about data types is not something you do regularly if you work with Python on simple applications. Still, you should know that data types exist and how they can affect your results. You may have large hard drives and not care much about file size, but when you care about the speed at which you save, there is no way around optimizing every aspect of your code, including data types.

Data compression

When saving data, you can choose to compress it using different algorithms. The h5py package supports several compression filters, such as GZIP, LZF, and SZIP. When one of the compression filters is used, the data is processed on its way to disk and decompressed when read. Therefore, nothing special changes in the code. We can repeat the same experiment, saving different data types but using a compression filter. Our code looks like this:

import h5py
import numpy as np

arr = np.random.randn(100000)

with h5py.File('integer_1_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i1', compression="gzip", compression_opts=9)
    d[:] = arr

with h5py.File('integer_8_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='i8', compression="gzip", compression_opts=9)
    d[:] = arr

with h5py.File('float_compr.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100000,), dtype='f16', compression="gzip", compression_opts=9)
    d[:] = arr

We chose gzip because it is supported on all platforms. The compression_opts parameter sets the compression level: the higher the level, the less space the data occupies, but the longer the processor has to work. The default level is 4. We can see the differences between our files depending on the compression level:

Type        Without compression    Compression 9    Compression 4
integer_1                102144            28016            30463
integer_8                802144            43329            57971
float                   1602144          1469580          1469868

The impact of compression on the integer datasets is much more noticeable than on the floating-point dataset. I leave it to you to figure out why compression worked so well in the first two cases but not in the last one. As a hint: check what data you are actually saving.

Reading compressed data does not require any change to the code described above. The underlying HDF5 library takes care of extracting the data from compressed datasets with the appropriate algorithm. Therefore, if you add compression when saving, you do not need to change the code you use for reading.
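For example, reading one of the compressed files from above looks exactly like reading an uncompressed one:

with h5py.File('integer_1_compr.hdf5', 'r') as f:
    data = f['dataset'][:]  # decompression happens transparently
    print(data[:10])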

Data compression is one more tool to consider alongside all the other aspects of data handling. You need to weigh the extra processor time against the achieved compression to judge whether it benefits your own application. The fact that it is transparent to downstream code makes it incredibly easy to test and to find the optimal solution.

Resizing datasets

When you are working on an experiment, it is sometimes impossible to know in advance how big your data will be. Imagine that you are recording a movie: perhaps you stop it after one second, perhaps after an hour. Fortunately, HDF5 allows you to resize datasets on the fly with little computational overhead. A dataset can be expanded up to a maximum size, which is specified when it is created using the maxshape keyword:

import h5py
import numpy as np

with h5py.File('resize_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (100,), maxshape=(500,))
    d[:100] = np.random.randn(100)
    d.resize((200,))
    d[100:200] = np.random.randn(100)

with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])

First you create a dataset to hold 100 values and set its maximum size to 500 values. After you have saved the first batch of values, you can expand the dataset to store the next 100, and you can repeat the procedure until you reach a dataset with 500 values. The same approach works for N-dimensional datasets; you simply specify a maxshape with the appropriate number of dimensions.

Resizing also works after the file has been saved, closed, and reopened. Just remember to open the file in append mode, a (not w, which would erase its contents):

with h5py.File('resize_dataset.hdf5', 'a') as f:
    dset = f['dataset']
    dset.resize((300,))
    dset[:200] = 0
    dset[200:300] = np.random.randn(100)

with h5py.File('resize_dataset.hdf5', 'r') as f:
    dset = f['dataset']
    print(dset[99])
    print(dset[199])
    print(dset[299])

As you can see, the dataset now holds 300 values: we set the first 200 to zero and filled indices 200 to 299 with random numbers. When you resize a dataset, the values already stored on disk are kept unless you overwrite them explicitly.

Resizing is not limited to one-dimensional datasets; it works along any dimension. Imagine, for example, that you are acquiring a movie: each frame is a 2D array, and you do not know in advance how many frames you will record. With HDF5 you can store each frame as a slice of a 3D dataset and grow it along the third dimension as new frames arrive. For example:

with h5py.File('movie_dataset.hdf5', 'w') as f:
    d = f.create_dataset('dataset', (1024, 1024, 1), maxshape=(1024, 1024, None))
    d[:, :, 0] = first_frame
    d.resize((1024, 1024, 2))
    d[:, :, 1] = second_frame

We store a first 1024x1024 frame, then resize the dataset and add a second one. When you do not know in advance how many frames you will acquire, you can set the corresponding dimension of maxshape to None, which makes it unlimited.

Chunked storage (Chunks)

The way data is laid out on disk affects performance. With chunked storage, a dataset is split into chunks of a fixed shape, and each chunk is stored contiguously at its own location on disk. Whenever you read or write any element of a chunk, the entire chunk is read or written. To enable chunked storage, specify the chunks keyword when creating the dataset:

 dset = f.create_dataset("chunked", (1000, 1000), chunks=(100, 100)) 

With this setting, the elements of dset[0:100, 0:100] are stored together on disk, and the same holds for dset[200:300, 200:300] or dset[100:200, 400:500]: each such block is read and written as a unit. Chunking has performance implications; the h5py documentation recommends the following:

Chunking has performance implications. It is recommended to keep the total size of your chunks between 10 KiB and 1 MiB, larger for larger datasets. Also keep in mind that when any element in a chunk is accessed, the entire chunk is read from disk.

There is also auto-chunking, which selects a chunk shape for you automatically. Auto-chunking is enabled by default if you use compression or specify maxshape. It can be requested explicitly like this:

 dset = f.create_dataset("autochunk", (1000, 1000), chunks=True) 

Groups

So far we have stored every dataset at the root of the file. HDF5 also lets you organize the data hierarchically, much like folders on a disk. Groups can contain datasets as well as other groups. Let's create a group with a sub-group and put a dataset in each of them:

import numpy as np
import h5py

arr = np.random.randn(1000)

with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    gg = g.create_group('Sub_Group')
    d = g.create_dataset('default', data=arr)
    dd = gg.create_dataset('default', data=arr)

We create a group called Base_Group and, inside it, a second one called Sub_Group. In each of them we create a dataset called default and store the random array in it. To read the data back, we address each dataset by its path inside the file:

with h5py.File('groups.hdf5', 'r') as f:
    d = f['Base_Group/default']
    dd = f['Base_Group/Sub_Group/default']
    print(d[1])
    print(dd[1])

As you can see, datasets are addressed by paths such as Base_Group/default and Base_Group/Sub_Group/default, just like files in nested folders. If you open a file whose structure you do not know, you need a way to explore it. The simplest one is keys():

with h5py.File('groups.hdf5', 'r') as f:
    for k in f.keys():
        print(k)

However, keys() only lists the members at the top level; to go deeper you would have to nest for loops for every sub-group. A more convenient option is the visit() method, which walks through the whole file:

def get_all(name):
    print(name)

with h5py.File('groups.hdf5', 'r') as f:
    f.visit(get_all)

Here get_all is a function that takes a single argument, name. When we call visit, it invokes get_all for every object in the file. visit keeps iterating as long as the function returns None; as soon as it returns something else, the iteration stops and that value is passed back. If, for example, we are looking for Sub_Group, we can make get_all return the name once it finds it:

def get_all(name):
    if 'Sub_Group' in name:
        return name

with h5py.File('groups.hdf5', 'r') as f:
    g = f.visit(get_all)
    print(g)

As mentioned, visit stops as soon as the callback returns something other than None and hands that value back to the caller. Because get_all returns the name when it contains Sub_Group, the variable g ends up holding the path to Sub_Group. We can then use that path to get the group object itself:

with h5py.File('groups.hdf5', 'r') as f:
    g_name = f.visit(get_all)
    group = f[g_name]

There is an even more convenient option: the visititems method, which works like visit but calls a function with two arguments, name and object. That way you get the object directly, without having to look it up by name afterwards:

def get_objects(name, obj):
    if 'Sub_Group' in name:
        return obj

with h5py.File('groups.hdf5', 'r') as f:
    group = f.visititems(get_objects)
    data = group['default']
    print('First data element: {}'.format(data[0]))

With visititems we obtain the group object directly and can read its default dataset while the file is still open. Remember that, as before, the returned object is only a reference to data on disk, so it is valid only while the file remains open. Groups, combined with meaningful names, are a simple way to keep related data together and make a file self-describing.

Storing metadata in HDF5

One of the things that makes HDF5 stand out is the ability to attach metadata to any group or dataset. When you perform a measurement or a simulation, you want to record not only the data itself but also the conditions under which it was acquired: the user, the date, the instrument settings, and so on. Without this information, a dataset of, say, 200x300x250 values is hard to interpret later; the context is what makes the numbers meaningful.

HDF5 lets you store this metadata right next to the data, as attributes of files, groups, and datasets. Let's see how it works.

import time
import numpy as np
import h5py
import os

arr = np.random.randn(1000)

with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    g.attrs['Date'] = time.time()
    g.attrs['User'] = 'Me'
    d.attrs['OS'] = os.name

    for k in g.attrs.keys():
        print('{} => {}'.format(k, g.attrs[k]))

    for j in d.attrs.keys():
        print('{} => {}'.format(j, d.attrs[j]))

As you can see, attributes are stored in attrs, which behaves very much like a dictionary. Attributes can be attached to the file itself, to groups, and to datasets. If you already keep your metadata in a dictionary, you can attach all of it at once with update:

with h5py.File('groups.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    metadata = {'Date': time.time(),
                'User': 'Me',
                'OS': os.name}
    f.attrs.update(metadata)

    for m in f.attrs.keys():
        print('{} => {}'.format(m, f.attrs[m]))

Keep in mind that not every Python object can be stored as an HDF5 attribute; attributes are meant for simple values such as numbers, strings, and small arrays. If you want to store an arbitrary object, such as a dictionary, one option is to serialize it to a string, for example with JSON, and store that string in the file; pickle would also work, although it produces binary data that is not human-readable.

import json

with h5py.File('groups_dict.hdf5', 'w') as f:
    g = f.create_group('Base_Group')
    d = g.create_dataset('default', data=arr)

    metadata = {'Date': time.time(),
                'User': 'Me',
                'OS': os.name}
    m = g.create_dataset('metadata', data=json.dumps(metadata))

Here we store the metadata in its own dataset, next to the data it describes. json.dumps converts the dictionary into a string, which HDF5 has no trouble storing. To read it back, we convert the string into a dictionary again with json.loads:

with h5py.File('groups_dict.hdf5', 'r') as f:
    metadata = json.loads(f['Base_Group/metadata'][()])
    for k in metadata:
        print('{} => {}'.format(k, metadata[k]))

JSON is just one option; you could equally well use YAML, XML, or any other serialization format. Whenever it is possible, though, it is better to use plain attributes through attrs, because they keep the metadata directly attached to the data and readable by any HDF5 tool.
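For completeness, reading attributes back is just as straightforward (a small sketch, assuming the file written with attrs.update() above):

with h5py.File('groups.hdf5', 'r') as f:
    print(dict(f.attrs))  # e.g. {'Date': ..., 'OS': ..., 'User': 'Me'}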

Final thoughts on HDF5

Deciding how to store your data is worth some thought, because it affects how you will work with it later. HDF was designed with large scientific datasets in mind: it handles huge numerical arrays, hierarchical organization, and metadata in a single, self-contained file. If that is what you need, HDF is an excellent choice.

HDF5 is not a database, however. If your workflow relies on relational queries, on many small records, or on concurrent access by several users, a proper database may serve you better. That said, if you are used to SQL, it is worth looking at HDFql, a project that lets you work with HDF5 files using an SQL-like language.

As with any tool, the best way to judge it is to try it on your own data. Think about how you will read the data back, which subsets you will need, and how much metadata is required to make the files understandable in the future; these questions will tell you how to structure your files.

HDF5 is a powerful format for storing large amounts of structured numerical data. It combines efficient selective access, compression, resizable datasets, hierarchical organization, and embedded metadata. HDF5 is well worth considering whenever plain text or NumPy files start to fall short.


Source: https://habr.com/ru/post/416309/

