Why use HDF5 as the primary backend for NWB?
HDF5 has three main features that make it ideal as our primary backend:
1. HDF5 has sophisticated structures for organizing complex hierarchical data and metadata, which is critical for handling the complexity and diversity of neurophysiology metadata.
HDF5 is one of the few standards that supports the four data primitives of the HDMF schema language: Group, Dataset, Attribute, and Link. Each of these structures is essential for fully representing critical metadata. Groups allow NWB to organize information hierarchically. Datasets allow NWB to store the large blocks of measurement time series data. Attributes allow those datasets to be annotated with the metadata necessary for reanalysis, such as the unit of measurement and the conversion factor. Links allow us to store data efficiently without duplication and to create formal relationships between data elements.
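The four primitives can be illustrated with a minimal sketch using h5py, the Python interface to HDF5. The file, group, dataset, and attribute names below are hypothetical and do not follow the NWB schema; the file is held in memory, so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory HDF5 file; all names here are made up for illustration.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # Group: organizes information hierarchically.
    acquisition = f.create_group("acquisition")
    series = acquisition.create_group("voltage_series")

    # Dataset: holds a block of measurement data.
    data = series.create_dataset("data", data=np.arange(1000, dtype="int16"))

    # Attribute: metadata needed to interpret the dataset.
    data.attrs["unit"] = "volts"
    data.attrs["conversion"] = 1e-6

    # Link: additional paths to the same dataset, without copying the data.
    f["analysis_input"] = data  # hard link
    f["analysis_soft"] = h5py.SoftLink("/acquisition/voltage_series/data")

    # Both links resolve to the original dataset and its attributes.
    unit = f["analysis_input"].attrs["unit"]
    soft_sum = int(f["analysis_soft"][:10].sum())
```

Reading through either link returns the same values and attributes as the original path, which is what lets a file refer to one dataset from several places without storing it twice.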
2. HDF5 is a mature standard with support in a plethora of programming languages and multiple storage backends.
Another important feature of HDF5 is that files can be stored in and read from different storage backends. A newer driver, “ros3”, allows HDF5 files to be opened, read, and streamed directly from an S3 bucket; S3 is a widely used cloud object storage service.
3. HDF5 supports random access of chunked and compressed datasets, which is critical for handling the volume of data.
As recordings enter the TB scale, it is essential that we use a backend storage solution that supports both compression and random access. When large datasets are saved to disk, it is best to use lossless compression, which leverages patterns in the data to reduce the file size without changing the data values. HDF5 natively supports compressing datasets on write and decompressing datasets on read using GZIP (like “unarchiving” a file downloaded from the internet). Another important feature for large datasets is random access, which means that you can access any value within a dataset without reading all of the values. If GZIP were applied to the entire dataset at once, you would have to decompress the entire dataset to read any part of it, which would eliminate random access. HDF5 solves this problem by first splitting large datasets into “chunks” and compressing each chunk individually. This way, when values from a particular region of the dataset are requested, only the chunks that contain the requested data need to be decompressed. HDF5 has a sophisticated infrastructure for managing chunks and applying compression and decompression, so users who read the data do not have to deal with these lower-level concerns.
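The idea behind chunk-wise compression can be sketched with the Python standard library alone. This is not how HDF5 is implemented internally; it is a simplified illustration in which the chunk length, the helper names, and the synthetic data are all assumptions made for the example.

```python
import struct
import zlib

CHUNK_LEN = 1000  # values per chunk (an arbitrary choice for this sketch)


def compress_chunks(values, chunk_len=CHUNK_LEN):
    """Split values into fixed-size chunks and compress each one separately."""
    chunks = []
    for start in range(0, len(values), chunk_len):
        block = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(block)}h", *block)  # little-endian int16
        chunks.append(zlib.compress(raw))
    return chunks


def read_value(chunks, index, chunk_len=CHUNK_LEN):
    """Return one value, decompressing only the chunk that contains it."""
    chunk_id, offset = divmod(index, chunk_len)
    raw = zlib.decompress(chunks[chunk_id])
    return struct.unpack_from("<h", raw, offset * 2)[0]


# Synthetic, highly repetitive "recording" of 10,000 int16 samples.
values = [(i % 50) - 25 for i in range(10_000)]
chunks = compress_chunks(values)

value = read_value(chunks, 7_321)  # touches only chunk 7 of 10
raw_size = 2 * len(values)
compressed_size = sum(len(c) for c in chunks)
```

Here a single read decompresses one chunk rather than the whole array. Smaller chunks make individual reads cheaper but usually compress less effectively, so the chunk size is a trade-off that depends on how the data will be accessed.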
These features have proven to be very important for archiving large datasets. For instance, for raw data from Neuropixels recordings, compression has been found to reduce the file size by up to 60%. As datasets grow in volume and in number, it will become increasingly important to apply good data engineering principles to manage them at scale.
Below, we briefly explain the pros and cons of alternative backends. Depending on the particular application and storage needs, a different backend may be preferable. In particular, as part of HDMF, teams are exploring the use of alternative storage solutions with NWB. For the broader NWB community, we have found that HDF5 provides a good standard solution for most common use cases.
Zarr
Zarr supports compression and chunking like HDF5. Zarr is the standard we have found that comes closest to HDF5’s level of support for complex hierarchical data structures. However, Zarr does not support Links natively, and support for links is not on the Zarr development roadmap. Links are an important feature for NWB because they facilitate linking of data and metadata across complex collections of neurophysiology data products. Furthermore, Zarr only supports Python, and the neuroscience community requires APIs in MATLAB and other languages. Finally, HDF5 is a much more mature standard with a track record of long-term accessibility.
Binary files (.dat)
Binary files do not support complex hierarchical structures such as Groups, Attributes, and Links. They also do not support chunking and compression, which makes them poorly suited to handling large data files efficiently. Furthermore, the metadata needed to interpret a binary file, including shape, data type, and endianness, is stored outside the file and can be missing. Zarr is an approach that builds on binary files and addresses these limitations, using folders and JSON files to create a hierarchical structure that can manage data chunks and specify the essential parameters of the binary files. See our response to Zarr.
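To see why the missing metadata matters, consider a small sketch in which the same eight bytes are read back under three different assumptions about data type and endianness. The byte string stands in for the contents of a hypothetical .dat file.

```python
import struct

# Eight bytes, written as four little-endian int16 values.
raw = struct.pack("<4h", 1, 2, 3, 4)

as_le_int16 = struct.unpack("<4h", raw)  # correct interpretation
as_be_int16 = struct.unpack(">4h", raw)  # wrong endianness
as_le_int32 = struct.unpack("<2i", raw)  # wrong data type

print(as_le_int16)  # (1, 2, 3, 4)
print(as_be_int16)  # (256, 512, 768, 1024)
print(as_le_int32)  # (131073, 262147)
```

All three reads succeed without error, so nothing in the file itself signals which interpretation is the right one.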
Relational database (e.g. SQL)
The HDMF specification language is inherently hierarchical, not tabular, and we need a storage layer that can express the hierarchical nature of the data as well. There are approaches for mapping between relational tables and hierarchical structures, such as object-relational mappers, but this is not as good a solution as using a storage layer that is hierarchical by nature.
While we think relational databases are not ideal as an NWB backend, we do recognize that they can be a powerful choice for storing scientific data because they enforce formal relationships between data and enable flexible, complex queries. If you are interested in using relational databases for neuroscience research, we recommend exploring DataJoint, an open-source framework for programming scientific databases and computational workflows, with APIs in MATLAB and Python. DataJoint Elements is a collection of curated modules for assembling workflows for the major modalities of neurophysiology experiments. The NWB team is collaborating with DataJoint to build import/export functionality between DataJoint Elements and NWB files. For labs interested in leveraging the benefits of both relational databases and NWB, using DataJoint internally and using NWB to archive and share data could provide the best of both worlds.