Review-8
University Research Organization
Review of HDF5 operational readiness:
NASA's Earth Science Data Systems Standards Process Group (SPG) is considering HDF5 for adoption as a community standard. This is the second review of HDF5 and focuses on its readiness for operational use. The questions below are provided to guide feedback from data systems, application providers, instrument teams, and others. You only need to answer the questions applicable to you. Please send comments to spg-rfc-007@lists.nasa.gov.
- Describe in a sentence or two your overall experience related to HDF5 (e.g., science data provider, science data systems, software tools developer, and science data user, etc).
I have used HDF5 both as a science data provider and as an end user.
- Do you currently use or plan to use HDF5 in a production setting? What types of applications do you use with HDF5? Is HDF5 applicable to your applications (e.g., Does it work well with the data types and data manipulations in your application?)
Yes, my team uses HDF5 in both Fortran and C++ production code. HDF5 is a great format for storing and distributing our science data. We have been able to store all of the data in the organizational structure that makes sense for it, without making any concessions on the ideal structure. We have also found HDF5 files easy to use in tools such as IDL and MATLAB.
- Why do you choose to use HDF5 over other data formats for your applications?
- The ability to store data compactly, yet allow it to be read on any platform.
- The ability to add information about the data through the naming of fields and attributes.
- The ability to read in only the data required and write out only the data which has changed.
- The ability to compress data fields and have them decompressed automatically by the HDF5 library. End users do not need to do anything special to read compressed data.
- The ability to organize the storage of data in structures which are meaningful for the data.
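Several of the abilities listed above can be illustrated with a minimal sketch using Python's h5py bindings. This is only an illustration under stated assumptions: the production code described in this review is Fortran and C++, and the group name, field name, attribute values, and array sizes below are hypothetical rather than taken from any real data product.

```python
import os
import tempfile

import h5py
import numpy as np


def write_example(path):
    """Write a group holding one compressed dataset with descriptive attributes."""
    with h5py.File(path, "w") as f:
        grp = f.create_group("retrievals")  # hypothetical group name
        temps = np.linspace(180.0, 300.0, 1000 * 50).reshape(1000, 50)
        dset = grp.create_dataset(
            "temperature",                  # hypothetical field name
            data=temps,
            chunks=(100, 50),               # compression requires chunked storage
            compression="gzip",
        )
        # Attributes carry information about the data alongside the data.
        dset.attrs["units"] = "K"
        dset.attrs["long_name"] = "Retrieved temperature"


def read_slice(path):
    """Read only the first ten rows; decompression happens inside the library."""
    with h5py.File(path, "r") as f:
        dset = f["retrievals/temperature"]
        units = dset.attrs["units"]
        first_rows = dset[0:10, :]          # only this slice is read into memory
    return units, first_rows


def update_row(path, row, values):
    """Overwrite a single row in place without rewriting the rest of the file."""
    with h5py.File(path, "r+") as f:
        f["retrievals/temperature"][row, :] = values


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "example.h5")
    write_example(demo_path)
    units, rows = read_slice(demo_path)
    print(units, rows.shape)
    update_row(demo_path, 0, np.zeros(50))
```

The reader never mentions compression: the slice comes back as an ordinary array, which is the behavior the fourth bullet describes.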
- Have you or your users encountered any difficulty when using some of the data access or visualization tools (e.g., IDL, GrADS) on HDF5 data files? If you have, please provide a brief description of your experience.
Yes, we once had a problem using IDL to read HDF5 files. The problem occurred when an HDF5 bug fix changed the internal file format, so that older versions of the HDF5 library could not read newly created files. Because IDL bundles its own copy of the HDF5 library, and that copy was an older version, IDL could not read files created with the more recent library. The only solution was to wait for an IDL release that included the then-current library. Since HDF5 does not usually have backward compatibility problems, the lag between HDF5 releases and the version bundled in IDL is usually not an issue.
- Does the performance of HDF5 you have experienced meet your requirements? (e.g., Can it handle the data types in your applications? Does it take a long time to read and write HDF5 files?)
Yes, HDF5 meets our requirements both in its ability to handle our data types and in its performance. We can write our data files using HDF5's internal compression without a significant impact on processing speed.
That said, it is easy to create a file with unacceptable I/O performance. This can occur when data is written in small pieces, and performance degrades even further if compression is being used. First-time users can fall into this trap fairly easily. One of my first files had this problem, and a quick consultation with the HDF Group via the help desk identified the cause and led to a solution.
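The review does not say what the fix was in that case. One common way to avoid the small-write pattern is to accumulate records in memory and write one whole chunk at a time, sketched below with Python's h5py. The dataset name, sizes, and chunk shape are hypothetical, and the right chunk shape depends on how the data will be read.

```python
import os
import tempfile

import h5py
import numpy as np


def write_buffered(path, n_rows=1000, n_cols=50, rows_per_chunk=100):
    """Fill a compressed dataset one full chunk at a time instead of row by row."""
    rows_per_chunk = min(rows_per_chunk, n_rows)
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "signal",                         # hypothetical dataset name
            shape=(n_rows, n_cols),
            dtype="f4",
            chunks=(rows_per_chunk, n_cols),  # one chunk = one block of rows
            compression="gzip",
        )
        buffer = np.empty((rows_per_chunk, n_cols), dtype="f4")
        for start in range(0, n_rows, rows_per_chunk):
            stop = min(start + rows_per_chunk, n_rows)
            count = stop - start
            for i in range(count):
                # Stand-in for computing one record; here each row holds its index.
                buffer[i, :] = start + i
            # A single write per block, aligned with the chunk boundaries.
            dset[start:stop, :] = buffer[:count]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "buffered.h5")
    write_buffered(demo_path)
    with h5py.File(demo_path, "r") as f:
        print(f["signal"].chunks, f["signal"].compression)
```

Aligning each write with a chunk boundary means every chunk is compressed and written as a complete unit, rather than being touched once per record.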
- What operational challenges or limitations does HDF5 present? (e.g., Does it take a long time to learn how to use it? Does it require advanced processing power, large amounts of memory, complex configuration, etc)
HDF5 does take some time to learn, and it requires several library calls to write out even the simplest data. It is more complicated than issuing a "write" or "print" statement in a Fortran or C program. Data providers can make reading easier by supplying sample code that reads their data.
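As a rough idea of what such sample reader code might look like, here is a short sketch in Python using h5py. It is not the sample code distributed with any particular product. The functions `describe` and `read_field` are hypothetical helpers, and the demo file with its `swath/ozone` field and `units` attribute exists only so the example can run on its own.

```python
import os
import tempfile

import h5py
import numpy as np


def describe(path):
    """Return one line per group or dataset, followed by its attributes."""
    lines = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            lines.append(f"dataset {name} shape={obj.shape} dtype={obj.dtype}")
        else:
            lines.append(f"group {name}")
        for key, value in obj.attrs.items():
            lines.append(f"  @{key} = {value}")

    with h5py.File(path, "r") as f:
        f.visititems(visit)  # walks every group and dataset below the root
    return lines


def read_field(path, name):
    """Read one named dataset completely into a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[name][...]


if __name__ == "__main__":
    # Build a tiny stand-in file so the reader has something to open.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
    with h5py.File(demo_path, "w") as f:
        dset = f.create_dataset("swath/ozone", data=np.arange(4.0))
        dset.attrs["units"] = "ppmv"
    for line in describe(demo_path):
        print(line)
    print(read_field(demo_path, "swath/ozone"))
```

A listing like the one `describe` produces lets a new user see field names, shapes, and units before writing any reading code of their own.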
- What benefits does HDF5 present? Do the benefits of HDF5 outweigh the challenges? (e.g., Does it offer the flexibility you want to package the data types in your applications? Does it facilitate interdisciplinary studies?)
There are numerous benefits to using HDF5; see the list in question #3. Yes, the benefits outweigh the challenges.
My group has used various versions of HDF for over 12 years to store our data products. We initially adopted HDF because of a NASA mandate (we had been using netCDF before that). We have stayed with HDF over the years for the reasons listed in question #3. I firmly believe that the benefits far outweigh the programming overhead, and I plan to continue using HDF5 to store data on future projects.
- How much data do/will you provide or archive in HDF5? (number of distinct data products or data sets, total data volume, number of files.)
Our entire HIRDLS data product set from our NASA atmospheric satellite mission is being stored and distributed in HDF5. The delivered/archived data product is currently split into two data files. Each file can contain information for up to 12 different chemical species and also includes information useful in cloud and gravity wave studies. The two files we are currently archiving can total up to 500 Mb per day. There will be one file of each type for each day of the mission. The HIRDLS mission has run from late January 2005 to the present. We expect to reprocess the entire mission a number of times.
- How many users do you have or expect to have for data in HDF5, and what is your expected user community?
Our users will be researchers interested in atmospheric chemistry data and related topics. If you need to know how many people may access our data products, I suggest contacting the Goddard DISC (previously known as the DAAC), which is the main distributor of our data in the US. RAL in the UK will distribute the data overseas. I assume both organizations have projections of user numbers.