Review-2
Earth Observing Mission
Review of HDF5 operational readiness:
NASA's Earth Science Data Systems Standards Process Group (SPG) is considering HDF5 for adoption as a community standard. This is the second review of HDF5, this one focusing on its readiness for operational use. The questions below are provided to guide feedback from data systems, application providers, instrument teams and others. You only need to answer questions applicable to you. Please send comments to spg-rfc-007@lists.nasa.gov.
- Describe in a sentence or two your overall experience related to HDF5 (e.g., science data provider, science data systems, software tools developer, and science data user, etc).
As systems engineer for data architecture within the NPOESS integrated program office, my role is to assure that the design of the NPP and NPOESS products is adequate to serve operational and science needs for these data.
- Do you currently use or plan to use HDF5 in a production setting? What types of applications do you use with HDF5? Is HDF5 applicable to your applications (e.g., Does it work well with the data types and data manipulations in your application?)
HDF5 is the data distribution standard for the NPOESS program. All data products produced by the NPOESS program will use HDF5, whether they are transmitted directly to operational users or delivered through the archive to science users. NPOESS data will be used in weather observation, modeling and prediction, as well as in climate science analysis.
- Why do you choose to use HDF5 over other data formats for your applications?
The purpose of using HDF5 is to assure platform independence and efficient access. The HDF5 data model is well suited to the information model for NPOESS products. Features that are particularly important to us are the grouping of related data into HDF5 groups and the association of attributes with both groups and datasets. We also use HDF5 region and object references to delineate regions within a product that are associated with a particular processing instance. HDF5 is also a Defense Information Systems Agency (DISA) recognized "standard".
- Have you or your users encountered any difficulty when using some of the data access or visualization tools (e.g., IDL, GrADS, ...) on HDF5 data files? If you have, please provide a brief description of your experience.
We have not tried to access our products using data visualization tools other than HDFView.
Some practices in our products are not directly supported by HDF5, and we anticipate that there may be some difficulty in using common visualization tools to access these data. In particular, it is common in so-called "binary" formats to pack small data items (smaller than a byte) into "bitfields". Such packed data is then stored using a compiler native type, such as "unsigned character". Such data structures are common in NPOESS products and are not directly supported by HDF5. We understand that there are alternate methods for efficiently storing such data in HDF5 and that our particular implementation choice obscures the data, thereby making it difficult for tools to access. The affected information is mostly "element quality flag" kinds of values. Another common practice is the use of linear scaling as a data compression technique. Because there was no standard implementation of predefined linear scaling for HDF5 datasets at the time we designed our products, our particular implementation is not likely to be understood by common visualization tools.
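The two practices described above can be illustrated with a minimal sketch. This is not the actual NPOESS product layout: the bit positions, field names, scale factor, and offset below are hypothetical, chosen only to show why a tool that sees a plain unsigned byte or integer cannot interpret the values without external knowledge of the packing and scaling conventions.

```python
# Hypothetical quality-flag byte (assumed layout, for illustration only):
#   bits 0-1 : overall quality   (0..3)
#   bit  2   : land/water flag   (0..1)
#   bits 3-5 : cloud confidence  (0..7)

def pack_fields(quality, is_water, cloud):
    """Pack three small values into a single byte-sized integer."""
    return (quality & 0b11) | ((is_water & 0b1) << 2) | ((cloud & 0b111) << 3)

def unpack_field(byte, shift, width):
    """Extract `width` bits starting at bit `shift` (bit 0 is least significant)."""
    mask = (1 << width) - 1
    return (byte >> shift) & mask

def scale_to_int(value, scale, offset):
    """Linear scaling: store a physical value as a 16-bit unsigned integer."""
    stored = round((value - offset) / scale)
    if not 0 <= stored <= 0xFFFF:
        raise ValueError("value outside the range representable with this scale/offset")
    return stored

def unscale(stored, scale, offset):
    """Recover the approximate physical value from the stored integer."""
    return stored * scale + offset

# A reader needs the layout above to recover the three flags from one byte.
packed = pack_fields(quality=2, is_water=1, cloud=5)
quality = unpack_field(packed, shift=0, width=2)
is_water = unpack_field(packed, shift=2, width=1)
cloud = unpack_field(packed, shift=3, width=3)

# A reader needs the scale and offset to recover a temperature in kelvin.
SCALE, OFFSET = 0.01, 150.0   # hypothetical values
stored = scale_to_int(273.15, SCALE, OFFSET)
recovered = unscale(stored, SCALE, OFFSET)
```

A general-purpose viewer would display only `packed` and `stored`. Unless the layout and the scale and offset are recorded in a form the tool recognizes, for example as attributes following a shared convention, the flags and physical values remain hidden from it.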
Also, we understand that the HDF5 region reference and object reference APIs are not fully supported by most visualization tools. In particular, our chosen visualization tool, HDFView, does not support them. Our particular use of region references is to reference single hyperslabs within a larger dataset, but the reference structures in HDF5 can be considerably more complex than that, which makes general-purpose implementation of the API difficult for visualization software.
- Does the performance of HDF5 you have experienced meet your requirements? (e.g., Can it handle the data types in your applications? Does it take a long time to read and write HDF5 files?)
Performance meets our requirements. While the overhead of packaging our data in HDF5 is measurable, it is negligible compared to the cost of computing the data, and even compared to simple data transfer. The computational and storage overhead of HDF5 is quite small in our experience.
- What operational challenges or limitations does HDF5 present? (e.g., Does it take a long time to learn how to use it? Does it require advanced processing power, large amounts of memory, complex configuration, etc)
HDF5 does require learning a new API for access to data products. The API is somewhat complex, but not more so than other object-based data APIs. For example, it is similar in difficulty to such APIs as libxml for access to data stored in XML. As with other data manipulation APIs, it is relatively easy to learn basic access techniques but somewhat more difficult to learn to make efficient use of the API.
Our products are still in the design stage, so the use of HDF5 has been limited to computer science professionals and our customer community has not yet felt the impact of our design decisions. We expect to learn more about the challenges that HDF5 presents to our customers in the next few years as our program transitions from design to operations.
- What benefits does HDF5 present? Do the benefits of HDF5 outweigh the challenges? (e.g., Does it offer the flexibility you want to package the data types in your applications? Does it facilitate interdisciplinary studies?)
The benefit that we rely on is cross-platform portability of data. This benefit outweighs the challenge.
- How much data do/will you provide or archive in HDF5? (number of distinct data products or data sets, total data volume, number of files.)
On the order of hundreds of collections of data products, totaling several terabytes per day over many years. Total data volume is expected to be on the order of 3.8 terabytes per day for NPP and 7.8 terabytes per day for NPOESS.
- How many users do you have or expect to have for data in HDF5, and what is your expected user community?
Thousands of downstream users, but the principal users will be several large data centers.