Scientific data compression research
The Scientific Data Compression Research project began in FY2006. Building on the success of CISL's VAPOR work, which employs wavelet-based progressive data access to permit the exploration of terascale data sets, CISL began investigating wavelet-based lossy data compression techniques for a variety of model simulation outputs. The methods employed are similar to those now widely used in the compression of digital media. The goals of this work are to:
- Determine whether, and to what degree, scientific data sets can tolerate information loss
- Investigate a variety of compression methods and determine which may be most appropriate for geoscience data
- If successful, develop user tools for data compression
Using temperature data from the 1/10-degree POP ocean model, shown from left to right are the original data, a 64:1 compression, and a 512:1 compression. Image quality remains high even at high compression rates. This work holds potential to significantly improve our ability to store and visualize the data produced by high performance computers.
Exponential growth in transistor density is producing ongoing increases in computer processing power. These increases enable computational scientists to create numerical simulations of physical phenomena at unprecedented scales, which in turn generate extraordinary amounts of data. For example, the recent IPCC work yielded over 100 terabytes of climate model data. While microprocessor performance continues to double roughly every 18 months, other computing technologies are improving at much more modest rates. In particular, storage and networking bandwidths have lagged behind. As a result, storing, analyzing, managing, and sharing large simulation data sets is becoming increasingly difficult, which hampers scientific productivity. Lossy signal compression techniques, such as those ubiquitously used for digital media and now being investigated by CISL, may provide relief for researchers drowning in data.
While extending integer, digital-media, lossy compression techniques to floating-point scientific data is relatively straightforward, the scale of the data and the desire to preserve essential data properties (such as smooth derivatives) introduce many subtle challenges. In FY2007, CISL continued working with a number of domain scientists to identify promising wavelet decompositions for maintaining essential data properties while applying high compression rates. Progress was also made in developing computationally efficient algorithms for handling very large data sets. CISL collaborated with three CGD groups at NCAR and one external group: the Southern California Earthquake Center (SCEC). Experiments with the CGD groups achieved varying degrees of success: the degree of compression that could be tolerated was found to be highly sensitive to the type of operation to be subsequently performed. In some cases, data could be aggressively compressed, while in others only minimal compression was possible. More promising results were achieved with SCEC, where the set of data operators is small and known in advance, and where preliminary work suggests that fairly aggressive compression can be tolerated.
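The sensitivity to the subsequent operation can be illustrated with a small sketch. It is not CISL's method: it assumes the simplest possible wavelet (an orthonormal Haar transform), a synthetic one-dimensional signal, and a "keep the largest coefficients" rule for the lossy step. All function names here are hypothetical. The idea is only to show that the same compressed data can be a good approximation of the values themselves while being a poor approximation of their derivative.

```python
import math

def haar_forward(values):
    """Multi-level orthonormal Haar transform; len(values) must be a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    s = math.sqrt(2.0)
    while n > 1:
        half = n // 2
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / s for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / s for i in range(half)]
        coeffs[:n] = averages + details
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Invert haar_forward."""
    values = list(coeffs)
    s = math.sqrt(2.0)
    n = 1
    while n < len(values):
        merged = []
        for a, d in zip(values[:n], values[n:2 * n]):
            merged.append((a + d) / s)
            merged.append((a - d) / s)
        values[:2 * n] = merged
        n *= 2
    return values

def lossy_roundtrip(values, ratio):
    """Keep only the len(values)/ratio largest-magnitude coefficients, then reconstruct."""
    coeffs = haar_forward(values)
    keep = max(1, len(coeffs) // ratio)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    truncated = [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
    return haar_inverse(truncated)

def relative_rms_error(reference, approximation):
    num = sum((r - a) ** 2 for r, a in zip(reference, approximation))
    den = sum(r * r for r in reference)
    return math.sqrt(num / den)

def first_difference(values):
    """Crude stand-in for a derivative operator applied after decompression."""
    return [b - a for a, b in zip(values, values[1:])]

# Synthetic smooth signal (an assumption; not model output).
n = 1024
signal = [math.sin(2 * math.pi * i / n) + 0.5 * math.sin(6 * math.pi * i / n)
          for i in range(n)]

restored = lossy_roundtrip(signal, ratio=16)
value_error = relative_rms_error(signal, restored)
derivative_error = relative_rms_error(first_difference(signal),
                                      first_difference(restored))
print("relative error in values:     ", value_error)
print("relative error in differences:", derivative_error)
```

Because a truncated Haar expansion is piecewise constant, the differences of the restored signal are mostly zero with occasional jumps, so the derivative error is far larger than the value error. Smoother wavelet families would be expected to behave better in this respect, which is one reason the choice of decomposition matters.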
All of these efforts are works in progress. In FY2008 we will continue experiments with various groups, in particular with SCEC on its seismic simulation data. We will also explore parallelization of our algorithms to prepare for petascale computing, and we will continue to investigate more efficient algorithms (both computationally and in storage requirements) for encoding compressed data.
This research supports NCAR's strategic priorities of "Developing and providing advanced services and tools" and "Conducting research in computer science, applied mathematics, statistics, and numerical methods." This work is made possible by NSF Core funding.
