CISL Annual Report

Climate Data Compression Research

  Accuracy of new lossy compression
  This image compares the temperature field output from a high resolution POP ocean simulation (left) with a version of the data that has been lossily compressed by a factor of 20 to 1 (right). CISL's nascent Climate Data Compression Research project is investigating wavelet-based lossy data compression techniques applied to geosciences data. Lossy compression techniques may become essential to exploring, managing, and distributing ever-growing scientific data sets.

The Climate Data Compression Research project is a new effort, begun in FY 2006. Building on the success of CISL's VAPOR work, which employs wavelet-based progressive data access to permit the exploration of terascale data sets, CISL began investigating wavelet-based lossy data compression techniques for climate model simulation outputs. The methods employed are similar to those now widely used to compress digital media. The goals of this nascent work are to:

  • Determine whether, and to what degree, scientific data sets can tolerate information loss
  • Investigate a variety of compression methods and determine which may be most appropriate for geosciences data
  • If these methods prove successful, develop user tools for data compression
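
The first goal, quantifying how much information loss a data set can tolerate, can be made concrete with a minimal sketch. It is not CISL's implementation: it assumes the simplest wavelet (Haar), a one-dimensional synthetic signal standing in for model output, and compression by keeping only the largest-magnitude coefficients. All function names and the test signal are hypothetical.

```python
import numpy as np


def haar_forward(signal):
    """Multi-level orthonormal Haar transform; length must be a power of two."""
    coeffs = np.asarray(signal, dtype=np.float64).copy()
    n = coeffs.size
    if n < 1 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    while n > 1:
        half = n // 2
        evens = coeffs[0:n:2].copy()
        odds = coeffs[1:n:2].copy()
        coeffs[:half] = (evens + odds) / np.sqrt(2.0)   # scaled averages
        coeffs[half:n] = (evens - odds) / np.sqrt(2.0)  # detail coefficients
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    signal = np.asarray(coeffs, dtype=np.float64).copy()
    n = 2
    while n <= signal.size:
        half = n // 2
        avg = signal[:half].copy()
        det = signal[half:n].copy()
        signal[0:n:2] = (avg + det) / np.sqrt(2.0)
        signal[1:n:2] = (avg - det) / np.sqrt(2.0)
        n *= 2
    return signal


def keep_largest(signal, ratio):
    """Transform, then zero all but the largest 1/ratio of the coefficients."""
    coeffs = haar_forward(signal)
    keep = max(1, coeffs.size // ratio)
    largest = np.argpartition(np.abs(coeffs), -keep)[-keep:]
    sparse = np.zeros_like(coeffs)
    sparse[largest] = coeffs[largest]
    return sparse


def relative_l2_error(reference, approximation):
    """Norm of the difference divided by the norm of the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    approximation = np.asarray(approximation, dtype=np.float64)
    return np.linalg.norm(reference - approximation) / np.linalg.norm(reference)


if __name__ == "__main__":
    # Hypothetical smooth, temperature-like signal (not real model output).
    x = np.linspace(0.0, 1.0, 4096)
    field = 15.0 + 10.0 * np.sin(2 * np.pi * x) + 2.0 * np.sin(12 * np.pi * x)

    kept = keep_largest(field, ratio=20)
    restored = haar_inverse(kept)
    print("coefficients kept:", np.count_nonzero(kept), "of", kept.size)
    print("relative L2 error:", relative_l2_error(field, restored))
```

The ratio here counts retained coefficient values only. A practical scheme would also have to encode which coefficients were kept, and would likely use smoother wavelets than Haar.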

The exponential growth in transistor density predicted by Moore's law has led to ever-increasing computer processing power, enabling computational scientists to numerically simulate physical phenomena at unprecedented scales and, in turn, to generate extraordinary amounts of data. For example, the recent IPCC work yielded over 100 terabytes of climate model output. While microprocessor performance continues to double roughly every 18 months, other computing technologies are improving at much more modest rates. In particular, storage and networking bandwidths have lagged behind. As a result, storing, analyzing, managing, and sharing large simulation data sets is becoming increasingly difficult, hampering scientific productivity. Lossy signal compression techniques, such as those ubiquitous in digital media and now being investigated by CISL, may provide relief for researchers drowning in a deluge of data.

While extending the lossy compression techniques developed for integer digital media to floating-point scientific data is relatively straightforward, the scale of the data and the desire to preserve essential data properties, such as smooth derivatives, introduce many subtle challenges. In FY 2006, CISL has been working with domain scientists to identify promising wavelet decompositions that maintain essential data properties while yielding high compression ratios. Progress has also been made in developing computationally efficient algorithms for handling very large data sets. CISL is currently collaborating with three CGD groups, each with unique needs. CGD's Frank Bryan is providing POP ocean simulation data with very high spatial resolutions (e.g., 3600 x 2400). Grant Branstator's atmospheric data, on the other hand, spans hundreds of thousands of time steps but has low spatial resolution. Finally, CISL is working with Earth System Grid (ESG) staff to deliver compressed CCSM data over the web via the ESG. All of these efforts are works in progress. Should these compression strategies prove viable, work in FY 2007 will focus on developing end-user tools. We will also broaden our base of scientific collaborators to further refine and validate this novel approach to tackling large scientific data sets.
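
The concern about derivatives can be illustrated with a small sketch, offered under stated assumptions rather than as the project's method. Two hypothetical stand-in reconstructions store the same number of values, one mean per block of samples: a piecewise-constant one (what truncating a Haar decomposition to its coarse levels produces) and a piecewise-linear one. The signal, the block size, and all function names are illustrative only.

```python
import numpy as np


def relative_l2_error(reference, approximation):
    """Norm of the difference divided by the norm of the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    approximation = np.asarray(approximation, dtype=np.float64)
    return np.linalg.norm(reference - approximation) / np.linalg.norm(reference)


def derivative_error(reference, approximation, spacing):
    """Relative L2 error between finite-difference derivatives."""
    d_ref = np.gradient(reference, spacing)
    d_apx = np.gradient(approximation, spacing)
    return relative_l2_error(d_ref, d_apx)


def block_means(signal, block):
    """One mean per block; signal length must be a multiple of block."""
    return np.asarray(signal, dtype=np.float64).reshape(-1, block).mean(axis=1)


def constant_reconstruction(signal, block):
    """Piecewise-constant stand-in: repeat each block mean across its block."""
    return np.repeat(block_means(signal, block), block)


def linear_reconstruction(signal, block):
    """Piecewise-linear stand-in: interpolate between means at block centers."""
    means = block_means(signal, block)
    centers = np.arange(means.size) * block + (block - 1) / 2.0
    # np.interp holds the end values constant outside the first/last center.
    return np.interp(np.arange(len(signal)), centers, means)


if __name__ == "__main__":
    # Hypothetical smooth, temperature-like signal (not real model output).
    x = np.linspace(0.0, 1.0, 4096)
    h = x[1] - x[0]
    field = 15.0 + 10.0 * np.sin(2 * np.pi * x) + 2.0 * np.sin(12 * np.pi * x)

    for name, rebuild in [("constant", constant_reconstruction),
                          ("linear", linear_reconstruction)]:
        approx = rebuild(field, 16)
        print(name,
              "value error:", relative_l2_error(field, approx),
              "derivative error:", derivative_error(field, approx, h))
```

On this synthetic signal both stand-ins should reproduce the values closely, while the piecewise-constant one misrepresents the gradient far more. A small error in the values therefore does not by itself guarantee usable derivatives, which is one reason the choice of wavelet decomposition matters.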

This research supports NCAR's strategic priorities of "Developing and providing advanced services and tools" and "Conducting research in computer science, applied mathematics, statistics, and numerical methods." It is made possible by NSF Core funding.