Earth System Grid (ESG) accomplishments

The Earth System Grid (ESG) is a DOE-funded project focused on building a Data Grid for climate research that facilitates the management of, and access to, terascale climate model data across high-performance broadband networks. ESG's current and planned functionality includes seamless data access from distributed online and deep storage via several protocols (GridFTP, HTTP, OPeNDAP), accurate metadata description of data holdings, replica management, definition of virtual datasets, server-side data subsetting and processing, and data analysis and visualization. To achieve its purpose, ESG integrates a wide range of IT software packages and components, including recent advances in Grid computing technologies developed by the Globus Alliance.

ESG is a collaboration of NCAR (SCD, CGD, and HAO), Argonne National Laboratory (ANL), Oak Ridge National Laboratory (ORNL), the Program for Climate Model Diagnosis and Intercomparison (PCMDI) at Lawrence Livermore National Laboratory (LLNL), the University of Southern California Information Sciences Institute (USC/ISI), Lawrence Berkeley National Laboratory (LBNL), and Los Alamos National Laboratory (LANL).

Figure: Primary ESG servers

During the past year, the Earth System Grid has come to be recognized as one of the leading IT infrastructures worldwide for accessing and distributing climate model data. The main ESG site at NCAR (http://www.earthsystemgrid.org) currently provides access to a large number of CCSM (Community Climate System Model) and PCM (Parallel Climate Model) datasets, which are stored either on a large disk farm at NCAR or at several deep storage facilities around the country (NERSC, ORNL HPSS, and NCAR MSS). Together these holdings total over 120 TB of data in over 800,000 files.

The NCAR ESG portal has also become the official distribution site for the NCL and PyNGL software, which is widely used to analyze climate model data. Work is under way to set up LANL as a contributing ESG data node, serving ocean data from several different models. Between its release in the summer of 2004 and September 2005, the NCAR ESG portal received over 1,500 registrations and served ~9 TB of data in ~16,000 files.

Figure: ESG website

Additionally, an earlier version of the portal software was used by the PCMDI group at LLNL to set up a portal (https://esg.llnl.gov:8443/) dedicated to the worldwide distribution of IPCC (Intergovernmental Panel on Climate Change) data. The IPCC portal indexes data from 23 different climate models, totaling 26.5 TB in ~60,000 files. Over the course of the IPCC study, the portal served 45 TB of data (~220,000 files), and analysis of that data resulted in ~200 scientific papers on climate change.

On the technical side, the ESG infrastructure has been upgraded in several respects during the past year:

  • Access Control middleware was developed and deployed so that only users with the proper authorization can download data from restricted HTTP servers. In this model, authorization is always performed by the ESG portal, and data requests are then redirected to the distributed data nodes with a short-lifetime authorization token.

  • The Access Control system was integrated with support for Data Mover Light, a Java client that, once installed on the user's desktop, can download a large number of files from the ESG system in a single request.

  • The ESG portal was augmented with a fully web-based application for data publishing, which has greatly simplified and expedited the process of making new (and old) datasets available to the community. This application will soon be complemented by the ability to browse and query the model-run database of a specific contributing project, such as CCSM or PCM, which should be of great use to the scientists and data managers running the climate simulations.

  • The OPeNDAP-G server was modified to better process subsetting requests over virtual aggregated datasets. The new version avoids unnecessary serialization into OPeNDAP format when the input and output formats are both netCDF, which substantially improved performance. Several of the ESG datasets can now be accessed (and subsetted) as virtual aggregates on the ESG portal.

  • The general data access architecture of the ESG system was upgraded in response to the security crisis that (indirectly) affected many national laboratories around the country. The upgrade eventually restored connectivity to ORNL HPSS and NERSC via single, easily monitored service accounts.
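
The portal-side authorization with a short-lifetime token described in the first item above can be illustrated with a minimal sketch. This is not the ESG implementation; it assumes an HMAC-signed token and a secret shared between the portal and the data nodes, and the names SECRET, issue_token, and verify_token are hypothetical. It also assumes that user names and file paths do not contain the "|" character.

```python
import base64
import hashlib
import hmac
import time

# Hypothetical secret shared by the portal and the data nodes (assumption).
SECRET = b"example-shared-secret"


def issue_token(user, path, lifetime_s=300, now=None):
    """Portal side: after authorizing the user, sign (user, path, expiry)."""
    now = time.time() if now is None else now
    expires = int(now) + lifetime_s
    payload = f"{user}|{path}|{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig


def verify_token(token, path, now=None):
    """Data node side: accept only an untampered, unexpired token for this path."""
    now = time.time() if now is None else now
    try:
        encoded, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(encoded.encode())
        _user, token_path, expires = payload.decode().split("|")
        expires = int(expires)
    except ValueError:
        # Malformed token: wrong shape, bad base64, or non-numeric expiry.
        return False
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(sig.encode(), expected.encode())
        and token_path == path
        and now < expires
    )
```

In this sketch the data node never consults a user database: it only checks the signature, the requested path, and the expiry time, which mirrors the idea that authorization decisions stay with the portal while the distributed nodes merely honor a token that is valid for a short time.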

Figure: Data Mover Light

FY2005 Annual Report