August 24, 2006
Earth System Grid: Easy Web access to terascale climate data
ESG provides the geosciences community with model output, source code, initialization datasets, and tools for data publishing, analysis, and visualization
Just a few years ago, accessing the climate data in deep storage at various U.S. research centers was a daunting task for the global geosciences community. Different institutions formatted, organized, and served data in different ways. Authentication procedures and transfer protocols varied from site to site. Output from a single climate simulation was often archived in thousands of files. Data retrieval was complex, inefficient, and tedious.
“Climate models were running on supercomputers at many sites, putting out enormous quantities of data,” says Don Middleton of the Scientific Computing Division at the National Center for Atmospheric Research (NCAR). “One or two specialists at each site would know where the data were, and even they weren’t always sure. Researchers would contact them and they’d go off and find the data. That worked, but it didn’t scale very far at all. At the same time, there was a general sense that these data were of value to people all over the world interested in climate change research and environmental-impacts assessments.”
Since 2001, a partnership sponsored by the Department of Energy’s Office of Science under the auspices of the Scientific Discovery through Advanced Computing Program (SciDAC) has been working to make these data more generally available. Collaborators in the partnership, which spans DOE, the National Science Foundation, and the university community, include:
Principal investigators Ian Foster of Argonne National Laboratory (ANL), Don Middleton of NCAR, and Dean Williams of Lawrence Livermore National Laboratory (LLNL) lead a team of nearly two dozen computer scientists, application developers, modelers, and Grid computing experts who are tackling the problem of distributed terascale data.
ESG: A large DataGrid
The result of their efforts is the innovative Earth System Grid (ESG), a virtual collaborative environment that links distributed centers, users, models, and data.
“ESG is a large DataGrid that provides data from computational and storage resources that are geographically distributed and not under centralized control,” says Middleton. “That’s the core of what Grid computing is all about: harnessing a collection of heterogeneous resources across different system administration groups and even across agencies, across various security boundaries and institutional policies.”
ESG makes terascale climate data as easy to access as Web pages. Its main entry points are two Web portals: one for general climate research data (https://www.earthsystemgrid.org) and another dedicated to the activities of the Intergovernmental Panel on Climate Change (IPCC) (https://esg.llnl.gov:8443). Through these portals, modelers and data managers can publish their datasets. Users can register, search, browse, and acquire the data they need.
The measure of success
“Early on, we didn’t know how many people would be interested in these data; we were hoping maybe a few hundred,” notes Middleton. “But we found out that there’s a big audience out there, quite a large group who are interested, and for many different reasons.”
Overall, ESG now has more than 3,200 registered users worldwide, ranging from climate scientists and university researchers to private companies and K-12 educators. Since it went live in 2004, ESG has become recognized as a leading infrastructure for accessing and distributing climate model data:
Integrating many technologies
Accessing climate data via a Web page may look easy, but making it possible was not. Developing and deploying the system required serious problem solving, says Luca Cinquini, a software engineer in NCAR’s Scientific Computing Division who led ESG portal development.
“We had to integrate many pieces of Grid technology, such as the Globus Toolkit, with technologies that are common in the business world, like Java, Tomcat, and Web portals,” says Cinquini. “We learned that sometimes it wasn’t as easy as might have been expected to apply new technologies to real-world applications. In some cases, we found scalability problems: things would be fine when we were serving small amounts of data, but when we started publishing more data and the size of our databases increased, we’d run into performance issues.”
Today, having overcome numerous technical challenges, ESG is at the forefront of Grid technology:
Effectively delivering data to the community
“My view of ESG is that it’s been an immensely successful project for a number of reasons,” says Peter Fox, chief computational scientist of NCAR’s High Altitude Observatory and a member of the ESG development team. “ESG is a significant implementation of a highly distributed collaboration with access to large amounts of climate data. It’s not a prototype environment, it’s not a development environment; it’s a real production infrastructure, delivering a lot of data to the community in a very effective way. It allows scientists to focus on science rather than on the excruciating details of how data is organized, formatted, or transferred.”
“I’m tremendously pleased with ESG,” adds Gary Strand, a software engineer and data manager in NCAR’s Climate and Global Dynamics Division (CGD), who acts as a liaison between climate modelers and ESG developers. “I think back to a few years ago to when we had fewer data holdings and dealt with a small community; even then it was stretching our limits to fill data requests. These days, we’re making a lot of data available to people more efficiently than before. Researchers can get into archival systems without having an account at those sites. ESG is a groundbreaking first attempt to get these literally millions of files and terabytes of data out to the wide world; it’s a resounding success.”
Managing enormous datasets
William Collins, a CGD scientist who developed one of the first techniques for integrating aerosol data into global climate models, serves on the Scientific Steering Committee of the Community Climate System Model (CCSM). He notes that the dataset generated by CCSM, about 100 terabytes of simulation data in total, is the largest ever generated by a community model from NCAR. ESG is playing an increasingly important role in helping researchers find patterns and meaning in these data.
“Traditional tools for managing datasets break down once you reach the volume of model simulations that we now attain,” Collins says. “We’re working with ESG software engineers and computer scientists to figure out how best to provide an entry point and a hierarchical method for exploring these enormous datasets. This will make it much easier for us to intercompare features in different simulations; for example, to look at how the physics of the climate respond to different climate change scenarios.
“ESG capability will become even more critical once we create our first-generation Earth System model, otherwise known as CCSM4. The output generated by that model will be so large and rich that we will need to partner closely with ESG to exploit the data ourselves, as well as to provide the data as a resource to the wider community.”
ESG and other scientific projects
Other projects are benefiting from ESG innovations, says Fox, whose team in NCAR’s High Altitude Observatory helped to develop OPeNDAP-g, a Grid extension to OPeNDAP, for ESG. (OPeNDAP, which stands for the Open-source Project for a Network Data Access Protocol, is a widely used protocol for scientific data networking.) “DOE’s investment in ESG has had a scientific impact, a technical impact, and a broader impact, allowing us to build production software that we can actually use in other efforts,” he says.
“The work we’ve done with OPeNDAP-g has been reintegrated back into the community release of OPeNDAP, so the entire community outside ESG will reap the benefits of the improvements. That’s everyone from the ocean sciences and other atmospheric sciences to space sciences. It’s being used by a large number of researchers all over the world, as far away as Australia, Europe, and Japan.”
The Virtual Solar-Terrestrial Observatory, a National Science Foundation–funded joint project of NCAR and McGuinness Associates, also uses OPeNDAP-g as well as the ESG catalog and portal infrastructures, as do NASA’s Semantically Enabled Science Data Integration (SESDI) and Sun-Earth Connection Distributed Data Service (SECDDS) projects. The Portal User Registration Service (PURSe), an NSF middleware initiative, has adopted elements of ESG’s security design.
ESG has had productive collaborations and interactions with a number of national and international groups, including the University Corporation for Atmospheric Research’s Unidata program, the Earth System Modeling Framework (ESMF), the Global Organization for Earth System Science Portals (GO-ESSP), the National Operational Model Archive and Distribution System (NOMADS) of the National Oceanic and Atmospheric Administration (NOAA), NOAA’s Geophysical Fluid Dynamics Laboratory, the British Atmospheric Data Centre (BADC), NSF’s Linked Environments for Atmospheric Discovery (LEAD) project, and the Geosciences Network (GEON).
In addition, ESG partners Oak Ridge National Laboratory (ORNL) and NCAR are both now members of the NSF TeraGrid effort. ESG will be exploring opportunities to collaborate with the TeraGrid community and expand its capabilities further.
Making research easier, better, and faster
Networked knowledge increases the rate of scientific discovery. By fostering multidisciplinary partnerships, ESG encourages the study of complex problems such as climate variability from many perspectives.
ESG’s system of interlinked data and resources helps researchers do their work more easily, better, and faster.
“Originally, we just wanted to build a system to make climate model data available to the world,” says Middleton. “But what happened was that we have built the beginnings of a science gateway, where models, source code, initialization datasets, post-processing applications, and analysis and visualization tools are all in one place, for use by a common user community. And all with access control and metrics. It’s a delightful outcome, and not particularly expected.”
That could be just the beginning. “Right now, we’re looking at what the scientific landscape is going to be in 2012,” Middleton says. “We’re planning a comprehensive, next-generation cyberinfrastructure that spans data management, access, analysis, and visualization in a distributed knowledge environment.”
Such an infrastructure could extend the frontiers of climate research and lead to the solution of some of today’s most compelling scientific mysteries.
By Lynda Lester
Photos: Lynda Lester, NCAR/CISL
The Scientific Computing Division (SCD) of the Computational and Information Systems Laboratory (CISL) is part of the National Center for Atmospheric Research (NCAR) in Boulder, Colorado. NCAR is operated by the University Corporation for Atmospheric Research under the primary sponsorship of the National Science Foundation.