CISL 2007 annual report

TeraGrid integration

 
 
Solar physics research uses TeraGrid

The cover of the 2007 TeraGrid Science Highlights brochure showcases a visualization of giant cell convection patterns beneath the surface of the sun. These processes, revealed by a recently developed model that allows scientists to examine inner workings of the sun that are hidden from any current observational technique, are being explored by researchers at the University of Colorado and NCAR using terabytes of data that reside at the Pittsburgh Supercomputing Center and the San Diego Supercomputer Center. This new ability to explore remote data via the TeraGrid, made possible by NCAR's TeraGrid network node and VAPOR software, has the potential to significantly advance U.S. scientists' capacity to rapidly pursue research questions that demand large-scale resources. (Image by Mark Miesch, NCAR.)

 
NCAR's TeraGrid computing resource, frost

Frost is the 2,048-processor IBM Blue Gene/L system at NCAR. One quarter of this resource has been allocated to the TeraGrid, amounting to 4.5 million CPU hours of computer time annually. The system has been in production as a TeraGrid resource since August 1, 2007. Frost is attached to a storage cluster and a visualization node, and it has access to the multi-petabyte NCAR Mass Storage System. As a TeraGrid Resource Provider, NCAR is committed to offering a network of computational, data, and knowledge resources to multidisciplinary groups of researchers, students, educators, policy makers, and impact and assessment communities around the world.

 

The National Center for Atmospheric Research (NCAR) has deployed a portion of its IBM Blue Gene/L (BG/L) supercomputer, named frost, on the TeraGrid. Frost has been an operational TeraGrid resource since August 1, 2007, and is expected to provide 4.5 million CPU hours annually to the TeraGrid research community.
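The 4.5 million CPU hour figure can be checked with simple arithmetic from the numbers given in this report (2,048 processors, one quarter allocated to the TeraGrid). The sketch below is only a back-of-the-envelope check; it assumes a 365-day year with continuous availability, which the report does not state.

```python
# Back-of-the-envelope check of the TeraGrid share of frost.
TOTAL_PROCESSORS = 2048       # frost's processor count, from the report
TERAGRID_FRACTION = 0.25      # one quarter allocated to the TeraGrid
HOURS_PER_YEAR = 24 * 365     # assumption: 365-day year, no downtime


def annual_cpu_hours(processors, fraction, hours_per_year=HOURS_PER_YEAR):
    """Return the CPU hours per year delivered by a share of the machine."""
    return processors * fraction * hours_per_year


teragrid_hours = annual_cpu_hours(TOTAL_PROCESSORS, TERAGRID_FRACTION)
print(f"{teragrid_hours:,.0f} CPU hours/year")          # 4,485,120
print(f"about {teragrid_hours / 1e6:.1f} million")      # about 4.5 million
```

Under these assumptions the share works out to just under 4.5 million CPU hours, consistent with the rounded figure quoted above.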

The operational integration of the BG/L system frost with the TeraGrid involved extensive work by CISL. CISL engineers deployed frost outside the UCAR security perimeter after implementing extensive security measures, installing TeraGrid software, and automating the exchange of accounting data between NCAR's accounting system and the TeraGrid's.

In addition to providing computational resources, NCAR is testing experimental systems and services on the TeraGrid. These include the wide-area versions of parallel file systems from IBM and Cluster File Systems, as well as a remote data visualization capability based on the VAPOR tool, an open source application developed by NCAR, the University of California at Davis, and Ohio State University under the sponsorship of the National Science Foundation.

Operated in partnership with the University of Colorado, frost is the second BG/L system on the TeraGrid, joining the San Diego Supercomputer Center's 6,144-processor system. The NSF TeraGrid uses high-performance networks to integrate supercomputers, data archives, and data analysis facilities around the country. Its coordinated work environment enables researchers throughout the United States to collaborate on especially challenging scientific questions, and to process vast amounts of data that would not be manageable on smaller or isolated computing systems.

This effort supports NCAR's strategic priorities of "Developing and providing advanced services and tools" and "Engaging a broader and more diverse community." NCAR's participation in the TeraGrid is supported through NSF Core funds and UCAR Communications Pool indirect funds.


TeraGrid 2007 detailed accomplishments

The NCAR TeraGrid integration effort in FY2007 shifted from the equipment acquisition and deployment phase that dominated FY2006 to a new phase characterized by security hardening of TeraGrid components; deployment, testing, and migration of the Coordinated TeraGrid Software and Services (CTSS) stack; and integration of accounting software. During this period, testing of the storage cluster expanded to experiments with grid technologies that support wide-area parallel file systems and distributed scientific visualization workflows. The effort culminated on August 1, 2007, with the successful deployment of 25% of the frost resource on the TeraGrid.


TeraGrid security

On October 22, 2006, frost was moved outside the UCAR security perimeter. Before this move could be contemplated, the operating system on the frost front-end login and service nodes had to be hardened, that is, configured to minimize security vulnerabilities that attackers could exploit. Staff in CISL's Research Systems Evaluation Team (ReSET) also acquired equipment for setting up an intrusion-detection system.


TeraGrid user support

The CISL Consulting Services Group integrated its NCAR-based trouble ticket system with the TeraGrid Ticket system. As early as May 11, 2007, NCAR began receiving and responding to user support requests. In FY2007 NCAR also integrated its documentation of the frost system with the overall TeraGrid documentation system.


TeraGrid CTSS software

The integration of the Cobalt scheduler on frost with the GRAM resource management system was completed by March 2007. By April 2007 the NCAR TeraGrid team had completed the initial installation of the most important CTSS-3 TeraGrid software stack components and had begun testing them. John White at SDSC was of great assistance in installing and testing the INCA grid monitoring package. This CTSS installation and testing phase was complicated by the migration from CTSS-3 to CTSS-4 during the summer of 2007. CISL installed numerous independent CTSS subcomponents (kits), including Remote Login, Remote Compute, Data Transfer (e.g. GridFTP), Data Management, Wide Area Parallel Filesystem (e.g. GPFS-WAN), Application Development and Runtime Support, and Science Workflow Support. The Parallel Application Support kit (e.g. MPI, MPICH-G2) was not installed because inherent limitations of the BG/L system precluded it.


Integration of NCAR and TeraGrid accounting systems

Integration with the TeraGrid accounting system is a key requirement. The TeraGrid Connector Code (TGCC), a Java-based software application developed by CISL, integrates NCAR's allocation accounting database with the TeraGrid Central Database. TGCC has enabled NCAR to provide accounting information to the TeraGrid from its BG/L supercomputer and Mass Storage System.

TGCC accepts incoming packets from the TeraGrid that describe project and account requests and updates. TGCC then synchronizes these with NCAR's accounting database and relays the resulting data to frost and the MSS. This mechanism provides access to the system for authenticated TeraGrid users. The result is a virtually seamless ability for TeraGrid users to request NCAR's frost resource via the centralized TeraGrid User Portal and allocation panel, then be automatically set up with access. Their usage is reported back to the TeraGrid Central Database, and they can view it on the TeraGrid User Portal.
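The flow described above (account packets in, local accounting updated, usage reported back) can be illustrated with a small sketch. This is not TGCC itself, which is a Java application whose interfaces are not described in this report. Every class, field, and action name below is hypothetical and chosen only to show the shape of the synchronization.

```python
from dataclasses import dataclass, field


@dataclass
class AccountPacket:
    """Hypothetical stand-in for a TeraGrid project/account request or update."""
    project: str
    user: str
    action: str  # "create" or "deactivate" (illustrative values only)


@dataclass
class LocalAccounting:
    """Hypothetical stand-in for a site accounting database."""
    active: set = field(default_factory=set)    # (project, user) pairs
    usage: dict = field(default_factory=dict)   # (project, user) -> SUs charged

    def apply(self, packet):
        """Apply one incoming packet to the local account list."""
        key = (packet.project, packet.user)
        if packet.action == "create":
            self.active.add(key)
        elif packet.action == "deactivate":
            self.active.discard(key)
        else:
            raise ValueError(f"unknown action: {packet.action}")

    def record_usage(self, project, user, sus):
        """Charge usage, but only for accounts that are currently active."""
        key = (project, user)
        if key not in self.active:
            raise PermissionError(f"{user} has no active account on {project}")
        self.usage[key] = self.usage.get(key, 0.0) + sus


def sync(packets, db):
    """Apply packets in order; return the account list to relay to resources."""
    for packet in packets:
        db.apply(packet)
    return sorted(db.active)


def usage_report(db):
    """Build the records that would be sent back to a central database."""
    return [
        {"project": project, "user": user, "su": sus}
        for (project, user), sus in sorted(db.usage.items())
    ]
```

In this sketch, packets are applied strictly in arrival order and usage is rejected for accounts that are not active, which mirrors the idea that access is set up automatically before any usage is reported.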


Visualization node deployment

The NCAR TeraGrid visualization resource named "twister" was procured, security hardened, and deployed in February 2007. Twister was used to provide VAPOR-based visualization services in collaboration with researchers at the University of Colorado. Studies of the solar interior were run at SDSC, where CU astrophysicists running the Anelastic Spherical Harmonic (ASH) model created hundreds of gigabytes of data. These data were mounted at NCAR using GPFS-WAN and processed into a hierarchical storage format by VAPOR software running on twister. Data in this form could then be visually explored by CU astrophysicists using a simple laptop computer.


Other infrastructure improvements

In late FY2007, CISL made infrastructure improvements to ensure enhanced security, redundancy, and fault tolerance in our TeraGrid cyberinfrastructure. These improvements include a 10 Gbps intrusion-detection system, additional disk to mirror critical system components, and redundant power supply systems.


SRB connection and cross-archiving of priceless datasets

With the assistance of SDSC staff, NCAR installed the Storage Resource Broker (SRB) software necessary to begin backing up files to the SDSC mass storage file system. The archiving of a portion of NCAR's Research Data Archive began in April 2007. The average file size in these collections was just over 200 MB. To date 288,484 files have been transferred, amounting to 59.5 TB of valuable NCAR data now duplicated at SDSC.
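The transfer totals above are consistent with the stated average file size, as the short check below shows. It assumes the 59.5 TB and 288,484 files refer to the same set of transfers, and it computes the mean under both decimal (1 TB = 10^6 MB) and binary (1 TB = 1024^2 MB) conventions, since the report does not say which was used.

```python
# Consistency check: mean file size implied by the reported transfer totals.
FILES_TRANSFERRED = 288_484   # from the report
DATA_TB = 59.5                # from the report


def mean_file_size_mb(total_tb, n_files, binary=False):
    """Return the mean file size in MB for a total volume given in TB."""
    base = 1024 if binary else 1000   # unit convention is an assumption
    total_mb = total_tb * base * base
    return total_mb / n_files


decimal_mb = mean_file_size_mb(DATA_TB, FILES_TRANSFERRED)
binary_mb = mean_file_size_mb(DATA_TB, FILES_TRANSFERRED, binary=True)
print(f"decimal units: {decimal_mb:.1f} MB per file")
print(f"binary units:  {binary_mb:.1f} MB per file")
```

Either convention gives a mean a little above 200 MB per file, in line with the "just over 200 MB" quoted in the text.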


TeraGrid background

TeraGrid is an NSF-funded national facility that integrates computational and data resources with security, accounting, documentation, and education and outreach services from resource providers (RPs) to serve the nation's science and engineering community. Common services and integration processes and components are provided by, or in some cases coordinated by, the Grid Infrastructure Group (GIG). The GIG is responsible for the architecture, planning, management, and enhancement of the TeraGrid facility; for providing a core set of services; and for coordinating RP staff through distributed service teams ranging from user support to security to education, outreach, and training.

The objective of the RPs and GIG is to enable scientific discovery by providing access to the highest-performance resources available, integrated as a coordinated system that supports use cases ranging from exploiting a single TeraGrid resource to combining resources in specialized workflow or cooperative computing modes. Resource integration and enhancement efforts are ranked through user input and evaluations of TeraGrid services as measured by operational, system, or service use metrics.

The set of long-term TeraGrid objectives toward providing cyberinfrastructure to national science and engineering researchers can be expressed in three interdependent sets of activities.

TeraGrid DEEP encompasses a set of initiatives aimed at fully exploiting the integrated capabilities of the TeraGrid facility to support scientific discovery that would not otherwise be possible. The GIG coordinates user support staff to provide both traditional user consulting support and a program called Advanced Support for TeraGrid Applications (ASTA). ASTA assigns user support staff to dedicate 25% of their time for 6-12 months assisting a science group to enable them to fully harness TeraGrid services and resources as an integrated facility.

TeraGrid WIDE recognizes that, traditionally, NSF's high-performance computing infrastructure has focused primarily on only a small fraction of the national science and engineering community. Thus, in addition to supporting a current and growing user community, the aim is to provide TeraGrid services to many more scientists and engineers over the coming years. Such scaling requires a new model for interacting with the community and for provisioning cyberinfrastructure: the creation of science gateways.

TeraGrid's broad-impact goals also extend to students and educators. TeraGrid's Education, Outreach, and Training (EOT) program is a coordinated effort to raise the awareness of the benefits of TeraGrid within research and education communities across all disciplines and all learning levels. The EOT team works closely with the science gateways to engage significantly larger numbers of scientists, educators, and students, with an emphasis on reaching out to under-represented groups.

TeraGrid OPEN involves the provision of a persistent, reliable national cyberinfrastructure. The TeraGrid facility is architected as a set of integrated services based on open standards wherever possible and embracing the heterogeneity represented by nearly 20 unique major resources operated by TeraGrid RPs. OPEN also describes the approach to presenting TeraGrid to NSF and the community as a truly extensible and adaptable facility.


Strategic overview

Continued operation of TeraGrid cyberinfrastructure is a strategic and ongoing activity for CISL. Subsequent out-year upgrades of the TeraGrid infrastructure will be accomplished with CISL's research equipment budget. While modest, this investment should enable CISL and NCAR to continue deploying resources of a scale sufficient to develop Grid expertise and learn vital lessons about providing domain-specific Grid services to NCAR's scientific community.

In particular, plans are already in place to run a procurement process for a BG/L replacement system in FY2008. The procurement will be a two-stage process. In the first stage, an RFI will be run to establish the feasibility of a procurement of this size with the constraints on power, space, and cooling that will be available at NCAR's Mesa Lab facility after the deployment of the IBM POWER-6 system during the second phase of the ICESS procurement. Based on the information gathered in the RFI, the BG/L replacement RFP is scheduled to be run in spring 2008.

As new scientific collaborations and new services emerge on the TeraGrid, CISL will adapt the NCAR TeraGrid appropriately. A collateral goal will be to develop a cadre of NCAR users willing and able to use the TeraGrid, then to use their experiences to guide the development of both the NCAR node and the TeraGrid as a whole.


Project plan evaluation measures

In FY2008, NCAR intends to continue operating the NCAR TeraGrid resource, offering 4.5 million CPU hours (25% of the resource) to TeraGrid users during the year. These CPU hours are accounted for in Service Units (SUs), which are allocated by three committees: the Large Resource Allocations Committee (LRAC) makes awards larger than 500,000 SUs, the Medium Resource Allocations Committee (MRAC) makes awards of 0-500,000 SUs, and the Development Allocations Committee (DAC) makes awards of 0-30,000 SUs for new users and users investigating new architectures. NCAR's allocation plan for its BG/L during FY2008 is:

  • LRAC: 2M SU/year, 1M SU/meeting x 2 times/year

  • MRAC: 2M SU/year, 0.5M SU/meeting x 4 times/year

  • DAC: 0.5M SU/year
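The three lines of the allocation plan add up to the 4.5 million CPU hours on offer, as the tally below shows. The sketch assumes that one SU corresponds to one CPU hour on frost, which this report implies but does not state, and it treats the DAC figure as a single annual pool because no per-meeting breakdown is given.

```python
# Tally of the FY2008 allocation plan for the BG/L, in Service Units (SUs).
def annual_su(su_per_meeting, meetings_per_year):
    """Return the SUs awarded per year by a committee that meets periodically."""
    return su_per_meeting * meetings_per_year


plan = {
    "LRAC": annual_su(1_000_000, 2),   # 1M SU/meeting, 2 meetings/year
    "MRAC": annual_su(500_000, 4),     # 0.5M SU/meeting, 4 meetings/year
    "DAC": 500_000,                    # given only as an annual figure
}

total_su = sum(plan.values())
for committee, su in plan.items():
    print(f"{committee}: {su:,} SU/year ({su / total_su:.0%} of plan)")
print(f"Total: {total_su:,} SU/year")   # Total: 4,500,000 SU/year
```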

 

Building a TeraGrid user base for the BG/L system at NCAR and procuring a replacement system for it will be CISL's primary goals for the TeraGrid activity in FY2008. Existing NCAR and University of Colorado users of the BG/L system will be encouraged to use it as TeraGrid users. We expect that additional allocation cycles will lead to new external users with large allocations. In FY2008 the following specific milestones will be met:

  • CISL will resolve the incompatibilities between the UCAR and TeraGrid security models that have prevented TeraGrid access to the Earth System Grid (ESG) data holdings at NCAR.

  • NCAR accounting services will implement support for a new storage accounting model currently being developed by TeraGrid working groups.

  • ReSET will complete the integration of a TeraGrid queue wait time prediction system.

 

Also in FY2008, NCAR will continue to develop a portfolio of TeraGrid-based scientific and technical partnerships. In particular, NCAR has established partnerships with ORNL, PSC, and Indiana University for conducting Lustre-WAN testing with these TeraGrid resource providers. This testing and development is expected to continue. A similar working relationship exists with SDSC regarding GPFS-WAN and experimental pNFS filesystem deployment, development, and use.

NCAR will continue collaborating with SCEC and CU to leverage useful properties of VAPOR software. Once the security issues are resolved, allowing ESG transfers over the TeraGrid's 10 Gbps fabric, CISL will continue with its ESG-over-TeraGrid federation activities with ORNL, and will approach Purdue's Climate Center to develop new areas of collaboration in science gateway development. NCAR will also pursue the development of a new science gateway in astroseismology in collaboration with NCAR's High Altitude Observatory Division.


Impacts

Integration with the TeraGrid is already having an impact. In particular:

  • NCAR and the TeraGrid have learned from each other. For example, NCAR has provided important guidance on how to integrate new resources, and the TeraGrid has introduced NCAR to wide-area parallel filesystems such as GPFS-WAN.

  • Integration with TeraGrid has created new and, in many cases, unexpected opportunities for scientific collaboration. For example, current collaborations with SCEC and CU would not have happened otherwise.

  • Domain-specific and multidisciplinary computing models can complement each other. This is illustrated through science gateways activities such as the Earth System Grid.

  • The whole appears to be greater than the sum of its parts.

 

NCAR's access to and integration with TeraGrid resources will help ensure the consistency and integration of the NSF's cyberinfrastructure plans, particularly between the Office of Cyberinfrastructure (OCI) and the Geoscience Directorate. The connection itself is expected to increase the ability of NCAR scientists and geoscientists to collaborate using TeraGrid resources. The resulting collaborations will likely center around data exchanges at first, but will inevitably expand into other aspects of scientific workflows such as the sharing or coscheduling of HPC resources, like those demonstrated at SC06 in Tampa, Florida.


Sponsorship

TeraGrid activities at NCAR are supported by NSF Core funding.