CISL 2007 Annual Report

Experimental computing systems: Blue Gene/L

 
 
Blue Gene/L processor rack

The IBM Blue Gene/L computer "frost" at NCAR was innovative in more ways than just the shape of its cabinet when it was installed in 2005. Its high-performance, low-power-consumption design had great potential, but the system required R&D to become a useful tool for geosciences research. That work has been very successful, and frost is now performing at a high level in multiple roles for universities, NCAR, and the TeraGrid.

 

In March 2005, NCAR became one of the first sites in the world to receive an IBM Blue Gene/L (BG/L) supercomputing system. The system, named frost, consists of a single BG/L rack (2,048 compute processors, 64 I/O processors, 5.73 teraflops peak) and ranked 61st on the 25th Top500 List, released in June 2005. Frost was installed as an experimental system to support researchers from NCAR, the University of Colorado at Boulder, and the University of Colorado at Denver who are investigating and addressing the technical obstacles to achieving practical petascale computing in geoscience, aerospace engineering, and mathematical applications. The opportunity to experiment with systems like BG/L is essential for NCAR to maintain its ability to provide capability and capacity supercomputing to the community. Moreover, low-power systems like BG/L (only 25 kW for 5.73 teraflops) offer the promise of significantly reducing the strain on the NCAR Mesa Lab's computing facility.
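The figures quoted above can be turned into two simple ratios: peak performance per compute processor and peak performance per watt. The short Python sketch below is only an illustration of that arithmetic. The function names are hypothetical, the inputs are the theoretical peak and the power figure as stated in the text, and the results say nothing about sustained application performance.

```python
# Figures quoted in the report text (theoretical peak, not sustained performance).
PEAK_TERAFLOPS = 5.73      # peak for the single BG/L rack
POWER_KW = 25.0            # power draw as quoted in the text
COMPUTE_PROCESSORS = 2048  # compute processors only; the 64 I/O processors are excluded


def gflops_per_processor(peak_teraflops: float, processors: int) -> float:
    """Peak gigaflops per compute processor (hypothetical helper)."""
    return peak_teraflops * 1000.0 / processors


def mflops_per_watt(peak_teraflops: float, power_kw: float) -> float:
    """Peak megaflops per watt (hypothetical helper)."""
    return (peak_teraflops * 1e6) / (power_kw * 1e3)


per_proc = gflops_per_processor(PEAK_TERAFLOPS, COMPUTE_PROCESSORS)
efficiency = mflops_per_watt(PEAK_TERAFLOPS, POWER_KW)

print(f"Peak per compute processor: {per_proc:.2f} GFLOPS")
print(f"Peak power efficiency: {efficiency:.1f} MFLOPS per watt")
```

With the quoted inputs this works out to roughly 2.8 gigaflops of peak performance per compute processor and about 229 megaflops per watt.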

To consolidate experimental computer system research in CISL, the Research Systems Evaluation Team (ReSET) was formed in late 2005. ReSET's mission is to administer and evaluate strategically selected experimental systems so that CISL gains the most knowledge of, and impact from, emerging technologies. ReSET is housed in CISL's Computer Science Section, but it collaborates with staff members from other sections and groups across CISL to accomplish its mission. In mid-November 2005, frost became the first experimental system managed by ReSET.

During FY2007, members of ReSET continued working with the frost user community to significantly increase both the number and breadth of applications that can be run on frost. One example of this effort's impact is the development of a version of CCSM capable of exploiting massively parallel computing platforms like BG/L. Although frost was an experimental system, ReSET's success in expanding the user and application code base produced system usage levels similar to those of the production supercomputing systems managed by CISL.

In addition to providing user support, the ReSET team continues to work through the Blue Gene consortium and SP-XXL to improve the BG/L system software stack and influence development of the software stack for the follow-on system, Blue Gene/P. One example of this effort is the collaboration with Argonne National Laboratory to further develop Cobalt, the queuing system currently being used on frost, by incorporating alternate scheduling strategies and interfacing it with the Coordinated TeraGrid Software and Services software stack.

Frost became a production TeraGrid resource on August 1, 2007, with 25% of its cycles devoted to supporting NSF TeraGrid science activities. In FY2008, frost will continue to be used both as an experimental research system for university and NCAR users and as a production computing resource on the TeraGrid.

The role frost plays for the research community now addresses three of NCAR's strategic priorities: "Conducting computer science, computational science, applied mathematics, statistics, and numerical methods R&D," "Developing and providing advanced services and tools," and "Enhancing capability and capacity of NCAR supercomputing." This work is made possible through NSF MRI Grants CNS-0421498, CNS-0420873, and CNS-0420985, and through the IBM Shared University Research (SUR) program with the University of Colorado. NSF Core funding also supports this system.