Experimental Computing Systems: Blue Gene/L
In March 2005, NCAR became one of the first sites in the world to receive an IBM Blue Gene/L (BG/L) supercomputing system. The system, named frost, consists of a single BG/L rack (2,048 compute processors, 64 I/O processors, 5.73 TFLOPS peak) and ranked as the 61st fastest computer in the world on the 25th Top500 List (released in June 2005). Frost is an experimental system supporting 12 researchers from NCAR, the University of Colorado at Boulder, and the University of Colorado at Denver who are investigating and addressing the technical obstacles to achieving practical petascale computing in geoscience, aerospace engineering, and mathematical applications. The opportunity to experiment with systems like BG/L is essential if NCAR is to maintain its ability to provide capability and capacity supercomputing to the community. Moreover, low-power systems like BG/L (only 25 kW for 5.73 TFLOPS) offer the promise of significantly reducing the strain on the NCAR Mesa Lab's computing facility.
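The peak-performance and power figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch, not an official benchmark: the processor count, peak rating, and power draw come from the text, while the 700 MHz clock rate and 4 floating-point operations per cycle per processor are assumptions about the BG/L hardware that this section does not state. The function names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the text for frost (a single BG/L rack).
COMPUTE_PROCESSORS = 2048
QUOTED_PEAK_TFLOPS = 5.73
QUOTED_POWER_KW = 25.0

# Assumptions NOT stated in the text: clock rate and flops per cycle
# for each BG/L compute processor.
ASSUMED_CLOCK_HZ = 700e6
ASSUMED_FLOPS_PER_CYCLE = 4


def peak_tflops(processors, clock_hz, flops_per_cycle):
    """Theoretical peak in TFLOPS: processors x clock x flops per cycle."""
    return processors * clock_hz * flops_per_cycle / 1e12


def mflops_per_watt(tflops, kilowatts):
    """Power efficiency in MFLOPS per watt from TFLOPS and kW."""
    return (tflops * 1e6) / (kilowatts * 1e3)


derived_peak = peak_tflops(
    COMPUTE_PROCESSORS, ASSUMED_CLOCK_HZ, ASSUMED_FLOPS_PER_CYCLE
)
efficiency = mflops_per_watt(QUOTED_PEAK_TFLOPS, QUOTED_POWER_KW)

print(f"Derived peak: {derived_peak:.2f} TFLOPS")
print(f"Efficiency:   {efficiency:.0f} MFLOPS per watt")
```

Under these assumptions the derived peak agrees with the quoted 5.73 TFLOPS, and the quoted power draw works out to roughly 229 MFLOPS per watt at peak.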
To consolidate experimental system research in CISL, the Research Systems Evaluation Team (ReSET) was formed in late 2005. The mission of ReSET is to administer and evaluate strategically selected experimental systems for the Laboratory in such a way as to gain the maximum knowledge of and impact from emerging technologies. ReSET is housed in CISL's Computational Science Section (now renamed Computer Science Section), but collaborates with staff members from other sections and groups across CISL to accomplish its mission. In mid-November 2005, frost became the first experimental system managed by ReSET.
During FY 2006, members of ReSET have worked with the frost user community to significantly increase both the number and breadth of applications capable of running on frost. One example of the impact of this effort is the scaling of POP to 28,972 processors on a BG/L system at IBM's T.J. Watson Research Center. Though frost is an experimental system, ReSET's success in expanding the user and application code base has produced system usage levels similar to those seen on the production supercomputing systems managed by CISL. In addition to providing user support, the team continues to work through the Blue Gene consortium and SP-XXL to improve the BG/L system software stack and to influence development of the software stack for the follow-on system, Blue Gene/P. One example of this effort is the collaboration with Argonne National Laboratory to further develop Cobalt, the queuing system currently used on frost, by incorporating alternate scheduling strategies.
In FY 2007, frost will move outside the UCAR security perimeter and become a TeraGrid resource. This presents numerous cyberinfrastructure integration challenges (e.g., how to integrate Cobalt, or another scheduler, into the TeraGrid's Coordinated TeraGrid Software and Services software suite) and opportunities (e.g., the ability to provide the newly acquired Lustre storage system as a tightly integrated resource to frost's user community). An additional challenge is that frost will enter its third year of service in March 2007, and planning for a successor system is needed.
This work is made possible through NSF MRI Grants CNS-0421498, CNS-0420873, and CNS-0420985 and through the IBM Shared University Research (SUR) program with the University of Colorado. NSF Core funding also supports this system.