Production Linux cluster accomplishments

SCD deployed three new Linux-based supercomputer clusters in FY2005, continuing its aggressive evaluation and deployment of potentially more cost-effective new computing technologies as part of its five-year strategic plan.

Lightning
The CAM and POP benchmarks demonstrated that lightning will outperform bluesky by a factor of 1.3 or more on a per-processor basis. Lightning was placed into production in FY2005 after many months of working with the vendor to stabilize the system software, working with users to develop a usable user environment, and helping users port their applications to the Linux operating system.

Pegasus
The pegasus system is identical to lightning but only one-half the size, enabling lightning to be a backup system during production AMPS forecasting runs. Pegasus is targeted to begin production AMPS runs in early FY2006.

Linux

Lightning and pegasus are SCD's first Linux-based supercomputers, providing SCD an opportunity to work with the Linux operating system, the Myrinet interconnect technology, and IBM's GPFS parallel filesystem in a Linux environment. Previous experience with GPFS was on the IBM SP-cluster supercomputers blackforest and bluesky running the AIX operating system.

Supporting a new operating system presents a number of challenges in both the systems administration and user support arenas. Introducing a new operating system into the SCD supercomputing environment is not without precedent. Moves from the Cray Operating System to UNICOS, then to various vendor-supplied Unix systems, occurred over the past 20 years. In all cases, SCD rose to the challenge and facilitated the user transitions.

The introduction of Linux, however, brought many new challenges due to the "openness" of the operating system. Vendor-supplied proprietary operating systems and system software limit the choices, and most of the software stack and user environment are fully integrated and supported by the vendor. Running applications within a group of machines from the same vendor that are running the same operating system is relatively easy, because the same software is generally available on all the machines. This is not the case with Linux, where the greater number of software stack choices complicates the successful integration and support of the software. SCD successfully completed the system software and user environment integration, but doing so required a significant effort.
The Supercomputing Services Group (SSG) and Consulting Services Group (CSG) worked closely with multiple vendors and the user community to establish a stable, usable production environment and to port key applications to lightning. Establishing the user environment created new challenges. Linux opened the door to a plethora of third-party software including compilers. A new batch scheduling package is used in place of a locally written one. The user community requested both 32-bit and 64-bit compiler support. New training classes and documentation were created. And the list goes on. Through enormous efforts from SSG and CSG, lightning was placed into production in FY2005 and is now running at over 70% utilization.
Coral

The Institute for Mathematics Applied to the Geosciences (IMAGe) requires computing resources to perform geostatistics analyses, data assimilation calculations, and turbulence simulations. Prior to the creation of IMAGe in 2004, the programmatic elements that would become IMAGe used a variety of computational resources scattered across multiple science divisions within NCAR.

In the fall of 2004, after the integration of IMAGe with SCD to form the Computational and Information Systems Laboratory (CISL), a survey of IMAGe computational needs led SCD and IMAGe to collectively realize that, by pooling software engineering and system administration resources, a small new cluster, with perhaps 32 processors, could be acquired. This cluster would serve both to consolidate IMAGe's computing assets and to better match its evolving computational requirements. Subsequently, SCD technical staff began to work closely with IMAGe scientists to develop the requirements for a new cluster platform in greater detail.

As the plan developed, it became increasingly clear that the Institute's turbulence visualization activities matched well with the development activities of the SCD component of the VAPOR project, an NSF Information Technology Research (ITR) grant aimed at advancing the state of the art in time-varying data analysis. The strengths of combining the two activities were obvious and compelling. Computationally, both the Data Assimilation Initiative in IMAGe and the VAPOR project in SCD had a high degree of familiarity with commodity Intel hardware and Intel/Linux-based compilers and visualization tools. Accordingly, the Intel® Extended Memory 64 Technology (EM64T, or "Nocona") processors were selected for the IMAGe cluster.
Additionally, the demanding inter-processor communication bandwidth requirements of turbulence simulation, as well as SCD's interest in evaluating new interconnect technology, argued for including a high-bandwidth commodity interconnect solution, such as InfiniBand®. The other aspects of the system design, such as memory size and attached RAID disk storage, were sized to accommodate the demanding requirements of turbulence simulation.

The cluster scientific requirements were translated into a system technical description, and an RFP was released in December 2004. Responses from a variety of vendors were collected and evaluated. Negotiations were conducted with Aspen Systems, Inc., and an award was made on March 31, 2005. The vendor began constructing the cluster for delivery to NCAR by mid-May 2005.

Two factors delayed delivery of the system into the summer of 2005. The first was the availability of system components in the vendor's supply chain, which was eventually resolved in mid-July. The second was a technical issue that arose as the cluster came together: disappointing performance of the InfiniBand® network interface cards when plugged into EM64T motherboards with PCI-X I/O bus technology. Ultimately, satisfactory InfiniBand® interconnect performance could only be achieved by a wholesale replacement of these motherboards with ones having PCI Express® I/O bus technology. After this refitting of the system was done, asymptotic bandwidths of 980 MB/sec and zero-byte latencies of 3.9 µsec were observed using OSU ping-pong tests.

After all technical issues related to performance and preparation of the cluster for delivery had been addressed, the IMAGe cluster, named coral, was delivered to NCAR on August 16, 2005 and passed acceptance testing on September 3, 2005. The system was then turned over to SCD/IMAGe system administration staff for on-site configuration prior to making it available to scientific users.
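To put the reported interconnect figures in context, the following is a minimal sketch of how a zero-byte latency and an asymptotic bandwidth combine into an estimated message transfer time. It assumes a simple linear latency-plus-bandwidth model and that "MB" means 10^6 bytes; neither assumption comes from the report, and this is not the OSU benchmark itself. The function names are hypothetical and used only for illustration.

```python
# Illustrative linear model of point-to-point message cost:
#   time(n) = latency + n / bandwidth
# The two constants are the figures reported for coral after the
# PCI Express refit. The linear model itself is an assumption.

LATENCY_S = 3.9e-6       # zero-byte latency, 3.9 microseconds
BANDWIDTH_BPS = 980e6    # asymptotic bandwidth, assuming 1 MB = 10**6 bytes


def transfer_time(nbytes: int) -> float:
    """Estimated one-way time in seconds for a message of nbytes."""
    return LATENCY_S + nbytes / BANDWIDTH_BPS


def effective_bandwidth(nbytes: int) -> float:
    """Estimated achieved bandwidth in bytes/sec for a message of nbytes."""
    return nbytes / transfer_time(nbytes)


def half_bandwidth_size() -> float:
    """Message size in bytes at which half the asymptotic bandwidth is reached."""
    return LATENCY_S * BANDWIDTH_BPS


if __name__ == "__main__":
    for n in (1_024, 65_536, 1_048_576):
        mb_per_s = effective_bandwidth(n) / 1e6
        print(f"{n:>9} bytes: {mb_per_s:7.1f} MB/sec estimated")
    print(f"half-bandwidth message size: {half_bandwidth_size():.0f} bytes")
```

Under this model, small messages are dominated by latency and large messages approach the asymptotic bandwidth, which is why both figures are normally quoted together.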