
High performance computing

Maintaining NCAR's production supercomputer environment

The production supercomputer environment managed by SCD for NCAR has evolved over the years. During the last 19 years, SCD has brought NCAR's science into the multiprocessing supercomputer world. Prior to the introduction of the four-CPU Cray X-MP in October 1986, all modeling was performed with serial codes. Since then, the focus has been on redeveloping codes to harness the power of multiple CPUs in a single system, and most recently, in multiple systems.

During the last 19 years, SCD has deployed a series of Parallel Vector Processor (PVP) systems ranging from a 2-CPU Cray Y-MP to a pair of 24-CPU Cray J90se systems. Massively Parallel Processing (MPP) systems included the Cray T3D with 128 processors and the Thinking Machines CM2 and CM5 systems. Most recently, Distributed Shared Memory (DSM) systems have been deployed; these include the Hewlett-Packard SPP-2000, SGI Origin2000, Compaq ES40 cluster, SGI Origin3800, and the IBM SP POWER3 and POWER4 systems.

The following diagram shows the systems that SCD has deployed for NCAR's use since its inception. Systems shown with blue bars were deployed for production purposes; those shown in red were (or are) considered experimental systems.

History of supercomputing at NCAR

Supercomputing systems deployed at NCAR

In 1986, with the first multiprocessor system (the Cray X-MP/4) on NCAR's floor, SCD could deliver on average approximately 0.25 GFLOPS of sustainable computing capacity to NCAR's science. In the roughly 19 years since, that sustained computing capacity has grown significantly.

Computing power at NCAR

FY2003 production system overview

In October 2002, at the start of FY2003, SCD took delivery of Phase II of NCAR's Advanced Research Computing System (ARCS). The delivery added a complete new IBM Cluster 1600 system, called bluesky, to the SCD computational environment.

The bluesky system introduced IBM's next-generation POWER4 processor. Both the POWER3 and POWER4 processors can perform 4 floating-point operations per clock tick. The POWER4 runs at a clock speed of 1.3 gigahertz, substantially faster than the 375-megahertz POWER3 processor. Each POWER4 processor therefore has a peak computation rate of 5.2 gigaflops (4 operations per tick at 1.3 gigahertz).

As delivered in Phase II, the bluesky system consisted of 38 Regatta-H Turbo frames, each with 32 POWER4 processors. Thus the bluesky system had a total of 1,216 POWER4 processors and a peak computation rate of 6.323 teraflops.
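The peak figures above follow from simple arithmetic: the peak rate of one processor is its floating-point operations per clock tick multiplied by its clock rate, and the system peak is that value multiplied by the processor count. A minimal Python sketch, using only the numbers quoted in this section (the function and variable names are illustrative and not part of any SCD tool):

```python
def peak_gflops(flops_per_tick: int, clock_ghz: float) -> float:
    """Peak rate of a single processor, in GFLOPS."""
    return flops_per_tick * clock_ghz


power3 = peak_gflops(4, 0.375)  # 375 MHz POWER3
power4 = peak_gflops(4, 1.3)    # 1.3 GHz POWER4

frames = 38                     # Regatta-H Turbo frames in Phase II
procs_per_frame = 32
total_procs = frames * procs_per_frame
system_peak_tflops = total_procs * power4 / 1000.0

print(f"POWER3 peak: {power3:.1f} GFLOPS")
print(f"POWER4 peak: {power4:.1f} GFLOPS")
print(f"bluesky Phase II: {total_procs} processors, "
      f"{system_peak_tflops:.3f} TFLOPS peak")
```

This reproduces the 5.2 gigaflops per POWER4 processor, the 1,216 processors, and the 6.323 teraflops quoted above.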

Also in FY2003, Phase III of the current Advanced Research Computing System (ARCS) was delivered to NCAR. This was an expansion of the IBM cluster (bluesky) by 14 32-way p690 SMP servers (i.e., frames), each based on the POWER4 microprocessor operating at a clock frequency of 1.3 GHz. Each server included 64 GB of memory. The system expansion also included 10.5 TB of formatted disk storage, which was added to the existing disk subsystem, thereby increasing bluesky's total disk capacity to 31 TB. Of the 14 servers, only 12 were added to bluesky; the remaining two are temporarily being used for a special SCD testbed project. Bluesky now consists of 50 POWER4 Regatta-H Turbo frames (the original 38 plus the 12 added), making it the single largest system of this type in the world.
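The post-expansion configuration can be checked with the same kind of arithmetic. The sketch below uses only figures quoted in this section; the resulting peak rate and the pre-expansion disk capacity are derived values that this report does not state directly, and the variable names are illustrative.

```python
initial_frames = 38            # Phase II delivery
added_frames = 12              # Phase III servers added to bluesky (2 of 14 held back)
procs_per_frame = 32
power4_peak_gflops = 4 * 1.3   # 4 floating-point operations per tick at 1.3 GHz

frames = initial_frames + added_frames
processors = frames * procs_per_frame
peak_tflops = processors * power4_peak_gflops / 1000.0  # derived, not from the report
disk_before_tb = 31.0 - 10.5                            # derived, not from the report

print(f"{frames} frames, {processors} processors")
print(f"{peak_tflops:.2f} TFLOPS peak, "
      f"{disk_before_tb:.1f} TB of disk before the expansion")
```

The result of 50 frames and 1,600 processors agrees with the processor count listed for bluesky later in this section.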

Initially, the 12 additional 32-way p690 SMP servers will be used to expand the use of CCSM for contributions to the IPCC process, as noted in SCD's Annual Budget Review. The installation of the bluesky system and its subsequent augmentation have doubled the capacities of both the Climate Simulation Laboratory and Community computing.

Further, several major system software upgrades were performed on all supercomputers.

Supercomputer systems maintained during FY2003

DSM systems:

  • SGI Origin2100 (chinookfe), with 8 processors, was used in the Climate Simulation Laboratory.
  • SGI Origin3800 (chinook), with 128 processors, was used in the Climate Simulation Laboratory.
  • SGI Origin2000 (dataproc), with 16 processors, was used by both Climate Simulation Laboratory and Community users.
  • SGI Origin2000 (mouache), with 4 processors, was used as a test platform by SCD for evaluation of new IRIX systems, libraries, and compilers prior to their installation on the production SGI platforms; all interested users now have access to mouache.
  • IBM SP (babyblue), with 64 processors, was shared by the Climate Simulation Laboratory and the Community.
  • IBM SP (blackforest), with 1,308 processors, was shared by the Climate Simulation Laboratory and the Community.
  • IBM NightHawk2 (dave), with 16 processors, was shared by the Climate Simulation Laboratory and the Community.
  • IBM p690 Regatta (bluedawn), with 16 processors, was used as a test and development platform for the integration of the IBM POWER4 Cluster 1600.
  • IBM Cluster 1600 (bluesky), with 1,600 processors, was shared by the Climate Simulation Laboratory and the Community.

Production system performance and utilization statistics

At the end of FY2003, the production supercomputer environment managed by SCD for NCAR included five IBM supercomputers and four SGI supercomputers. The following tables provide average utilization and performance statistics for the production supercomputer systems SCD operated in FY2003.

In addition, SCD publishes monthly usage reports at http://www.scd.ucar.edu/dbsg/dbs/. These reports provide summary information on system usage, project allocations, and General Accounting Unit (GAU) use.


Production systems for FY2003: Average performance and utilization statistics

System name | Hardware/#PEs | Notes | GFLOPs | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp
babyblue | IBM SP/48 | CSL and Community | 1.17 | 31.9% | 24.7% | 73.8% | 0.8% | 0.0% | -- | --
blackforest | IBM SP/1128 | CSL and Community | 60.5 | 73.4% | 61.1% | 37.3% | 1.3% | 0.0% | -- | --
blackforest | IBM NH2/32 | CSL and Community | 2.10 | 53.0% | 45.5% | 53.6% | 0.8% | 0.0% | -- | --
bluedawn | IBM p690/16 | Test and Development | 0.59 | 13.9% | 11.9% | 86.6% | 1.0% | 0.0% | -- | --
bluesky B32Way | IBM p690/800 | CSL and Community | 39.6 | 66.7% | 56.1% | 42.7% | 1.1% | 0.0% | -- | --
bluesky B8Way | IBM p690/608 | CSL and Community | 110.7 | 79.5% | 63.8% | 34.9% | 1.2% | 0.0% | -- | --
dave | IBM NH2/16 | CSL and Community | 0.11 | 10.1% | 10.1% | 86.6% | 2.5% | 0.2% | -- | --
dataproc | SGI O2K/16 | CSL and Community | -- | 21.3% | 21.5% | 72.4% | 4.7% | 1.2% | 85.3% | 0.2%
mouache | SGI O2K/4 | CSL and Community | -- | 1.3% | 1.3% | 97.0% | 0.3% | 0.6% | 45.4% | 35.4%
chinook | SGI O3K/128 | CSL | -- | 70.3% | 70.6% | 28.0% | 1.2% | 0.0% | 75.8% | 0.0%
chinookfe | SGI O2K/8 | CSL | -- | 32.6% | 32.7% | 63.3% | 3.6% | 0.2% | 77.3% | 2.7%

  Note: "Utilz'n" is the average user utilization of the system (system downtime counts against utilization); "User" is the percent of uptime occupied in performing computation for user processes; "Idle" is the percent of uptime spent idle; "System" is the percent of uptime consumed in system overhead; "WaitIO" is the percent of uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O; and "IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.
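Because "IOfs" and "IOswp" are expressed as shares of the WaitIO time rather than of uptime, they must be multiplied by WaitIO to be compared with the other columns. A minimal Python sketch of that conversion, applied to the dataproc row of the table (the function name is illustrative, not an SCD tool):

```python
def io_breakdown(wait_io_pct: float, io_fs_pct: float, io_swp_pct: float):
    """Convert IOfs/IOswp (percent of WaitIO time) into percent of uptime."""
    fs_of_uptime = wait_io_pct * io_fs_pct / 100.0
    swp_of_uptime = wait_io_pct * io_swp_pct / 100.0
    return fs_of_uptime, swp_of_uptime


# dataproc: WaitIO 1.2%, IOfs 85.3%, IOswp 0.2%
fs, swp = io_breakdown(1.2, 85.3, 0.2)
print(f"filesystem I/O: {fs:.2f}% of uptime")
print(f"swapping/paging: {swp:.4f}% of uptime")
```

For dataproc this gives roughly 1.02% of uptime spent waiting on user filesystem I/O, and a negligible share spent on swapping or paging.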

End-FY2003 production supercomputer systems

The SCD supercomputer resources are divided between two separate computational facilities: the Climate Simulation Laboratory (CSL) and the Community Computing facility. Some systems, such as the IBM SP systems, the dave system, and the dataproc system, are shared between these two facilities. The following sections describe the supercomputer systems available in each facility.

CSL facility

The Climate Simulation Laboratory facility provided the following supercomputer resources at the end of FY2003:


Climate Simulation Lab facility, FY2003 configuration

System | # CPUs | GB memory | Peak GFLOPS | Notes
Dedicated: IBM SP (blackforest) | 560 | 280 | 840.0 | 1,120 total system batch CPUs; 560 dedicated to CSL
Dedicated: IBM SP (bluesky) | 704 | 1408 | 3660.8 | 1,408 total system batch CPUs; 704 dedicated to CSL
Dedicated: SGI Origin3800 (chinook) | 124 | 64 | 124.0 | 124 CPUs dedicated to CSL
Dedicated: SGI Origin2100 (chinookfe) | 8 | 8 | 4.0 | Front-end system for chinook
Shared: IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use
Shared: SGI Origin2000 (dataproc) | 16 | 32 | 8.0 | Shared with Community for data analysis and post-processing applications
Shared: IBM NightHawk2 (dave) | 16 | 32 | 24.0 | Shared with Community for data analysis and post-processing applications

Community Computing facility

The Community Computing facility provided the following supercomputer resources at the end of FY2003:


Community Computing facility, FY2003 configuration

System | # CPUs | GB memory | Peak GFLOPS | Notes
Dedicated: IBM SP (blackforest) | 560 | 280 | 840.0 | 1,120 total system batch CPUs; 560 dedicated to Community
Dedicated: IBM SP (bluesky) | 704 | 1408 | 3660.8 | 1,408 total system batch CPUs; 704 dedicated to Community
Shared: IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use
Shared: SGI Origin2000 (dataproc) | 16 | 16 | 8.0 | Shared with CSL for data analysis and post-processing applications
Shared: IBM NightHawk2 (dave) | 16 | 32 | 24.0 | Shared with CSL for data analysis and post-processing applications
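The "Peak GFLOPS" column in the facility tables can be related back to the CPU counts. The Python sketch below, which is illustrative only, takes the Community Computing rows as listed and derives the per-CPU peak implied by each row, plus the combined peak of the dedicated systems (a derived total that the report does not state).

```python
# (system, dedicated to Community?, CPUs, peak GFLOPS) from the table above
rows = [
    ("blackforest", True, 560, 840.0),
    ("bluesky", True, 704, 3660.8),
    ("babyblue", False, 48, 72.0),
    ("dataproc", False, 16, 8.0),
    ("dave", False, 16, 24.0),
]

# Per-CPU peak implied by each row
per_cpu = {name: peak / cpus for name, _, cpus, peak in rows}
for name, value in per_cpu.items():
    print(f"{name:12s} {value:.1f} GFLOPS per CPU")

# Combined peak of the systems dedicated to the Community facility
dedicated_peak = sum(peak for _, dedicated, _, peak in rows if dedicated)
print(f"Dedicated Community peak: {dedicated_peak:.1f} GFLOPS")
```

The IBM POWER3 rows come out at 1.5 GFLOPS per CPU and the bluesky row at 5.2 GFLOPS per CPU, matching 4 floating-point operations per tick at 375 MHz and 1.3 GHz respectively.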

Key maintenance activities

During FY2003, SCD provided ongoing maintenance activities to ensure the integrity and reliability of existing computational systems and improved the quality of service to the NCAR user community. Some of the key areas were:

Maintain supercomputer operating systems
SCD stayed apprised of major software releases from IBM and carefully scheduled upgrades to the production systems and product set software based on the judged stability of those upgrades in the NCAR production environment. SCD also continued to provide major system support for the SGI Origin3800 and Origin2000 systems.

Maintain stability and reliability of systems
One of the most significant attributes of the NCAR computational environment is its overall stability and reliability. For instance, the NCAR Mass Storage System has a reputation for reliability, and in the last year SCD deployed a number of high-availability fileserver systems. This reliability and stability do not come easily; they stem from a combination of choosing reliable, stable vendor products and using proven, fail-safe system administration and maintenance techniques. SCD will continue to focus on ensuring, in whatever ways possible, highly stable and reliable systems and systems operations.

System monitoring
Over the years, SCD has developed a large number of system monitoring procedures, techniques, and tools. During FY2003, SCD continued to enhance these and to apply its collective experience to maintain the stability of the existing production systems through proactive monitoring. SCD also automated a number of procedures for detecting system failure or trouble. This automation was integrated with commercial alphanumeric paging technology to alert SCD operations and systems staff more rapidly, and thus reduce the amount of time that systems are unavailable to the NCAR user community when they do fail.
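As a rough illustration of the pattern described above (automated failure detection feeding a paging service), the following Python sketch runs a set of health probes and sends an alert for each failure. It is not SCD's actual tooling, which this report does not describe in detail; the function names, probe functions, and message format are all hypothetical.

```python
from typing import Callable, Dict, List


def check_systems(probes: Dict[str, Callable[[], bool]],
                  send_page: Callable[[str], None]) -> List[str]:
    """Run each probe once and page staff for every system that fails."""
    failed = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False  # a probe that raises is treated as a failure
        if not healthy:
            failed.append(name)
            send_page(f"ALERT: {name} failed its health check")
    return failed


# Usage with stand-in probes and a stand-in pager that records messages
pages: List[str] = []
failed = check_systems(
    {"bluesky": lambda: True, "chinook": lambda: False},
    pages.append,
)
print(failed)
print(pages)
```

Passing the pager in as a function keeps the detection logic independent of the alert channel, so the same check could feed a paging gateway, email, or a log.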
