
Supercomputing advances in FY2000

Maintaining NCAR's production supercomputer environment

The production supercomputer environment that SCD manages for NCAR has evolved over the years. Since the introduction of the four-CPU Cray X-MP in October 1986, SCD has brought NCAR's science into the multiprocessing supercomputer world; before then, all modeling was performed with serial codes. The focus since has been on redeveloping codes to harness the power of multiple CPUs in a single system and, most recently, of multiple systems.

Over the same period, SCD has deployed a series of parallel-vector processor (PVP) systems ranging from a 2-CPU Cray Y-MP to a pair of 24-CPU Cray J90se systems. Massively parallel processor (MPP) systems have included the 128-processor Cray T3D and the Thinking Machines CM2 and CM5. Most recently, distributed shared memory (DSM) systems have been deployed; these have included the Hewlett-Packard SPP-2000, the Silicon Graphics Origin2000, the IBM SP systems, and the Compaq ES40 cluster.

The following diagram shows the systems that SCD has deployed for NCAR's use since its inception. Systems shown with blue bars were deployed for production purposes; those shown in red were (or are) considered experimental.

Figure: Supercomputing systems deployed at NCAR (NCAR supercomputing history)

In 1986, with the first multiprocessor system (the Cray X-MP/4) on NCAR's floor, SCD could deliver on average approximately 0.25 GFLOPS of sustained computing capacity to NCAR's science. In the years since, that sustained capacity has grown significantly.

Figure: Computing power at NCAR in sustained GFLOPS (NCAR supercomputing productivity)

FY2000 production system overview

The production supercomputer environment changed in several ways during FY2000. Most notable was the installation of the Compaq ES40 cluster (prospect), which comprises 9 ES40 nodes, each with 4 CPUs and 4 GB of memory. The cluster was delivered to NCAR during the week of 8 November 1999, and SCD completed the installation by 12 November 1999. It passed its 30-day acceptance test on 6 April 2000 and has since served as a semi-production system for the NCAR and university communities.

SCD also installed an SGI Origin2000 (utefe) with 8 CPUs and 8 GB of memory. Utefe was deployed as a front end to the larger Origin2000 (ute, 128 CPUs), which is now used only for batch processing: all users who want to run work on ute must log in to utefe and submit their batch jobs from there. Utefe was installed on 15 October 1999 and has been serving the CSL community since.

Supercomputer systems installed during FY2000

DSM, Distributed Shared Memory systems

SCD continued to maintain and enhance its production supercomputer systems during FY2000. These included the new Distributed Shared Memory (DSM) systems installed in recent years as well as the older Parallel Vector Processor (PVP) systems.

Supercomputer systems maintained during FY2000

DSM, Distributed Shared Memory systems

PVP, Parallel Vector Processor systems

Production system performance and utilization statistics

At the end of FY2000, the "production supercomputer environment" managed by SCD for NCAR included two Cray supercomputers, two IBM supercomputers, four SGI supercomputers, and one nine-node Compaq ES40 cluster. The following tables provide average utilization and performance statistics for the supercomputer systems SCD operated in production during FY2000.

In addition, SCD publishes monthly usage reports at http://www.scd.ucar.edu/dbsg/dbs/
These reports provide summary information on system usage, project allocations and General Accounting Unit (GAU) use.

Production systems for FY2000: Average performance and utilization statistics

| System name | Hardware/#PEs | Notes | GFLOPS | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp |
|---|---|---|---|---|---|---|---|---|---|---|
| babyblue | IBM SP/40 | CSL & Community | -- | 46.5% | 46.9% | 52.2% | 0.6% | 0.2% | -- | -- |
| blackforest | IBM SP/512 | CSL & Community | -- | 62.3% | 63.3% | 35.7% | 0.9% | 0.2% | -- | -- |
| prospect | Compaq ES40 cluster/36 | Community, installed 11/12/99 | -- | 35.0% | 36.1% | 61.2% | 3.0% | -- | -- | -- |
| chipeta | Cray J90se/24 | Community | 1.531 | 89.2% | 90.3% | 6.0% | 3.6% | 0.5% | -- | -- |
| dataproc | SGI O2K/16 | CSL & Community | -- | 44.4% | 44.6% | 43.3% | 7.7% | 4.1% | 85.0% | 2.6% |
| mouache | SGI O2K/4 | CSL & Community | -- | 28.5% | 28.5% | 69.1% | 1.5% | 0.7% | 24.7% | 15.2% |
| ouray | Cray J90se/24 | Community | 1.449 | 89.9% | 91.1% | 4.2% | 4.7% | 0.6% | -- | -- |
| ute | SGI O2K/128 | CSL | -- | 70.4% | 70.8% | 26.8% | 1.7% | 0.6% | 80.6% | 1.0% |
| utefe | SGI O2K/8 | CSL, installed 10/15/99 | -- | 32.2% | 32.4% | 63.1% | 2.4% | 1.9% | 51.8% | 0.0% |

Note: "GFLOPS" is the average number of floating point operations per second (in billions) during the measuring period; "Utilz'n" is the average user utilization of the system (system downtime counts against utilization); "User" is the percent of uptime occupied in performing computation for user processes; "Idle" is the percent of uptime spent idle; "System" is the percent of uptime consumed in system overhead; "WaitIO" is the percent of uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O; and "IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.
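
The column definitions above suggest two simple consistency checks, sketched here in Python. The sketch assumes (as an interpretation of the note, not a formula stated in this report) that User, Idle, System, and WaitIO together account for roughly 100% of uptime, and that Utilz'n is approximately User scaled by the fraction of the period the system was up. The function names are hypothetical; the figures are those of the chipeta row.

```python
def uptime_breakdown_total(user, idle, system, wait_io):
    """Sum the four uptime categories (in percent); expected to be near 100."""
    return user + idle + system + wait_io


def implied_availability(utilization, user):
    """Estimate the percent of the measuring period the system was up.

    Assumes utilization = user * availability / 100. This is an
    interpretation of the table note, not a documented formula.
    """
    if user <= 0:
        raise ValueError("user percentage must be positive")
    return 100.0 * utilization / user


# chipeta row: Utilz'n 89.2, User 90.3, Idle 6.0, System 3.6, WaitIO 0.5
total = uptime_breakdown_total(90.3, 6.0, 3.6, 0.5)
availability = implied_availability(89.2, 90.3)
print(f"uptime categories sum to {total:.1f}%")
print(f"implied availability is about {availability:.1f}%")
```

Small departures from 100% in the first check are consistent with rounding in the published figures.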

Production systems decommissioned during FY2000: Average performance and utilization statistics

| System name | Hardware/#PEs | Notes | GFLOPS | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp |
|---|---|---|---|---|---|---|---|---|---|---|
| antero | Cray C90/16 | CSL, decommissioned 11/30/99 | 4.654 | 91.0% | 93.1% | 3.6% | 3.2% | 0.4% | -- | -- |
| aztec | Cray J90/20 | Community, decommissioned 6/30/2000 | 1.125 | 88.1% | 89.6% | 7.8% | 2.6% | 0.4% | -- | -- |
| paiute | Cray J90/16 | Community, decommissioned 6/30/2000 | 0.868 | 77.0% | 77.9% | 18.2% | 3.9% | 4.1% | -- | -- |

Note: "GFLOPS" is the average number of floating point operations per second (in billions) during the measuring period; "Utilz'n" is the average user utilization of the system (system downtime counts against utilization); "User" is the percent of uptime occupied in performing computation for user processes; "Idle" is the percent of uptime spent idle; "System" is the percent of uptime consumed in system overhead; "WaitIO" is the percent of uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O; and "IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.

End-FY2000 production supercomputer systems

The SCD supercomputer resources comprise two relatively separate computational facilities: the Climate Simulation Laboratory (CSL) facility and the Community Computing facility. Some systems, such as the new IBM SP systems and the "dataproc" system, are shared between the two. The following sections describe the supercomputer systems available in each facility.

CSL facility

The Climate Simulation Laboratory facility provided the following supercomputer resources at the end of FY2000:

| Allocation | System | # CPUs | GB memory | Peak GFLOPS | Notes |
|---|---|---|---|---|---|
| Dedicated | IBM SP (blackforest) | 256 | 128 | 384.00 | 512 total system batch CPUs; 256 dedicated to CSL |
| Dedicated | SGI Origin2000 (ute) | 128 | 16 | 64.0 | |
| Dedicated | SGI Origin2000 (utefe) | 8 | 8 | 4.0 | Front-end system for ute |
| Shared | IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use |
| Shared | SGI Origin2000 (dataproc) | 16 | 32 | 8.0 | Shared with Community for data analysis and post-processing applications |

Community facility

The Community facility had available the following supercomputer resources at the end of FY2000:

| Allocation | System | # CPUs | GB memory | Peak GFLOPS | Notes |
|---|---|---|---|---|---|
| Dedicated | IBM SP (blackforest) | 256 | 128 | 384.00 | 512 total system batch CPUs; 256 dedicated to Community |
| Dedicated | Cray J90se (chipeta) | 24 | 8 | 4.8 | |
| Dedicated | Cray J90se (ouray) | 24 | 8 | 4.8 | |
| Dedicated | Compaq ES40 cluster (prospect) | 36 | 36 | 72.0 | |
| Shared | IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use |
| Shared | SGI Origin2000 (dataproc) | 16 | 16 | 8.0 | Shared with CSL for data analysis and post-processing applications |
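
The peak ratings listed here can be compared with the average GFLOPS figures in the performance and utilization table. Below is a minimal Python sketch for the two Cray J90se systems, the only current systems for which this report gives both numbers. The helper name is hypothetical, and the ratio is an illustrative calculation rather than a metric the report itself publishes.

```python
def fraction_of_peak(sustained_gflops, peak_gflops):
    """Return sustained performance as a fraction of the rated peak."""
    if peak_gflops <= 0:
        raise ValueError("peak_gflops must be positive")
    return sustained_gflops / peak_gflops


# (average GFLOPS during FY2000, peak GFLOPS), both taken from the report's tables
systems = {
    "chipeta": (1.531, 4.8),
    "ouray": (1.449, 4.8),
}

efficiency = {
    name: fraction_of_peak(average, peak)
    for name, (average, peak) in systems.items()
}
for name, value in efficiency.items():
    print(f"{name}: {value:.1%} of peak")
```

Both systems average roughly 30% of their rated peak over the year, a figure that includes idle time and downtime as well as the efficiency of the codes themselves.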

FY2000 supercomputer resource changes

Additions and upgrades

During FY2000, SCD deployed two new supercomputers: one Compaq AlphaServer cluster and an SGI Origin2000. SCD also installed a Compaq XP-1000 management server for the Compaq ES40 cluster.

The IBM SP (babyblue) that was delivered to NCAR on 25 June 1999 was upgraded from 16 Winterhawk-I nodes (32 Power3 processors) and 1 GB of memory per node to 16 Winterhawk-II nodes (64 Power3 processors) and 2 GB of memory per node.

The IBM SP system (blackforest) that was delivered to NCAR on 11 August 1999 was upgraded from 144 Winterhawk-I nodes (288 Power3 processors) and 1 GB of memory per node to 151 Winterhawk-II nodes (604 Power3 processors) and 2 GB of memory per node.

Decommissionings in FY2000

The Cray C90 (antero), which had served the CSL since 01/01/97, had 16 processors and was decommissioned on 30 November 1999.

The Cray J90 (paiute), which had served the general Community since 10/01/95, had 16 processors and was decommissioned on 30 June 2000.

The Cray J90 (aztec), which had served the CSL since 10/01/95, had 20 processors and was decommissioned on 30 June 2000.

Key maintenance activities

During FY2000, SCD provided ongoing maintenance activities to ensure the integrity and reliability of existing computational systems. Some of the key areas were:

Maintain supercomputer operating systems:
SCD stays apprised of major software releases from Cray Research and carefully schedules upgrades to the production system and product set software based on the judged stability of those upgrades in the NCAR production environment. The Cray systems are considered to be in "maintenance mode," so no significant enhancements or software upgrades were undertaken in FY2000. SCD also continued to provide major system support for the SGI Origin2000, IBM SP, and Compaq ES40 cluster systems.

Maintain stability and reliability of systems:
One of the most significant attributes of the NCAR computational environment is its overall stability and reliability. For instance, the NCAR Mass Storage System has a reputation for reliability, and SCD has in the last year deployed a number of high-availability fileserver systems. This reliability and stability do not come easily; they stem from a combination of choosing reliable, stable vendor products and using proven, fail-safe system administration and maintenance techniques. SCD will continue to focus on ensuring, in whatever ways possible, highly stable and reliable systems and systems operations.

Year 2000 compliance and testing:
During FY1999, SCD engaged in a significant effort to ensure that all mission-critical resources maintained by SCD were Year-2000 compliant. The most significant objective was to ensure that all production systems would be unaffected by the transition into the next century. SCD worked with its major systems vendors to upgrade production systems' operating systems and product set software to Year-2000-compliant versions. In addition, SCD performed single- and multi-system testing of the Year-2000 compliance of those systems and of SCD-developed software and subsystems. SCD also provided system support during the clock rollover period at midnight, and there were no major problems. A few minor problems surfaced in some local scripts, all of which were fixed. Overall, the rollover into Year 2000 was very successful for SCD.

System monitoring:
Over the years, SCD has developed a large number of system monitoring procedures, techniques, and tools. During FY2000, SCD continued to draw on this collective experience, and to enhance these tools, to maintain the stability of the existing production systems through proactive monitoring. SCD also automated a number of procedures for detecting system failure or trouble. This automation was integrated with commercial alphanumeric paging technology to alert SCD operations and systems staff more rapidly and thus reduce the time that systems are unavailable to the NCAR user community when they do fail.

Automated monitoring of systems

The High Performance Systems (HPS) section of SCD maintains a suite of system-monitoring utilities (known collectively as "sysmon") on all compute servers; these utilities monitor the servers and log critical system information. Currently the sysmon software routinely sends HPS staff brief reports on system utilization, error and warning conditions, and system daemon status. This software also keeps track of MSS activities on the supercomputers and alerts HPS staff and the SCD Computer Production Group (CPG) staff when anomalous conditions occur.

Sysmon has been a very useful tool for HPS and CPG. During FY2000, HPS enhanced and further automated the operation and monitoring of the supercomputer systems and ported sysmon to the Compaq ES40 cluster (prospect), the SGI Origin2000 front end (utefe), and the Compaq XP-1000 management server for the ES40 cluster (prospectfe).

In addition, in early FY2000, the High Performance Systems section of SCD, in cooperation with CPG, developed and began the operational deployment of additional system monitoring capabilities that are integrated with commercial paging services. These additional notification capabilities have not only freed CPG staff from some of the more mundane system operation and monitoring tasks but also provide a much more timely alert to potential problems with the production supercomputers and Mass Storage System.
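
This report does not describe how sysmon is implemented. As a rough illustration of the pattern described above (routine checks that page staff when a condition looks anomalous), here is a minimal Python sketch. Every name in it (`SystemStatus`, `find_anomalies`, `notify`, the idle threshold, the daemon names) is hypothetical, and the pager is a plain callable standing in for a commercial paging service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SystemStatus:
    """One snapshot of a compute server (hypothetical structure)."""
    host: str
    idle_percent: float
    daemons_running: Dict[str, bool] = field(default_factory=dict)
    error_count: int = 0


def find_anomalies(status: SystemStatus, max_idle: float = 95.0) -> List[str]:
    """Return human-readable alerts for one snapshot; empty if healthy."""
    alerts = []
    # A production system that is almost entirely idle may have a stalled batch queue.
    if status.idle_percent > max_idle:
        alerts.append(
            f"{status.host}: idle {status.idle_percent:.1f}% exceeds {max_idle:.1f}%"
        )
    for name, running in sorted(status.daemons_running.items()):
        if not running:
            alerts.append(f"{status.host}: daemon {name} is not running")
    if status.error_count > 0:
        alerts.append(f"{status.host}: {status.error_count} logged error(s)")
    return alerts


def notify(alerts: List[str], pager: Callable[[str], None]) -> int:
    """Send each alert through the pager callable; return how many were sent."""
    for message in alerts:
        pager(message)
    return len(alerts)


# Example: collect pages in a list instead of calling a real paging service.
sent_pages: List[str] = []
snapshot = SystemStatus(
    "prospect",
    idle_percent=99.0,
    daemons_running={"batchd": False, "msread": True},
)
sent_count = notify(find_anomalies(snapshot), sent_pages.append)
```

Passing the pager in as a callable keeps the detection logic testable without a paging service; a real deployment would substitute whatever interface the paging vendor provides.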

