
History
The production supercomputer environment managed by SCD for NCAR has evolved over the years. During the last 15 years, SCD has brought NCAR's science into the multi-processing supercomputer world. Prior to the introduction of the four-CPU Cray X-MP in October 1986, all modeling was performed with serial codes. Since then, the focus has been on redeveloping codes to harness the power of multiple CPUs in a single system and, most recently, of multiple systems.
During the last 15 years, SCD has deployed a series of parallel-vector processor (PVP) systems ranging from a 2-CPU Cray Y-MP to a pair of 24-CPU Cray J90se systems. Massively parallel (MPP) systems included the Cray T3D, with 128 processors, and the Thinking Machines CM2 and CM5 systems. Most recently, distributed shared-memory (DSM) systems have been deployed; these have included the Hewlett-Packard SPP-2000 and Silicon Graphics Origin2000, the IBM SPs, and the Compaq ES40 cluster.
The following diagram shows the systems that SCD has deployed for NCAR's use since its inception. The systems shown with blue bars are those deployed for production purposes; those shown in red were (or are) considered experimental systems.
In 1986, with the first multiprocessor system (the Cray X-MP/4) on NCAR's floor, SCD could deliver on average approximately 0.25 Gflops of sustainable computing capacity to NCAR's science. In the roughly 15 years since, that sustained computing capacity has grown by more than two orders of magnitude.
![]()
FY1999 production system overview
Significant changes were made to the production supercomputer environment during FY1999. Most notable, of course, was the installation of the IBM SP systems (blackforest and babyblue). The IBM SP "babyblue" system is a single-frame, 16-node system that was delivered about six weeks before the main system and served as an invaluable learning tool for SCD and early users on all aspects of the system, from system administration and software maintenance to compiler and user-environment capabilities. The main system, the 144-node IBM SP "blackforest," was delivered on 11 August and successfully passed the 30-day acceptance testing on 3 October. Since then, blackforest and babyblue have been in production as a shared resource between the CSL and Community facilities. Detailed information on the IBM SP system is provided on the SCD website at the blackforest main page.
Supercomputer systems installed during FY1999
- DSM
  - IBM SP (blackforest), with 288 processors (256 processors reserved for batch processing, the remainder used for interactive access, I/O management and network communications), used by both Climate Simulation Laboratory and Community users
  - IBM SP (babyblue), with 32 processors (20 processors reserved for batch processing, the remainder used for interactive access, I/O management and network communications), used by SCD to test new software before placing it on the production SP system (blackforest) and also used by both Climate Simulation Laboratory and Community users for production work
SCD continued to maintain and enhance its production supercomputer systems during FY1999. These included the Distributed Shared Memory (DSM) systems installed in recent fiscal years as well as the older Parallel Vector Processor (PVP) and Massively Parallel Processor (MPP) systems. The systems in these categories were:
Supercomputer systems maintained during FY1999
- DSM
  - SGI Origin2000 (ute), with 128 processors, used in the Climate Simulation Laboratory
  - SGI Origin2000 (dataproc), with 16 processors, which replaced the old winterpark system, used by both Climate Simulation Laboratory and Community users
  - SGI Origin2000 (mouache), with 4 processors, which was used as a test platform by SCD for evaluation of new Irix systems, libraries, and compilers prior to their installation on the production SGI platforms; all interested users now have access to mouache
  - HP SPP-2000 (sioux), with 64 processors, served a small set of Community supercomputing users until it was decommissioned on 14 May
- MPP
  - Cray T3D, with 128 processors, was used in the Climate Simulation Laboratory, attached to the Cray C90 (antero), until it was decommissioned on 8 June
- PVP
  - Cray C90 (antero), with 16 processors, was used by the Climate Simulation Laboratory
  - Cray J90 (aztec), with 20 processors, was used by the Climate Simulation Laboratory
  - Cray J90 (paiute), with 16 processors, was used by the Community
  - Cray J90se (chipeta), with 24 processors, was used by the Community
  - Cray J90se (ouray), with 24 processors, was used by the Community
Production system performance and utilization statistics
At the end of FY1999, the "production supercomputer environment" managed by SCD for NCAR includes five Cray supercomputers, two IBM supercomputers, and three SGI supercomputers. The following tables provide average utilization and performance statistics for the supercomputer systems SCD operated in production during FY1999. In addition, SCD publishes monthly usage reports at http://www.scd.ucar.edu/dbsg/dbs/. These reports provide summary information on system usage, project allocations and General Accounting Unit (GAU) use.
Production systems at end of FY1999: Average performance and utilization statistics
| System name | Hardware/#PEs | Notes | Gflops | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp |
|---|---|---|---|---|---|---|---|---|---|---|
| antero | Cray C90/16 | CSL | 4.541 | 91.1% | 93.3% | 4.0% | 2.7% | 0.4% | -- | -- |
| aztec | Cray J90/20 | CSL | 1.079 | 85.7% | 87.1% | 10.2% | 2.6% | 0.5% | -- | -- |
| babyblue | IBM SP/20 | CSL & Community | -- | 71.0% | 72.5% | 25.3% | 1.7% | 0.7% | -- | -- |
| blackforest | IBM SP/256 | CSL & Community | -- | 77.5% | 78.2% | 21.1% | 0.5% | 0.2% | -- | -- |
| chipeta | Cray J90se/24 | Community | 1.608 | 92.4% | 94.0% | 2.2% | 3.7% | 0.2% | -- | -- |
| dataproc | SGI O2K/16 | CSL & Community | -- | 49.1% | 49.7% | 39.4% | 7.3% | 3.4% | 86.2% | 0.1% |
| mouache | SGI O2K/4 | CSL & Community | -- | 30.9% | 31.0% | 66.0% | 2.3% | 0.4% | 47.2% | 13.6% |
| ouray | Cray J90se/24 | Community | 1.523 | 92.2% | 93.9% | 2.6% | 3.6% | 0.3% | -- | -- |
| paiute | Cray J90/16 | Community | 0.880 | 71.2% | 72.2% | 23.7% | 4.1% | 4.5% | -- | -- |
| ute | SGI O2K/128 | CSL | -- | 72.1% | 73.2% | 24.2% | 1.8% | 0.6% | 80.9% | 3.9% |
Where "Gflops" is the average number of floating point operations per second (in billions) during the measuring period; "Utilz'n" is the average user utilization of the system (system downtime counts against utilization); "User" is the percent of uptime occupied in performing computation for user processes; "Idle" is the percent of uptime spent idle; "System" is the percent of uptime consumed in system overhead; "WaitIO" is the percent of uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O; and "IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.
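The column definitions above suggest a simple relationship between "Utilz'n" and "User": because downtime counts against utilization while "User" is measured against uptime only, the ratio of the two gives a rough estimate of system availability. The sketch below works through that arithmetic for a few rows of the table. The relationship Utilz'n = User x availability is an assumption inferred from the definitions, not something the report states, and the function name `implied_availability` is hypothetical.

```python
# (Utilz'n %, User %) pairs copied from the end-of-FY1999 table.
STATS = {
    "antero": (91.1, 93.3),
    "chipeta": (92.4, 94.0),
    "blackforest": (77.5, 78.2),
    "dataproc": (49.1, 49.7),
}


def implied_availability(utilization_pct, user_pct):
    """Estimate uptime as a percentage of wall-clock time.

    Assumption (inferred, not stated in the report):
    Utilz'n = User * availability, so availability = Utilz'n / User.
    """
    return 100.0 * utilization_pct / user_pct


for name, (util, user) in STATS.items():
    estimate = implied_availability(util, user)
    print(f"{name:12s} implied availability: about {estimate:.1f}%")
```

Under this assumption, antero's 91.1% utilization and 93.3% user time imply the system was up for roughly 97.6% of the measuring period. The published percentages are rounded, so the estimates are approximate.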
Production systems decommissioned during FY1999: Average performance and utilization statistics
| System Name | Hardware/#PEs | Notes | Gflops | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp |
|---|---|---|---|---|---|---|---|---|---|---|
| T3D | Cray T3D/128 | CSL | -- | 69.9% | 72.8% | 27.2% | -- | -- | -- | -- |
| sioux | HP SPP-2000/64 | Community | -- | 41.8% | 42.4% | 57.6% | -- | -- | -- | -- |
| winterpark | SGI PowerChallenge XL/8 | CSL & Community | -- | 29.6% | 29.8% | 40.8% | 7.7% | 20.6% | 86.5% | 12.8% |
Where "Gflops" is the average number of floating point operations per second (in billions) during the measuring period; "Utilz'n" is the average user utilization of the system (system downtime counts against utilization); "User" is the percent of uptime occupied in performing computation for user processes; "Idle" is the percent of uptime spent idle; "System" is the percent of uptime consumed in system overhead; "WaitIO" is the percent of uptime spent awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O; and "IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.
End-FY1999 production supercomputer systems
The SCD supercomputer resources consist of two relatively separate computational facilities: the Climate Simulation Laboratory (CSL) and Community facilities. Some systems, such as the new IBM SP systems and the "dataproc" system, are shared between these two facilities. The following sections describe the supercomputer systems available in these two facilities.
CSL facility
The Climate Simulation Laboratory facility had available the following supercomputer resources at the end of FY1999:
| | System | # CPUs | GB memory | Peak Gflops | Notes |
|---|---|---|---|---|---|
| Dedicated: | IBM SP (blackforest) | 128 | 64 | 102.4 | 256 total system batch CPUs; half dedicated to CSL |
| | SGI Origin2000 (ute) | 128 | 16 | 64.0 | |
| | Cray C90 (antero) | 16 | 2 | 15.3 | |
| | Cray J90 (aztec) | 20 | 4 | 4.0 | |
| Shared: | IBM SP (babyblue) | 20 | 10 | 16.0 | Shared new-release test platform; available for user use |
| | SGI Origin2000 (dataproc) | 16 | 16 | 8.0 | Shared with Community for data analysis and post-processing applications |
Community facility
The Community facility had available the following supercomputer resources at the end of FY1999:
| | System | # CPUs | GB memory | Peak Gflops | Notes |
|---|---|---|---|---|---|
| Dedicated: | IBM SP (blackforest) | 128 | 64 | 102.4 | 256 total system batch CPUs; half dedicated to Community |
| | Cray J90se (chipeta) | 24 | 8 | 4.8 | |
| | Cray J90se (ouray) | 24 | 8 | 4.8 | |
| | Cray J90 (paiute) | 16 | 2 | 3.2 | |
| Shared: | IBM SP (babyblue) | 20 | 10 | 16.0 | Shared new-release test platform; available for user use |
| | SGI Origin2000 (dataproc) | 16 | 16 | 8.0 | Shared with CSL for data analysis and post-processing applications |
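As a rough illustration of how the two facility tables compare, the following sketch totals the peak Gflops figures for each facility's dedicated systems and for the shared systems. The figures are copied from the tables; the grouping into dictionaries and the helper name `total_peak` are illustrative assumptions, and peak ratings are not a measure of sustained performance.

```python
# Peak Gflops copied from the CSL and Community facility tables.
CSL_DEDICATED = {
    "blackforest (CSL half)": 102.4,
    "ute": 64.0,
    "antero": 15.3,
    "aztec": 4.0,
}
COMMUNITY_DEDICATED = {
    "blackforest (Community half)": 102.4,
    "chipeta": 4.8,
    "ouray": 4.8,
    "paiute": 3.2,
}
SHARED = {"babyblue": 16.0, "dataproc": 8.0}


def total_peak(systems):
    """Sum the peak Gflops ratings, rounded to one decimal place."""
    return round(sum(systems.values()), 1)


for label, systems in [
    ("CSL dedicated", CSL_DEDICATED),
    ("Community dedicated", COMMUNITY_DEDICATED),
    ("Shared", SHARED),
]:
    print(f"{label}: {total_peak(systems)} peak Gflops")
```

Summing the table entries this way gives 185.7 peak Gflops dedicated to the CSL, 115.2 dedicated to the Community, and 24.0 shared between them.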
FY1999 supercomputer resource changes
Additions and upgrades
During FY1999, SCD deployed three new supercomputers: two IBM SPs and an SGI Origin2000.
The first IBM SP (babyblue) was delivered to NCAR on 25 June 1999. It is a single-frame system with 16 Winterhawk-I nodes (32 Power3 processors), 1 GB of memory per node and approximately 200 GB disk capacity. This system was used by SCD to familiarize the system administrators, consultants and other staff with the IBM SP and AIX environments and will be used to install and test future software upgrades before they're put into production on the large SP system. This system is also available for use by any user having access to the larger SP (blackforest).
The production IBM SP system (blackforest) was delivered to NCAR on 11 August 1999, and it successfully completed its 30-day Acceptance Test Period (ATP) on 3 October 1999. The system consists of 144 Winterhawk-I nodes (288 Power3 processors), 1 GB of memory per node, and approximately 2.5 TB disk capacity. Of the 144 nodes, 128 are reserved for dedicated batch work: 64 nodes (128 processors) for CSL and 64 nodes (128 processors) for the Community. The other 16 nodes are shared between CSL and the Community for interactive access, non-dedicated batch work, and for handling filesystem I/O and network communications.
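The node allocation described above can be checked with a few lines of arithmetic. The sketch below derives the processor counts from the node counts given in the text (144 nodes, 288 processors, 64 batch nodes per facility); the constant and function names are hypothetical and chosen only for this illustration.

```python
TOTAL_NODES = 144                # Winterhawk-I nodes in blackforest
TOTAL_PROCESSORS = 288           # Power3 processors in blackforest
BATCH_NODES_PER_FACILITY = 64    # dedicated batch nodes for CSL and for Community
FACILITIES = ("CSL", "Community")


def partition_blackforest():
    """Derive batch and shared processor counts from the node counts."""
    processors_per_node = TOTAL_PROCESSORS // TOTAL_NODES
    batch_nodes = BATCH_NODES_PER_FACILITY * len(FACILITIES)
    shared_nodes = TOTAL_NODES - batch_nodes
    return {
        "processors_per_node": processors_per_node,
        "batch_nodes": batch_nodes,
        "batch_processors": batch_nodes * processors_per_node,
        "shared_nodes": shared_nodes,
        "shared_processors": shared_nodes * processors_per_node,
    }


for key, value in partition_blackforest().items():
    print(f"{key}: {value}")
```

The result agrees with the figures quoted elsewhere in this report: 256 processors reserved for batch work and the remaining 32 processors (16 nodes) used for interactive access, non-dedicated batch work, filesystem I/O, and network communications.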
An SGI Origin2000/16 (dataproc) was put into production on 3 May 1999 to replace the original "DataPark" system, winterpark, which had become severely overloaded. Dataproc is available for any user having either a CSL or Community allocation, for performing data analysis and/or model output post-processing work.
Decommissionings in FY1999
The Cray T3D, which had served the CSL since July 1994, was decommissioned on 8 June 1999. It initially contained 64 processors and was upgraded to 128 processors in March 1997.
The Hewlett-Packard SPP-2000 (sioux), which SCD acquired in May 1997 as an evaluation system and transitioned into the Community facility for production work in early 1998, was decommissioned on 14 May 1999.
The SGI PowerChallenge XL/8 (winterpark), which had served as a proof-of-concept system, called "DataPark" by SCD, for an interactive platform for data processing and model post analysis, was decommissioned on 3 May 1999 and replaced by a 16-processor SGI Origin2000 (dataproc).
Key maintenance activities
During FY1999, SCD provided ongoing maintenance activities to ensure the integrity and reliability of existing computational systems. Some of the key areas were:
- Maintain supercomputer operating systems:
The Cray systems were upgraded to the latest release of UNICOS (10.0.0.3) and I/O subsystem software in the early fall. SCD intends to stay apprised of major software releases from Silicon Graphics/Cray Research and to schedule upgrades to the production system and product set software carefully, based on the judged stability of those upgrades in the NCAR production environment. However, the Cray systems are now considered to be in "maintenance mode," so no significant enhancements or software upgrades will be undertaken. SCD plans to decommission the older Cray systems during FY2000: the Cray C90 (antero) will be decommissioned on 30 November 1999, the Cray J90 (aztec) on 31 March 2000 and the Cray J90 (paiute) on 30 June 2000. The two Community J90se systems (chipeta and ouray) will be retained in production through FY2000.
- Maintain stability and reliability of systems:
One of the most significant attributes of the NCAR computational environment is its overall stability and reliability. For instance, the NCAR Mass Storage System has a reputation for reliability, and SCD has in the last year deployed a number of high-availability fileserver systems. This reliability and stability does not come easily; it stems from a combination of choosing reliable, stable vendor products and using proven, fail-safe system administration and maintenance techniques. SCD will continue to focus on ensuring, in whatever ways possible, highly stable and reliable systems and systems operations.
- Year 2000 compliance and testing:
During FY1999, SCD undertook a significant effort to ensure that all mission-critical resources maintained by SCD are Year-2000 compliant. The most significant objective was to ensure that all production systems are unaffected by the transition into the next century. SCD has been working with its major systems vendors to upgrade production systems' operating systems and product set software to Year-2000-compliant versions. In addition, SCD performed single- and multi-system testing of Year-2000 compliance of those systems and SCD-developed software and subsystems. More information appears in the Year 2000 planning and testing report.
- System monitoring:
Over the years, SCD has developed a large number of system monitoring procedures, techniques, and tools. SCD continued to draw on this collective experience to maintain the stability of the existing production systems through proactive monitoring. SCD also continued to enhance its monitoring tools, techniques, and procedures, and automated a number of procedures for detecting system failure or trouble. This automation has been integrated with commercial alphanumeric paging technology to provide more rapid alert mechanisms to SCD operations and systems staff and thus reduce the amount of time that systems are unavailable to the NCAR user community when they do fail.
- High-speed communications:
SCD has deployed a high-speed HiPPI data communications "fabric" within the NCAR computer center; in addition, more traditional networking capabilities, including ATM, FDDI, and Ethernet technologies, have become the mainstay for connectivity between computational systems at NCAR and its divisions and remote departments. All of these systems are routinely monitored and maintained. SCD will continue to provide these stable, high-speed communications interfaces and enhance them with new technologies as those technologies prove their reliability and stability within our environment. Furthermore, as SCD enhances its computational capabilities, it will maintain and enhance its high-bandwidth connections between the NCAR MSS and the supercomputers to ensure that the systems are balanced and capable of satisfying the loads imposed by the growing need for both computational and storage capacities. In addition, during FY1999 SCD deployed new high-speed data and communications technologies, such as Fibre Channel and Gigabit Ethernet, as replacements for older, less-supported technologies.
Automated monitoring of systems
The High Performance Systems (HPS) section of SCD maintains a suite of system-monitoring utilities (known collectively as "sysmon") on all compute servers; these utilities monitor the servers and log critical system information. Currently the sysmon software routinely sends HPS members brief reports on system utilization, error and warning conditions, and system daemon status. This software also keeps track of MSS activities on the supercomputers and alerts HPS staff and the SCD Computer Production Group (CPG) staff when anomalous conditions occur.
Sysmon has been a very useful tool for HPS and CPG. HPS enhanced and further automated the operation and monitoring of supercomputer systems and ported "sysmon" to the IBM SP systems during FY1999.
In addition, in early FY1999, the High Performance Systems section of SCD, in cooperation with CPG, developed and began the operational deployment of additional system monitoring capabilities that are integrated with commercial paging services. These additional notification capabilities have not only freed CPG staff from some of the more mundane system operation and monitoring tasks, but they provide a much more timely alert mechanism to potential problems with the production supercomputers, Mass Storage System, and SCD server systems.
Dedicated data processing platforms
During FY1997, SCD established a prototype "DataPark" system. It consisted of a Silicon Graphics Challenge and a Silicon Graphics Power Challenge computer. In FY1998, SCD augmented these systems, and as technologies changed and user interest in dedicated data-processing platforms grew, the DataPark concept evolved. During FY1999, the file services portion of the DataPark concept was integrated into the future plans for the NCAR Mass Storage System (see the MSS Roadmap), and the data-processing engine portion took on a life of its own.
SCD's deployment of a small supercomputer system dedicated to data analysis and model output post-processing has been well-received. The initial system was an eight-processor SGI PowerChallenge XL (winterpark). It served as a proof-of-concept system for an interactive platform for data processing and model post analysis. It was decommissioned on 3 May 1999 and replaced by a 16-processor SGI Origin2000 (dataproc) with 16 GB of memory and over 1 TB of disk capacity. Dataproc is available for any user having either a CSL or Community allocation, for performing data analysis and/or model output post-processing work.
Because of the heavy usage of the existing dataproc system and keen interest among NCAR divisions, SCD is considering deploying "divisional dataproc" systems. In addition, SCD has begun discussions with member universities to deploy, administer, and operate similar systems for those universities within the SCD computer facility. We hope that FY2000 will see a proliferation of these types of systems.