
Maintaining NCAR's production supercomputer environment

History

The production supercomputer environment managed by SCD for NCAR has evolved over the years. During the last 15 years, SCD has brought NCAR's science into the multi-processing supercomputer world. Prior to the introduction of the four-CPU Cray X-MP in October 1986, all modeling was performed with serial codes. Since then, the focus has been on redeveloping codes to harness the power of multiple CPUs in a single system and, most recently, of multiple systems.

Over that period, SCD has deployed a series of parallel-vector processor (PVP) systems ranging from a 2-CPU Cray Y-MP to a pair of 24-CPU Cray J90se systems. Massively parallel processor (MPP) systems included the 128-processor Cray T3D and the Thinking Machines CM2 and CM5 systems. Most recently, SCD has deployed distributed shared-memory (DSM) systems: the Hewlett-Packard SPP-2000, the Silicon Graphics Origin2000, the IBM SPs, and the Compaq ES40 cluster.

The following diagram shows the systems that SCD has deployed for NCAR's use since its inception. Systems shown with blue bars were deployed for production purposes; those shown in red were (or are) considered experimental.

In 1986, with the first multiprocessor system (the Cray X-MP/4) on NCAR's floor, SCD could deliver on average approximately 0.25 Gflops of sustainable computing capacity to NCAR's science. In the roughly 15 years since, that sustained computing capacity has grown by more than two orders of magnitude.

FY1999 production system overview

The production supercomputer environment changed significantly during FY1999. Most notable, of course, was the installation of the IBM SP systems (blackforest and babyblue). The IBM SP "babyblue" system is a single-frame, 16-node system that was delivered about six weeks before the main system. It served as an invaluable learning tool for SCD and early users on all aspects of the system, from system administration and software maintenance to compiler and user-environment capabilities. The main system, the 144-node IBM SP "blackforest," was delivered on 11 August and successfully passed its 30-day acceptance test on 3 October. Since then, blackforest and babyblue have been in production as a shared resource between the CSL and Community facilities. Detailed information on the IBM SP system is provided on the SCD website at the blackforest main page.

Supercomputer systems installed during FY1999

DSM
IBM SP (blackforest), with 288 processors (256 processors reserved for batch processing, the remainder used for interactive access, I/O management and network communications), used by both Climate Simulation Laboratory and Community users

IBM SP (babyblue), with 32 processors (20 processors reserved for batch processing, the remainder used for interactive access, I/O management and network communications), used by SCD to test new software before placing it on the production SP system (blackforest) and also used by both Climate Simulation Laboratory and Community users for production work

SCD continued to maintain and enhance its existing production supercomputer systems during FY1999. These included the Distributed Shared Memory (DSM) systems installed in recent fiscal years as well as the older Parallel Vector Processor (PVP) and Massively Parallel Processor (MPP) systems. The systems in these categories were:

Supercomputer systems maintained during FY1999

DSM
SGI Origin2000 (ute), with 128 processors, used in the Climate Simulation Laboratory

SGI Origin2000 (dataproc), with 16 processors, which replaced the old winterpark system, used by both Climate Simulation Laboratory and Community users

SGI Origin2000 (mouache), with 4 processors, which was used as a test platform by SCD for evaluation of new Irix systems, libraries, and compilers prior to their installation on the production SGI platforms; all interested users now have access to mouache

HP SPP-2000 (sioux), with 64 processors, served a small set of Community supercomputing users until it was decommissioned on 14 May

MPP
Cray T3D, with 128 processors, was used in the Climate Simulation Laboratory, attached to the Cray C90 (antero), until it was decommissioned on 8 June

PVP
Cray C90 (antero), with 16 processors, was used by the Climate Simulation Laboratory

Cray J90 (aztec), with 20 processors, was used by the Climate Simulation Laboratory

Cray J90 (paiute), with 16 processors, was used by the Community

Cray J90se (chipeta), with 24 processors, was used by the Community

Cray J90se (ouray), with 24 processors, was used by the Community

Production system performance and utilization statistics

At the end of FY1999, the "production supercomputer environment" managed by SCD for NCAR included five Cray supercomputers, two IBM supercomputers, and three SGI supercomputers. The following tables provide average utilization and performance statistics for the supercomputer systems SCD operated in production during FY1999.

In addition, SCD publishes monthly usage reports at http://www.scd.ucar.edu/dbsg/dbs/. These reports provide summary information on system usage, project allocations and General Accounting Unit (GAU) use.

 

Production systems at end of FY1999: Average performance and utilization statistics

System name  Hardware/#PEs  Notes                                      Gflops  Utilz'n  User   Idle   System  WaitIO  IOfs   IOswp
antero       Cray C90/16    CSL                                        4.541   91.1%    93.3%  4.0%   2.7%    0.4%    --     --
aztec        Cray J90/20    CSL                                        1.079   85.7%    87.1%  10.2%  2.6%    0.5%    --     --
babyblue     IBM SP/20      CSL & Community; installed 25 June 1999    --      71.0%    72.5%  25.3%  1.7%    0.7%    --     --
blackforest  IBM SP/256     CSL & Community; installed 11 August 1999  --      77.5%    78.2%  21.1%  0.5%    0.2%    --     --
chipeta      Cray J90se/24  Community                                  1.608   92.4%    94.0%  2.2%   3.7%    0.2%    --     --
dataproc     SGI O2K/16     CSL & Community; new 3 May 1999            --      49.1%    49.7%  39.4%  7.3%    3.4%    86.2%  0.1%
mouache      SGI O2K/4      CSL & Community                            --      30.9%    31.0%  66.0%  2.3%    0.4%    47.2%  13.6%
ouray        Cray J90se/24  Community                                  1.523   92.2%    93.9%  2.6%   3.6%    0.3%    --     --
paiute       Cray J90/16    Community                                  0.880   71.2%    72.2%  23.7%  4.1%    4.5%    --     --
ute          SGI O2K/128    CSL                                        --      72.1%    73.2%  24.2%  1.8%    0.6%    80.9%  3.9%

Column definitions:

"Gflops" is the average number of floating-point operations per second (in billions) during the measuring period.

"Utilz'n" is the average user utilization of the system (system downtime counts against utilization).

"User" is the percent of uptime occupied in performing computation for user processes.

"Idle" is the percent of uptime spent idle.

"System" is the percent of uptime consumed in system overhead.

"WaitIO" is the percent of uptime spent awaiting I/O completion.

"IOfs" is the percent of the WaitIO time spent in performing user filesystem I/O.

"IOswp" is the percent of the WaitIO time spent in performing process swapping/paging.
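The relationship between these columns can be illustrated with a short calculation. The sketch below rests on an assumed reading of the definitions: because "Utilz'n" counts downtime against the system while "User" is measured against uptime only, the ratio of the two is taken as the implied share of time the system was up. Likewise, "IOfs" is converted from a percent of WaitIO time into a percent of total uptime. The function names are illustrative, and the figures are copied from the table above.

```python
# Percentages copied from the table above; None where the table shows "--".
ROWS = {
    # name: (utilization, user, wait_io, io_fs)
    "antero":   (91.1, 93.3, 0.4, None),
    "dataproc": (49.1, 49.7, 3.4, 86.2),
    "ute":      (72.1, 73.2, 0.6, 80.9),
}


def implied_uptime(utilization, user):
    """Uptime share (percent), assuming Utilz'n = User x uptime fraction."""
    return 100.0 * utilization / user


def share_of_uptime(wait_io, part_of_wait):
    """Convert a percent-of-WaitIO figure into a percent of total uptime."""
    return wait_io * part_of_wait / 100.0


for name, (util, user, wait_io, io_fs) in ROWS.items():
    line = f"{name}: implied uptime {implied_uptime(util, user):.1f}%"
    if io_fs is not None:
        # Only the SGI rows report the filesystem share of WaitIO time.
        line += f", filesystem I/O wait {share_of_uptime(wait_io, io_fs):.2f}% of uptime"
    print(line)
```

Under this reading, a system whose "Utilz'n" and "User" figures are close together was up for nearly the whole measuring period.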

Production systems decommissioned during FY1999: Average performance and utilization statistics

System name  Hardware/#PEs            Notes                                       Gflops  Utilz'n  User   Idle   System  WaitIO  IOfs   IOswp
T3D          Cray T3D/128             CSL; decommissioned 8 June 1999             --      69.9%    72.8%  27.2%  --      --      --     --
sioux        HP SPP-2000/64           Community; decommissioned 14 May 1999       --      41.8%    42.4%  57.6%  --      --      --     --
winterpark   SGI PowerChallenge XL/8  CSL & Community; decommissioned 3 May 1999  --      29.6%    29.8%  40.8%  7.7%    20.6%   86.5%  12.8%

The columns are defined as in the preceding table of systems in production at the end of FY1999.

End-FY1999 production supercomputer systems

SCD's supercomputer resources consist of two relatively separate computational facilities: the Climate Simulation Laboratory (CSL) and Community facilities. Some systems, such as the new IBM SP systems and the "dataproc" system, are shared between these two facilities. The following sections describe the supercomputer systems available in these two facilities.

CSL facility

The Climate Simulation Laboratory facility had available the following supercomputer resources at the end of FY1999:

System                     # CPUs  GB memory  Peak Gflops  Notes

Dedicated:
IBM SP (blackforest)       128     64         102.4        256 total system batch CPUs; half dedicated to CSL
SGI Origin2000 (ute)       128     16         64.0
Cray C90 (antero)          16      2          15.3
Cray J90 (aztec)           20      4          4.0

Shared:
IBM SP (babyblue)          20      10         16.0         Shared new-release test platform; available for user use
SGI Origin2000 (dataproc)  16      16         8.0          Shared with Community for data analysis and post-processing applications

 

Community facility

The Community facility had available the following supercomputer resources at the end of FY1999:

System                     # CPUs  GB memory  Peak Gflops  Notes

Dedicated:
IBM SP (blackforest)       128     64         102.4        256 total system batch CPUs; half dedicated to Community
Cray J90se (chipeta)       24      8          4.8
Cray J90se (ouray)         24      8          4.8
Cray J90 (paiute)          16      2          3.2

Shared:
IBM SP (babyblue)          20      10         16.0         Shared new-release test platform; available for user use
SGI Origin2000 (dataproc)  16      16         8.0          Shared with CSL for data analysis and post-processing applications
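The peak figures in the two facility tables can be set against the average Gflops reported in the utilization statistics, which list sustained rates only for the five Cray systems. The following sketch simply divides one by the other. The pairing of the two tables is an illustration added here, not a metric the report itself defines, and the function name is hypothetical.

```python
# name: (average sustained Gflops from the utilization statistics,
#        peak Gflops from the facility tables)
CRAY_SYSTEMS = {
    "antero":  (4.541, 15.3),
    "aztec":   (1.079, 4.0),
    "chipeta": (1.608, 4.8),
    "ouray":   (1.523, 4.8),
    "paiute":  (0.880, 3.2),
}


def sustained_fraction(sustained, peak):
    """Fraction of peak performance delivered on average."""
    return sustained / peak


for name, (sustained, peak) in CRAY_SYSTEMS.items():
    print(f"{name}: {100 * sustained_fraction(sustained, peak):.1f}% of peak")

# Aggregate over the five Cray systems.
total_sustained = sum(s for s, _ in CRAY_SYSTEMS.values())
total_peak = sum(p for _, p in CRAY_SYSTEMS.values())
print(f"all Crays: {total_sustained:.3f} of {total_peak:.1f} peak Gflops")
```

No such ratio can be computed for the IBM and SGI systems, because the statistics tables report no Gflops values for them.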

 

FY1999 supercomputer resource changes

Additions and upgrades

During FY1999, SCD deployed three new supercomputers: two IBM SPs and an SGI Origin2000.

The first IBM SP (babyblue) was delivered to NCAR on 25 June 1999. It is a single-frame system with 16 Winterhawk-I nodes (32 Power3 processors), 1 GB of memory per node, and approximately 200 GB of disk capacity. SCD used this system to familiarize system administrators, consultants, and other staff with the IBM SP and AIX environments, and it will be used to install and test future software upgrades before they are put into production on the large SP system. This system is also available to any user who has access to the larger SP (blackforest).

The production IBM SP system (blackforest) was delivered to NCAR on 11 August 1999, and it successfully completed its 30-day Acceptance Test Period (ATP) on 3 October 1999. The system consists of 144 Winterhawk-I nodes (288 Power3 processors), 1 GB of memory per node, and approximately 2.5 TB disk capacity. Of the 144 nodes, 128 are reserved for dedicated batch work: 64 nodes (128 processors) for CSL and 64 nodes (128 processors) for the Community. The other 16 nodes are shared between CSL and the Community for interactive access, non-dedicated batch work, and for handling filesystem I/O and network communications.
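The node and processor counts in this description can be checked with a few lines of arithmetic. The sketch below uses only figures stated above (144 nodes, 288 processors, 1 GB of memory per node, and 64 dedicated batch nodes per facility); the variable and function names are illustrative.

```python
# Figures taken from the blackforest description above.
TOTAL_NODES = 144
PROCESSORS_PER_NODE = 2          # 288 processors / 144 nodes
MEMORY_GB_PER_NODE = 1
DEDICATED_BATCH_NODES = {"CSL": 64, "Community": 64}


def processors(nodes):
    """Number of processors on a given number of two-processor nodes."""
    return nodes * PROCESSORS_PER_NODE


# Nodes left over for interactive access, non-dedicated batch work,
# filesystem I/O, and network communications.
shared_nodes = TOTAL_NODES - sum(DEDICATED_BATCH_NODES.values())

for facility, nodes in DEDICATED_BATCH_NODES.items():
    print(f"{facility}: {nodes} nodes, {processors(nodes)} processors, "
          f"{nodes * MEMORY_GB_PER_NODE} GB memory")
print(f"Shared: {shared_nodes} nodes, {processors(shared_nodes)} processors")
print(f"Total: {processors(TOTAL_NODES)} processors")
```

The per-facility result (128 processors and 64 GB of memory) agrees with the blackforest rows in the CSL and Community facility tables.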

An SGI Origin2000/16 (dataproc) was put into production on 3 May 1999 to replace the original "DataPark" system, winterpark, which had become severely overloaded. Dataproc is available for any user having either a CSL or Community allocation, for performing data analysis and/or model output post-processing work.

Decommissionings in FY1999

The Cray T3D, which had served the CSL since July 1994, was decommissioned on 8 June 1999. It initially contained 64 processors and was upgraded to 128 processors in March 1997.

The Hewlett-Packard SPP-2000 (sioux) was decommissioned on 14 May 1999. SCD had acquired it in May 1997, initially as an evaluation system, and moved it into the Community facility for production work in early 1998.

The SGI PowerChallenge XL/8 (winterpark), which had served as a proof-of-concept system, called "DataPark" by SCD, for an interactive platform for data processing and model post analysis, was decommissioned on 3 May 1999 and replaced by a 16-processor SGI Origin2000 (dataproc).

Key maintenance activities

During FY1999, SCD performed ongoing maintenance to ensure the integrity and reliability of existing computational systems. Some of the key areas were:

Automated monitoring of systems

The High Performance Systems (HPS) section of SCD maintains a suite of system-monitoring utilities (known collectively as "sysmon") on all compute servers; these utilities monitor the servers and log critical system information. Currently the sysmon software routinely sends HPS members brief reports on system utilization, error and warning conditions, and system daemon status. This software also keeps track of MSS activities on the supercomputers and alerts HPS staff and the SCD Computer Production Group (CPG) staff when anomalous conditions occur.

Sysmon has been a very useful tool for HPS and CPG. HPS enhanced and further automated the operation and monitoring of supercomputer systems and ported "sysmon" to the IBM SP systems during FY1999.

In addition, in early FY1999, the High Performance Systems section of SCD, in cooperation with CPG, developed and began the operational deployment of additional system-monitoring capabilities that are integrated with commercial paging services. These additional notification capabilities have not only freed CPG staff from some of the more mundane system operation and monitoring tasks but also provide a much more timely alert mechanism for potential problems with the production supercomputers, Mass Storage System, and SCD server systems.
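The report does not describe how sysmon is implemented, so the following is only a minimal sketch of the general pattern described in this section: sample system metrics, test them against anomaly conditions, and notify staff when a condition is met. Every name in it (`Check`, `run_checks`, the metric keys, the thresholds, and the host `examplehost`) is hypothetical and is not taken from sysmon or from any paging service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One anomaly condition evaluated against a sample of system metrics."""
    name: str
    is_anomalous: Callable[[Dict[str, float]], bool]
    message: str


def run_checks(host: str, sample: Dict[str, float], checks: List[Check],
               notify: Callable[[str], None]) -> List[str]:
    """Evaluate every check and pass each triggered alert to notify()."""
    alerts = []
    for check in checks:
        if check.is_anomalous(sample):
            text = f"{host}: {check.message}"
            notify(text)  # e.g. send a page or an e-mail (hypothetical hook)
            alerts.append(text)
    return alerts


# Hypothetical thresholds, chosen only for illustration.
CHECKS = [
    Check("idle", lambda s: s["idle_pct"] > 90.0, "system almost entirely idle"),
    Check("waitio", lambda s: s["wait_io_pct"] > 15.0, "high I/O wait"),
    Check("daemons", lambda s: s["daemons_down"] > 0, "system daemon not running"),
]

sent: List[str] = []  # stands in for a paging or mail gateway
sample = {"idle_pct": 40.0, "wait_io_pct": 20.0, "daemons_down": 0}
alerts = run_checks("examplehost", sample, CHECKS, sent.append)
print(alerts)
```

Passing the notification function in as a parameter keeps the checks independent of the delivery mechanism, so the same conditions could feed routine reports or a paging service.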

Dedicated data processing platforms

During FY1997, SCD established a prototype "DataPark" system. It consisted of a Silicon Graphics Challenge and a Silicon Graphics Power Challenge computer. In FY1998, SCD augmented these systems, and as technologies changed and user interest in dedicated data-processing platforms grew, the DataPark concept evolved. During FY1999, the file services portion of the DataPark concept was integrated into the future plans for the NCAR Mass Storage System (see the MSS Roadmap), and the data-processing engine portion took on a life of its own.

SCD's deployment of a small supercomputer system dedicated to data analysis and model output post-processing has been well received. The initial system was an eight-processor SGI PowerChallenge XL (winterpark). It served as a proof-of-concept system for an interactive platform for data processing and model post analysis. It was decommissioned on 3 May 1999 and replaced by a 16-processor SGI Origin2000 (dataproc) with 16 GB of memory and over 1 TB of disk capacity. Dataproc is available to any user who has either a CSL or Community allocation, for data analysis and model output post-processing work.

Because of the heavy usage of the existing dataproc system and keen interest among NCAR divisions, SCD is considering deploying "divisional dataproc" systems. In addition, SCD has begun discussions with member universities to deploy, administer, and operate similar systems for those universities within the SCD computer facility. We hope that FY2000 will see a proliferation of these types of systems.

