High performance computing
The production supercomputer environment managed by SCD for NCAR has
evolved over the years. During the last 19 years, SCD has brought
NCAR's science into the multiprocessing supercomputer world. Prior to
the introduction of the four-CPU Cray X-MP in October 1986, all
modeling was performed with serial codes. Since then, the focus has
been on redeveloping codes to harness the power of multiple CPUs in a
single system, and most recently, in multiple systems.
During the last 19 years, SCD has deployed a series of Parallel Vector
Processor (PVP) systems ranging from a 2-CPU Cray Y-MP to a pair of
24-CPU Cray J90se systems. Massively Parallel Processing (MPP) systems
included the Cray T3D with 128 processors and the Thinking Machines CM2
and CM5 systems. Most recently, Distributed Shared Memory (DSM) systems
have been deployed; these include the Hewlett-Packard SPP-2000, SGI
Origin2000, Compaq ES40 cluster, SGI Origin3800, and the IBM SP POWER3
and POWER4 systems.
The following diagram shows the systems that SCD has deployed for NCAR's
use since its inception. Systems shown with blue bars were deployed for
production purposes; those shown in red were (or are) considered
experimental systems.

In 1986, with the first multiprocessor system (the Cray X-MP/4) on
NCAR's floor, SCD could deliver on average approximately 0.25 GFLOPS
of sustainable computing capacity to NCAR's science. In the roughly
19 years since, that sustained computing capacity has grown
significantly.

In October 2002 (FY2003), SCD took delivery of Phase II of NCAR's
Advanced Research Computing System (ARCS). The delivery added a
complete new IBM Cluster 1600 system, called bluesky, to the SCD
computational environment.
The bluesky system introduced IBM's next-generation POWER4 processor.
Both the POWER3 and POWER4 processors can perform four floating point
operations per clock tick. The POWER4 runs at a clock speed of 1.3
gigahertz, substantially faster than the 375-megahertz POWER3
processor, so each POWER4 processor has a peak computation rate of 5.2
gigaflops. The bluesky system initially comprised 38 Regatta-H Turbo
frames, each with 32 POWER4 processors, for a total of 1,216 POWER4
processors and a peak computation rate of 6.323 teraflops.
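The peak figures quoted above follow from simple arithmetic: operations
per clock tick times clock rate gives the per-processor peak, and
multiplying by the processor count gives the system peak. The following
is a minimal sketch in Python; the function and variable names are
illustrative assumptions and are not part of any SCD tool.

```python
def peak_gflops(flops_per_clock: int, clock_ghz: float, processors: int = 1) -> float:
    """Theoretical peak in GFLOPS: operations per clock x clock rate (GHz) x processors."""
    return flops_per_clock * clock_ghz * processors


# Figures quoted in the text
power4_peak = peak_gflops(4, 1.3)        # one POWER4 processor at 1.3 GHz
power3_peak = peak_gflops(4, 0.375)      # one POWER3 processor at 375 MHz
bluesky_processors = 38 * 32             # 38 frames x 32 processors per frame
bluesky_peak_tflops = peak_gflops(4, 1.3, bluesky_processors) / 1000.0

print(f"POWER4: {power4_peak:.1f} GFLOPS, POWER3: {power3_peak:.1f} GFLOPS")
print(f"bluesky: {bluesky_processors} processors, {bluesky_peak_tflops:.3f} TFLOPS peak")
```

These are theoretical peaks only; the sustained rates that production
codes achieve are considerably lower, as the utilization tables later
in this section indicate.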
Also in FY2003, Phase III of the Advanced Research Computing System
(ARCS) was delivered to NCAR. This was an expansion of the IBM cluster
(bluesky) by 14 32-way p690 SMP servers (i.e., frames), each based on
the POWER4 microprocessor operating at a clock frequency of 1.3 GHz and
each including 64 GB of memory. The expansion also included 10.5 TB of
formatted disk storage, which was added to the existing disk subsystem,
increasing bluesky's total disk capacity to 31 TB. Of the 14 servers,
only 12 were added to bluesky; the remaining two are temporarily being
used for a special SCD testbed project. Bluesky now comprises 50 POWER4
Regatta-H Turbo frames, making it the single largest system of this
type in the world.
Initially, the 12 additional 32-way p690 SMP servers will support
expanded use of CCSM for contributions to the IPCC process, as noted in
SCD's Annual Budget Review. The installation of the bluesky system and
its subsequent augmentation have doubled the capacities of both the
Climate Simulation Laboratory and Community computing.
In addition, several major system software upgrades were performed on
all supercomputers.
Supercomputer systems maintained during FY2003
DSM systems:
- SGI Origin2100 (chinookfe), with 8 processors, was used in the
Climate Simulation Laboratory.
- SGI Origin3800 (chinook), with 128 processors, was used in the
Climate Simulation Laboratory.
- SGI Origin2000 (dataproc), with 16 processors, was used by both
Climate Simulation Laboratory and Community users.
- SGI Origin2000 (mouache), with 4 processors, was used as a test
platform by SCD for evaluation of new Irix systems, libraries, and
compilers prior to their installation on the production SGI platforms;
all interested users now have access to mouache.
- IBM SP (babyblue), with 64 processors, was shared by the Climate
Simulation Laboratory and the Community.
- IBM SP (blackforest), with 1,308 processors, was shared by the
Climate Simulation Laboratory and the Community.
- IBM NightHawk2 (dave), with 16 processors, was shared by the Climate
Simulation Laboratory and the Community.
- IBM p690 Regatta (bluedawn), with 16 processors, was used as a test
and development platform for the integration of the IBM POWER4 Cluster
1600.
- IBM Cluster 1600 (bluesky), with 1,600 processors, was shared by the
Climate Simulation Laboratory and the Community.
At the end of FY2003, the production supercomputer environment managed
by SCD for NCAR included five IBM supercomputers and four SGI
supercomputers.
The following tables provide average utilization and performance
statistics for the production supercomputer systems SCD operated in
FY2003. In addition, SCD publishes monthly usage reports at
http://www.scd.ucar.edu/dbsg/dbs/. These reports provide summary
information on system usage, project allocations, and General
Accounting Unit (GAU) use.
Production systems for FY2003: Average performance and utilization statistics
| System name | Hardware/#PEs | Notes | GFLOPs | Utilz'n | User | Idle | System | WaitIO | IOfs | IOswp |
|---|---|---|---|---|---|---|---|---|---|---|
| babyblue | IBM SP/48 | CSL and Community | 1.17 | 31.9% | 24.7% | 73.8% | 0.8% | 0.0% | -- | -- |
| blackforest | IBM SP/1128 | CSL and Community | 60.5 | 73.4% | 61.1% | 37.3% | 1.3% | 0.0% | -- | -- |
| blackforest | IBM NH2/32 | CSL and Community | 2.10 | 53.0% | 45.5% | 53.6% | 0.8% | 0.0% | -- | -- |
| bluedawn | IBM p690/16 | Test and Development | 0.59 | 13.9% | 11.9% | 86.6% | 1.0% | 0.0% | -- | -- |
| bluesky B32Way | IBM p690/800 | CSL and Community | 39.6 | 66.7% | 56.1% | 42.7% | 1.1% | 0.0% | -- | -- |
| bluesky B8Way | IBM p690/608 | CSL and Community | 110.7 | 79.5% | 63.8% | 34.9% | 1.2% | 0.0% | -- | -- |
| dave | IBM NH2/16 | CSL and Community | 0.11 | 10.1% | 10.1% | 86.6% | 2.5% | 0.2% | -- | -- |
| dataproc | SGI O2K/16 | CSL and Community | -- | 21.3% | 21.5% | 72.4% | 4.7% | 1.2% | 85.3% | 0.2% |
| mouache | SGI O2K/4 | CSL and Community | -- | 1.3% | 1.3% | 97.0% | 0.3% | 0.6% | 45.4% | 35.4% |
| chinook | SGI O3K/128 | CSL | -- | 70.3% | 70.6% | 28.0% | 1.2% | 0.0% | 75.8% | 0.0% |
| chinookfe | SGI O2K/8 | CSL | -- | 32.6% | 32.7% | 63.3% | 3.6% | 0.2% | 77.3% | 2.7% |
Note: "Utilz'n" is the average user utilization of the system (system
downtime counts against utilization); "User" is the percent of uptime
occupied in performing computation for user processes; "Idle" is the
percent of uptime spent idle; "System" is the percent of uptime
consumed in system overhead; "WaitIO" is the percent of uptime spent
awaiting I/O completion; "IOfs" is the percent of the WaitIO time spent
in performing user filesystem I/O; and "IOswp" is the percent of the
WaitIO time spent in performing process swapping/paging.
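Because "IOfs" and "IOswp" are expressed as percentages of the WaitIO
time and not of uptime, converting them to a share of uptime takes one
multiplication. The sketch below uses the dataproc row from the table
above; the helper name is a hypothetical one chosen for illustration.

```python
def share_of_uptime(wait_io_pct: float, io_component_pct: float) -> float:
    """Convert a percentage of WaitIO time into a percentage of total uptime."""
    return wait_io_pct * io_component_pct / 100.0


# dataproc row: WaitIO = 1.2% of uptime; IOfs = 85.3% and IOswp = 0.2% of WaitIO
dataproc_fs_io = share_of_uptime(1.2, 85.3)
dataproc_swap_io = share_of_uptime(1.2, 0.2)

# User, Idle, System, and WaitIO are all shares of uptime, so they should sum
# to roughly 100% (the published values are rounded averages).
dataproc_uptime_total = 21.5 + 72.4 + 4.7 + 1.2

print(f"filesystem I/O: {dataproc_fs_io:.4f}% of uptime")
print(f"swap/paging I/O: {dataproc_swap_io:.4f}% of uptime")
print(f"uptime components sum to {dataproc_uptime_total:.1f}%")
```

Read this way, even the systems with the highest WaitIO values spent
only a small fraction of their uptime waiting on filesystem or paging
I/O.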
The SCD supercomputer resources are divided between two separate
computational facilities: the Climate Simulation Laboratory (CSL) and
the Community Computing facility. Some systems, such as the IBM SP
systems, the dave system, and the dataproc system, are shared between
these two facilities. The following sections describe the supercomputer
systems available in each facility.
CSL facility
The Climate Simulation Laboratory facility provided the following
supercomputer resources at the end of FY2003:
Climate Simulation Lab facility, FY2003 configuration
| | System | # CPUs | GB memory | Peak GFLOPS | Notes |
|---|---|---|---|---|---|
| Dedicated: | IBM SP (blackforest) | 560 | 280 | 840.0 | 1,120 total system batch CPUs; 560 dedicated to CSL |
| Dedicated: | IBM SP (bluesky) | 704 | 1408 | 3660.8 | 1,408 total system batch CPUs; 704 dedicated to CSL |
| Dedicated: | SGI Origin3800 (chinook) | 124 | 64 | 124.0 | 124 CPUs dedicated to CSL |
| Dedicated: | SGI Origin2100 (chinookfe) | 8 | 8 | 4.0 | Front-end system for chinook |
| Shared: | IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use |
| Shared: | SGI Origin2000 (dataproc) | 16 | 32 | 8.0 | Shared with Community for data analysis and post-processing applications |
| Shared: | IBM NightHawk2 (dave) | 16 | 32 | 24.0 | Shared with Community for data analysis and post-processing applications |
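The Peak GFLOPS column is the CPU count multiplied by each processor's
theoretical peak, so dividing the two columns recovers the per-CPU
figure. The brief sketch below does this for the rows above; the list
layout and names are assumptions made for this illustration.

```python
# (system, CPUs, peak GFLOPS) copied from the CSL facility table
csl_systems = [
    ("blackforest", 560, 840.0),
    ("bluesky", 704, 3660.8),
    ("chinook", 124, 124.0),
    ("chinookfe", 8, 4.0),
    ("babyblue", 48, 72.0),
    ("dataproc", 16, 8.0),
    ("dave", 16, 24.0),
]

# Peak GFLOPS per CPU for each system
per_cpu_peak = {name: peak / cpus for name, cpus, peak in csl_systems}

# Sum of the Peak GFLOPS column over the four rows marked "Dedicated"
dedicated_peak = sum(peak for _, _, peak in csl_systems[:4])

for name, value in per_cpu_peak.items():
    print(f"{name:12s} {value:.1f} GFLOPS per CPU")
print(f"dedicated CSL peak: {dedicated_peak:.1f} GFLOPS")
```

The per-CPU values for the IBM systems match the POWER3 and POWER4
peaks implied by four floating point operations per clock tick at 375
MHz and 1.3 GHz.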
Community Computing facility
The Community Computing facility provided the following supercomputer
resources at the end of FY2003:
Community Computing facility, FY2003 configuration
| | System | # CPUs | GB memory | Peak GFLOPS | Notes |
|---|---|---|---|---|---|
| Dedicated: | IBM SP (blackforest) | 560 | 280 | 840.0 | 1,120 total system batch CPUs; 560 dedicated to Community |
| Dedicated: | IBM SP (bluesky) | 704 | 1408 | 3660.8 | 1,408 total system batch CPUs; 704 dedicated to Community |
| Shared: | IBM SP (babyblue) | 48 | 24 | 72.0 | Shared new-release test platform; available for user use |
| Shared: | SGI Origin2000 (dataproc) | 16 | 16 | 8.0 | Shared with CSL for data analysis and post-processing applications |
| Shared: | IBM NightHawk2 (dave) | 16 | 32 | 24.0 | Shared with CSL for data analysis and post-processing applications |
During FY2003, SCD performed ongoing maintenance to ensure the
integrity and reliability of existing computational systems and to
improve the quality of service to the NCAR user community. Some of the
key areas were:
Maintain supercomputer operating systems
SCD stayed apprised of major software releases from IBM and carefully
scheduled upgrades to the production systems and product set software
based on the judged stability of those upgrades in the NCAR production
environment. SCD also continued to provide major system support for
the SGI Origin3800 and Origin2000 systems.
Maintain stability and reliability of systems
One of the most significant attributes of the NCAR computational
environment is its overall stability and reliability. For instance,
the NCAR Mass Storage System has a reputation for reliability, and
SCD has in the last year deployed a number of high-availability
fileserver systems. This reliability and stability do not come
easily; they stem from a combination of choosing reliable, stable
vendor products and using proven, fail-safe system administration
and maintenance techniques. SCD will continue to focus on ensuring,
in whatever ways possible, highly stable and reliable systems and
systems operations.
System monitoring
Over the years, SCD has developed a large number of system monitoring
procedures, techniques, and tools. SCD continued to draw on its
collective experience to maintain the stability of the existing
production systems through this proactive monitoring. SCD also
continued to enhance its monitoring tools, techniques, and procedures,
and automated a number of procedures for detecting system failure or
trouble. This automation was integrated with commercial alphanumeric
paging technology to alert SCD operations and systems staff more
rapidly and thus reduce the amount of time that systems are unavailable
to the NCAR user community when they do fail.
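The automated failure detection described above can be pictured as a
periodic health check that pages staff only after several consecutive
failures. The following is a minimal sketch under stated assumptions,
not SCD's actual tooling: the `FailureMonitor` class, the `send_page`
callback, and the failure threshold are all hypothetical.

```python
from typing import Callable, Dict, List


class FailureMonitor:
    """Track consecutive health-check failures per system and page once per outage."""

    def __init__(self, send_page: Callable[[str], None], threshold: int = 3) -> None:
        self.send_page = send_page      # hypothetical hook into a paging service
        self.threshold = threshold      # consecutive failures required before alerting
        self.failures: Dict[str, int] = {}

    def record(self, system: str, healthy: bool) -> None:
        """Record one check result; page when the failure count reaches the threshold."""
        if healthy:
            self.failures[system] = 0   # recovery resets the counter
            return
        self.failures[system] = self.failures.get(system, 0) + 1
        if self.failures[system] == self.threshold:
            self.send_page(f"{system}: {self.threshold} consecutive failed checks")


# Usage: collect pages in a list instead of contacting a real paging service.
pages: List[str] = []
monitor = FailureMonitor(send_page=pages.append, threshold=2)
for result in [True, False, False, False, True]:
    monitor.record("bluesky", result)
print(pages)
```

Requiring consecutive failures before paging is one common way to avoid
alerting on a single transient error; comparing with `==` means one
page is sent per outage instead of one per failed check.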