1998 ASR Home
Back
SCD ASR Index
Next
SCD Home

Supercomputing system advances

The Supercomputer Systems Group (SSG) of the High Performance Systems section of SCD is responsible for systems engineering, support, and administration of all production computational systems managed by SCD. While the group continued performing routine system maintenance on existing supercomputer systems, the introduction and integration of new Distributed Shared Memory (DSM) and Parallel Vector Processor (PVP) systems into the Climate Simulation Laboratory and Community Computing environments were the primary focus of activities in FY1998.

The most significant project accomplishments for SSG during FY1998 were:

In addition, SSG continued to monitor all compute servers closely and made changes as needed to keep the systems well tuned and to obtain the maximum utilization and throughput they can provide.

Installation and integration of the Origin2000 (ute) into the CSL

SCD installed a new 128-processor Silicon Graphics Cray Origin2000 system during the spring of 1998. SSG played a major role in the installation, system tuning, HiPPI testing, porting of local codes, system debugging, and testing of Silicon Graphics products such as compilers, batch systems, and system performance tools.

In addition, SCD acquired a small four-processor Origin2000 system (mouache) for system testing before the large system was installed. SSG staff and the Mass Storage System Group (MSSG) have used this system extensively to conduct IRIX operating system tests and NCAR Mass Storage System (MSS) tests, and to evaluate a number of other system capabilities before introducing them on the larger, production Origin2000.

SSG installed, tested, and implemented the following IRIX system components:

SSG facilitated the system installation and made the Origin2000 a viable production compute server for the Climate Simulation Laboratory (CSL).

Installation and integration of the Cray J90se/24-1024 (chipeta) into the Community

A second Cray J90se with 24 CPUs and a billion words of memory was added to the Community supercomputing resources during FY1998. The installation and acceptance of the system were flawless, largely because the system was configured identically to its predecessor J90se (ouray). On 24 March 1998, this second J90se system was put into production; it became fully utilized within four hours and has effectively remained so since. In keeping with the convention of recognizing American Indian tribes and leaders established for the J90 series systems at NCAR, this system was named "chipeta"; Chipeta was Chief Ouray's wife.

Integrating the SPP-2000 (sioux) into the Community

Hewlett Packard's (HP) SPP-2000 Exemplar has been at NCAR since late April 1997. SCD spent the first few months of FY1998 evaluating the system to determine its readiness for introduction into the Community supercomputer resource "pool." Though there were numerous operating system feature deficiencies, the system was placed into a "friendly" user state, then into a "limited production" state by spring 1998. Because the operating system provides no accounting capabilities, SCD has left the system unallocated but available to any Community users interested in obtaining an account and using the system.

During the testing and evaluation of sioux, SSG identified and addressed the following system problems:

SSG continued to monitor sioux closely and added, changed, or tuned resources as necessary to ensure the best possible system performance and job throughput. To date, the system has been relatively well utilized, and utilization has been increasing over time, but the load comes from only a small number of Community users; SCD had hoped that not charging for sioux's use would encourage more Community users to jump onto the DSM "bandwagon."

Batch Priority Scheduler (BPS)

BPS was initially developed in 1996 to meet SCD's job scheduling needs on the Parallel Vector Processor (PVP) systems that could not be met with the vendor-supplied batch queuing software. Its features include round-robin-by-proposal scheduling and priority-based preemptive scheduling. Near-dedicated jobs -- jobs that need almost all of a computer's resources to run -- can also be scheduled automatically, as can jobs with other special needs (such as very high memory requirements). Calendar-based scheduling options were added to BPS in 1997 to start and stop batch queues automatically on certain days or at certain times of the day. This feature also lets us, for example, give higher priority to low-memory jobs during the day and to high-memory jobs at night.
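The two scheduling ideas named above, round-robin by proposal and a time-of-day preference based on job memory, can be illustrated with a minimal Python sketch. This is not BPS code: the Job fields, the daytime window, the 64-megaword cutoff, and every function name are assumptions chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from datetime import time

# Hypothetical settings; none of these values come from BPS itself.
DAY_START = time(8, 0)
DAY_END = time(18, 0)
HIGH_MEMORY_MW = 64  # assumed cutoff between low- and high-memory jobs (megawords)


@dataclass
class Job:
    name: str
    proposal: str   # the allocation (proposal) the job is charged to
    memory_mw: int  # requested memory in megawords


def calendar_priority(job, now):
    """Return 1 if the job's memory class is favored at this time of day, else 0."""
    daytime = DAY_START <= now < DAY_END
    high_memory = job.memory_mw >= HIGH_MEMORY_MW
    favored = (not high_memory) if daytime else high_memory
    return 1 if favored else 0


def round_robin_by_proposal(jobs):
    """Interleave jobs so each proposal gets one turn per pass, in first-seen order."""
    queues = {}
    for job in jobs:
        queues.setdefault(job.proposal, deque()).append(job)
    ordered = []
    while queues:
        for proposal in list(queues):  # copy keys so empty queues can be dropped
            ordered.append(queues[proposal].popleft())
            if not queues[proposal]:
                del queues[proposal]
    return ordered


def schedule(jobs, now):
    """Order jobs by calendar priority, keeping round-robin order within a level."""
    rotated = round_robin_by_proposal(jobs)
    # sorted() is stable, so ties keep their round-robin order.
    return sorted(rotated, key=lambda job: -calendar_priority(job, now))
```

Under these assumptions, calling schedule(jobs, time(12, 0)) places low-memory jobs first, while schedule(jobs, time(23, 0)) places high-memory jobs first; within each group, no single proposal can monopolize the front of the queue.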

During FY1998, BPS managed production workloads on all the Cray supercomputers and was ported in early FY1998 to the Silicon Graphics Power Challenge (winterpark). This port has provided SSG with more control over the queues and batch jobs on winterpark and allows us to implement queue and job policies as needed.

Several enhancements were made during FY1998, including:

Network Queuing System and File Transfer Agent (FTA)

The primary purpose of the NQE Client/FTA project was to replace MASnet remote batch job submission. In addition, the NQE Client/FTA project transfers reliance from a locally developed and supported product to an off-the-shelf vendor-supported product. MASnet remote batch job submission relies on obsolete USCP (UNICOS Station Call Processor) hooks in the NQS code. Continued vendor support of USCP hooks in NQS is uncertain.

NQE clients and FTA enable a user to submit, delete, and check the status of an NQE batch job from a remote system, such as a desktop workstation or departmental server. In addition, the NQE batch job's standard output and standard error files are returned to the system from which the job was submitted.

FTA is the underlying transport mechanism used for moving files between NQE clients and NQE execution servers. FTA has been configured to provide reliable transport service, meaning that FTA transfers that have failed due to network problems are retried until the transfer is successful. FTA has also been configured to use peer-to-peer authorization, which allows FTA file transfers to take place without sending passwords across the network.
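The retry behavior described above can be sketched generically in Python. This is only an illustration of the retry-until-success pattern, not FTA's implementation or configuration syntax; TransferError, transfer_with_retry, and the 30-second delay are hypothetical names and values.

```python
import time


class TransferError(Exception):
    """Hypothetical stand-in for a transfer that failed due to a network problem."""


def transfer_with_retry(send, retry_delay_s=30.0, max_attempts=None, sleep=time.sleep):
    """Call send() until it succeeds and return the number of attempts made.

    With max_attempts=None the transfer is retried indefinitely, which mirrors
    "retried until the transfer is successful". A limit can be set for testing.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            send()
            return attempt
        except TransferError:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and surface the last failure
            sleep(retry_delay_s)  # wait before trying again
```

The sleep parameter is injectable so the loop can be exercised without real delays. Only the assumed network failure type is retried; any other exception propagates immediately.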

NQE clients and FTA were installed on the MIGS, meeker, and niwot systems in SCD and on several MMM, CGD, HAO, and ACD hosts. FTA was configured on the Cray C90 (antero), Cray J90/20 (aztec), Cray J90se/24 (ouray and chipeta), Cray J90/16 (paiute), and Silicon Graphics Power Challenge (winterpark), and the primary NQE transfer agent on these hosts was set to FTA.

System monitoring and reporting

SSG maintains a suite of system monitoring utilities (known collectively as "sysmon") on all compute servers; these utilities monitor the servers and log critical system information. Currently the sysmon software routinely sends SSG brief reports on system utilization, error and warning conditions, and system daemon status. This software also keeps track of MSS activities on the supercomputers and alerts SSG and the SCD Computer Production Group (CPG) staff when anomalous conditions occur.
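As a rough illustration of the kind of brief report described above, the following Python sketch combines a utilization figure with daemon status and flags anomalous conditions. It is not sysmon code: the function name, the report format, the 50 percent threshold, and the daemon names used in the example are all assumptions.

```python
def build_report(host, utilization_pct, daemon_status, low_utilization_pct=50.0):
    """Return (report_lines, needs_alert) for one compute server.

    daemon_status maps a daemon name to True (running) or False (not running).
    The threshold and the message formats are hypothetical.
    """
    lines = [f"{host}: utilization {utilization_pct:.1f}%"]
    alerts = []
    if utilization_pct < low_utilization_pct:
        alerts.append(f"WARNING {host}: utilization below {low_utilization_pct:.0f}%")
    for name in sorted(daemon_status):
        if not daemon_status[name]:
            alerts.append(f"ERROR {host}: daemon {name} is not running")
    # An alert is needed whenever at least one anomalous condition was found.
    return lines + alerts, bool(alerts)
```

For example, build_report("ouray", 97.5, {"batchd": True, "mssd": False}) yields the utilization line plus one error line for the hypothetical daemon mssd, and sets the alert flag so that staff could be notified.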

Sysmon has been a very useful tool for SSG and CPG. During FY1998, SSG made enhancements to further automate the operation and monitoring of the supercomputer systems.

In addition, in late FY1998, the High Performance Systems section of SCD, in cooperation with CPG, developed and began the operational deployment of additional system monitoring capabilities that are integrated with commercial paging services. Experience to date indicates that this additional notification capability may free CPG staff from some of the more mundane system-operation tasks while providing a more timely alert mechanism for potential problems with the production supercomputers, Mass Storage System, and server systems.

System support for current supercomputers

SSG continued to provide system support for the current production supercomputers as its primary responsibility and delivered the same level of support as in the past. However, staff resources were divided between this work and learning how to manage the new Distributed Shared Memory (DSM) systems, such as the Origin2000 and the SPP-2000, in the NCAR computational environment.

SSG tracked vendor system releases and upgraded all the supercomputers to the latest levels of software. For instance, UNICOS 10.0, which adds Year-2000 compliance and enhanced reliability, was installed on all Crays during the spring of 1998. More information appears in the Maintenance of the existing production supercomputer environment report.

Year 2000 planning and testing

SCD conducted some assessments and evaluations during FY1998 and concentrated on upgrading all the supercomputer systems to be Year-2000 compliant. More testing will be done in FY1999. More detail appears in SCD's Year 2000 planning and testing report and in SCD's Y2K Overview and Plans.

In brief, SSG's Y2K activities during FY1998 included:

In addition, now that these systems have been upgraded, SSG intends to conduct a sequence of careful, isolated tests of these systems for Year 2000 problems that may exist in operational, administrative, and/or usage procedures.
