
Operational procedures and infrastructure

Overview

The Operations and Information Support (OIS) Section monitors SCD's computers and UCAR's networks 24 hours a day, 365 days a year. The section maintains the necessary environmental and hardware infrastructure and manages user accounts and allocations. It also generates the daily bulletin and the daily, weekly, and monthly reports needed for effective management of the computer resources. The groups within the OIS section are the Computer Production Group (CPG), the Infrastructure Support Group (ISG), and the Database Services Group (DBSG).

Monitoring and initial problem identification

The Computer Production Group (CPG) monitors SCD's computing resources 24 hours a day, 365 days a year. This group promptly identifies software, hardware, and network problems in an environment consisting of hundreds of devices and network connections. This allows for timely resolution of problems and minimizes interruption to service.

In FY1999 CPG started using the Remedy Trouble Ticket system to report system downtimes and events in the computer room. The use of this system reduced the number of duplicate tasks that were completed from 12 to about 4. This was accomplished primarily by automating a number of notifications. For FY2000, CPG plans to work with HPS to refine this and a number of other reporting systems currently in place. The goals for this project include:

  1. Eliminate redundant reporting
  2. Increase the quality of the information reported so it can be used more effectively
  3. Target the reporting toward those individuals who can make use of the information

Machine dependencies

The Machine Dependencies Committee was formed in 1996 to determine the dependencies that exist in the SCD computer room to:
  1. Recover from a complete machine room shutdown or crash
  2. Certify and document the correct sequence of boot procedures for all machines and networks
  3. Determine the basic production environment for cases when complete recovery cannot be obtained
  4. Define the full production environment

The boot-time dependencies of every system in the room continue to be redefined as systems are added to and removed from the machine room. This effort incorporates a review of network dependencies, the definition of various inter-system dependencies, and analysis of how these dependencies change as systems are installed or removed. During FY1999, the dependency on the three Uninterruptible Power Supply (UPS) units was added. The current boot-time dependency diagram is available as a 155-KB image.

The committee reduced the frequency of its meetings in FY1999 because most of the major tasks were completed in FY1998. The Computer Production Group (CPG) maintained the consistency and functionality of the boot-time dependency diagram. In FY1999 the work of the committee was put to the test on two occasions. In early February high winds caused a power outage of several hours in the computer room. During that time, the UPS capacity was exhausted and all of the systems in the computer room lost power. After power was restored, the computer systems were brought back to a production state within four hours. This was the easiest and shortest power-up from a complete down state that has ever been accomplished. This success was attributed to the work of the committee.

The second event involved a problem with the installation of a new UPS unit. The installation required the orderly shutdown and restart of the equipment in the room to safely bring the new UPS unit online. The Computer Production Group (CPG) used the dependency diagram to bring everything down in an orderly fashion and restore the systems to production. Although the restart was not quite as successful (a critical RAID system failed), a majority of production capability was again restored in about four hours.
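The orderly shutdown and restart described above amounts to walking a dependency graph: a system boots only after everything it depends on is up, and shutdown runs in the reverse order. The following is a minimal sketch of that idea, not the committee's actual procedure. The system names and the dependency map are hypothetical and are not taken from the boot-time dependency diagram.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (illustration only): each key lists the
# systems that must already be running before that key can be booted.
DEPENDS_ON = {
    "ups": [],
    "network": ["ups"],
    "raid": ["ups"],
    "file_server": ["network", "raid"],
    "compute": ["network", "file_server"],
}


def boot_order(depends_on):
    """Return a boot sequence in which every system follows its dependencies.

    TopologicalSorter raises graphlib.CycleError if the map contains a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return the reverse of the boot sequence, so dependents stop first."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("boot:    ", boot_order(DEPENDS_ON))
    print("shutdown:", shutdown_order(DEPENDS_ON))
```

Keeping the dependencies in one map means that adding or removing a system is a single edit, and both sequences are recomputed from it rather than maintained by hand.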

Equipment installations and removals

In late FY1999 the computer room underwent a large number of changes. A third automated cartridge system was initially installed for the 9840 beta test and was moved to production late in FY1999. The Infrastructure Support Group (ISG) was responsible for planning and installing a 16-node IBM RS/6000 SP and a 128-node IBM RS/6000 SP in the summer. To facilitate this installation, the Cray T3D was removed, and the 128-processor SGI Origin2000 was moved to a different location in the computer room. Planning was completed for the installation of a Compaq ES40 proof-of-concept system and for the removal of the Cray C90.

Floor space

With the installation of the large IBM RS/6000 SP system, the pressure on available floor space has increased. The new computer equipment takes significantly more space because it is less densely packaged to enable air-cooling. ISG will continue to monitor the space requirements and try to maintain as much flexibility for SCD's future plans as possible. To help with some of the floor space issues, ISG and CPG worked together to change half of the tape archive racks to high-capacity tape racks. This process was intensive and required moving more than 170,000 cartridges, but it almost doubled the number of tapes per square foot and increased the amount of floor space available.

Infrastructure

In FY1999 ISG completed several major projects enhancing the infrastructure of the computer room. The UPS units that were purchased in FY1998 were installed. The installation of the first UPS unit had several problems. The engineering of the project was completed by the service component of the manufacturer. However, the engineering did not point out a phase mismatch problem between the maintenance bypass transformer and the transformer internal to the UPS. Once the unit was bypassed, it was impossible to safely bring the UPS online without cutting power to the computer room.

This required the orderly shutdown of the whole computer room and then bringing the room back online. Because this problem was unexpected and required a large effort by a majority of the SCD staff, a report on lessons learned was prepared, and a presentation for SCDUG was given. The installation of the second UPS unit was completed in March and went very smoothly.

Trouble Ticket system

The full implementation and rollout of the Trouble Ticket system neared completion in FY1999. A major focus throughout the year was moving individual groups from their old reporting/request tools to the new Remedy-based Trouble Ticket system. The Trouble Ticket system is nearing full production in the division, and the final steps will be taken to convert one remaining group in early FY2000.

On the development side, there has been significant progress on developing a web-based interface to the system. The addition of two student assistants has accelerated progress. The web interface will provide the user community with an easy way to report problems and ask for assistance, all from one central location. In addition, the web interface will enable SCD staff to modify and work on requests via the web when they do not have access to the Remedy Client.

Inventory system

The Infrastructure Support Group (ISG) is responsible for the physical installation, tracking, and maintenance contract management of a majority of SCD's equipment. The group started on a new system to make these critical management tasks easier. After some evaluation, a product from Remedy was selected. This fit well because it can interact with the existing Trouble Ticket system to provide meaningful information.

The project has been successful and was put into production use late in FY1999. There are some issues that remain to be worked out with process and procedure. These details will be finalized in FY2000.

Site licensing

SCD's OIS performs site and volume licensing of various software for UCAR as a whole and for SCD. In general, licensing software for the entire organization benefits SCD and represents a significant savings to the organization. Examples include the site licenses for educational vendor programs, such as Sun's ScholarPAC and SGI's Varsity Program. Administering these licensing programs and distributing software through a single contact saves UCAR over $500,000 a year.

New in FY1999:

Resource allocation improvements

In FY1999, SCD's Database Services Group (DBSG) continued development of software to make the application and allocation process for computing resources easier for users, reviewers, and facilitating staff. The panel books for the October 1998 and May 1999 meetings of the SCD Advisory Panel were entirely electronic, allowing for dynamic rebuilding as last-minute applications arrived. The electronic format gave reviewers earlier access to requests and saved the cost of producing and mailing numerous thick hardcopy panel books.

In October 1998, the SCD Advisory Panel approved a new category of requestor who could receive an allocation: a university faculty member or research associate in atmospheric sciences or related sciences who obtained their Ph.D. within the past five years. These requestors may obtain 100 GAUs without NSF sponsorship, just like post-docs and graduate students. The panel also approved the allocation of up to 10 GAUs for two years to university faculty for small data-access accounts without NSF sponsorship. Previously, faculty had to purchase these resources at a cost-recovery rate.

Database Y2K preparation

The Database Services Group uses Oracle software to track SCD's trouble ticket data and to track SCD's computer resources and associated computer project data. To prepare for Year 2000, a new Y2K-compliant version of Oracle software including Oracle tools was installed, and the databases were upgraded to run under the new version. Oracle tools were installed that provide for data entry screens, networking, and the execution of Fortran programs and PL/SQL procedures.

Current scripts and programs were tested to ensure that they work with the new version of Oracle. Oracle web server tools were installed and web listeners were created so that web applications can store data directly into the Oracle database. The test environment will be integrated with the production environment, and a cutover to the Y2K-compliant version of Oracle is planned for October 1999.

Management of user accounts and reporting

During FY1999, SCD's Database Services Group (DBSG) managed 1,133 user accounts in 715 projects, not including users and projects that had only mass storage charges. Approximately 50% of the users are from universities. Over 1,000 of the users had usage on the Cray or SGI machines, and 959 used the Mass Storage System to read or write files during FY1999.

During FY1999 DBSG implemented policies for inactive users with mass store files and for inactive projects with mass store files. A new policy for inactive projects which need to retain critical files supporting pending and recently published journal articles was developed and approved by the SCD Advisory Panel.

An additional database was created to run with Oracle software. This database will allow computer users access to detailed Mass Storage System metadata using tools developed by SCD. An Oracle option called partitioning was installed to facilitate management of the more than 20 GB of data in this database.

OIS section members will continue to focus on reducing the number of MSS files held by inactive users and projects. They will also work individually with university PIs who need to reduce their MSS growth rate to keep it within SCD's targets for universities.

