The Operations and Information Support Section (OIS) provides 24-hours-per-day, 365-days-per-year problem identification and resolution tracking for all SCD computers and UCAR's network. OIS designs, builds, and implements Information Technology solutions to support the SCD mission. In addition, the section maintains SCD's computing center environmental and hardware infrastructure, including planning, staging, and all new hardware installs. OIS provides a reliable IT and physical infrastructure that incorporates problem detection; this is the foundation for the successful operation of NCAR's supercomputing facility.
Starting in early FY2000, the Computer Production Team began combining its traditional roles, balancing overlapping responsibilities to meet the demands of a true 24-hour-a-day, 365-day-a-year Network Operations Center (NOC). The team already provides reliable problem reporting, resolution tracking, and 24-hour-a-day support for SCD and UCAR's supercomputing equipment. The change expands the NOC in conjunction with SCD's traditional computer center coverage. The Computer Production Group (CPG) quickly identifies software, hardware, and network connectivity problems by monitoring hundreds of devices and network connections.
The computing center at the Mesa Lab has been the traditional domain for the Computer Production team. The team continues to support production with the Mass Storage System, supercomputing resources, web, e-mail and other business-critical services for the smooth operation of SCD, NCAR, UCAR and the university community. In addition to the traditional computer operator role, the team has taken on the Network Operations Center role, supporting multiple groups.
The Computer Production Team identifies problems in three main network fabrics: the UCAR/NCAR network, the Front Range Gigapop (FRGP), and the Boulder Research Administrative Network (BRAN). In conjunction with the NETS section, the NOC acts as the single point of contact for handling issues as they arise. This service has allowed UCAR and the members of the consortia to aggregate their funding, manage costs, and maintain a high quality of service as networks become critical to everyone's daily operations.
The FRGP has seen rapid growth in the past year and consists of the following members:
- Colorado School of Mines
- Colorado State University system
- Denver University
- National Center for Atmospheric Research
- State of Colorado
- University of Colorado at Boulder
- University of Colorado at Denver
- University of Colorado at Colorado Springs
- University of Colorado HSC
- University of Wyoming
The BRAN member institutions include:
- The City of Boulder
- University of Colorado at Boulder
- National Center for Atmospheric Research
- Department of Commerce Laboratories
The Machine Dependencies Committee was formed in 1996 to establish the dependencies that exist in the SCD computer center. The Machine Dependencies diagram is used to:
- Recover from a complete machine room shutdown or interruption
- Certify and document the correct sequence of boot procedures for all machines and networks
- Define the full production environment and determine the basic production environment for cases when complete recovery cannot be obtained
The ongoing boot-time dependencies of every system in the computing center continue to be redefined as various systems are added to and removed from the computing center. This effort incorporates a review of network dependencies, the definition of various inter-system dependencies, and analysis of how these dependencies change as systems are installed or removed. The Computer Production Group (CPG) uses the dependency diagram to bring everything down in an orderly fashion and restore the systems to production.
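One way to picture how a dependency diagram yields an orderly boot and shutdown sequence is a topological sort over the dependency graph. The following is a minimal sketch only, not the procedure CPG uses: the system names in `DEPENDS_ON` and the functions `boot_order` and `shutdown_order` are hypothetical, introduced purely for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map (not SCD's actual diagram): each key lists
# the systems that must already be running before that system can boot.
DEPENDS_ON = {
    "core-network": set(),
    "name-service": {"core-network"},
    "mass-storage": {"core-network", "name-service"},
    "supercomputer": {"mass-storage", "name-service"},
    "web-and-mail": {"name-service"},
}


def boot_order(depends_on):
    """Return a boot sequence in which every system follows its prerequisites.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return a shutdown sequence: the boot sequence reversed, so that
    dependent systems go down before the systems they rely on."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("Boot:    ", " -> ".join(boot_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

Adding or removing a system then only requires updating the map, and a circular dependency surfaces as an error instead of a stalled recovery.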
During FY2000, we experienced three scheduled power-downs of the computing center. The dependency diagram has saved hours in the painstaking process of bringing the computing center machines up and down. Back in 1986, it took over eight hours to recover a completely down room. This year, the room was recovered in about two and a half hours from a completely down state.
The continuing rapid pace of technological change was readily apparent in the equipment installations and removals for FY2000. The Cray C90 was decommissioned and later removed from the floor. Two Cray J90s were decommissioned as well, along with numerous smaller systems.
A major project for FY2000 was the relocation of the STK silos and the addition of two more. Three silos that were on the raised floor were moved to free up valuable raised floor space. The relocation also allowed for a more efficient layout that accommodates the two additional silos. The blackforest system was upgraded from two-processor nodes to four-processor nodes, and the memory was doubled. OIS played a key role, as this represented a significant logistical effort. Finally, a Compaq ES40 cluster was added to the environment in January 2000.
The computer room infrastructure continues to be improved. Additional air handling capacity was added to the east portion of the computer room to provide some redundancy for the smaller server equipment. Additionally, the heat recovery system was modernized and is now fully functional; this allows the excess heat that is removed from the computer room to be used in heating the rest of the building. Finally, the DataTrax system was upgraded to a more modern client-server system that allows remote viewing of the environmental conditions, UPS status monitoring, and other important infrastructure considerations.
Another major infrastructure project in FY2000 was the design and buildout of a new operator console area. This offered several advantages: the operator staff is now isolated from some of the noise and has a more ergonomically correct working space. Additionally, the relocation freed up valuable raised floor space that will be needed for the equipment additions anticipated in FY2001.
In FY2000, the Infrastructure Applications Group (IAG) pushed the Trouble Ticket System to full production with extensive staff training, both in seminars and on an individual basis. IAG also upgraded the Remedy server to 3.2.1 and the client to 4.0. The client upgrade eliminated any noticeable display difference between the Unix and Windows clients, thereby reducing user confusion and administration difficulties.
Operations staff designed and implemented user enhancements to the system, including a spell check application utilizing OLE technologies, enhanced views for the Computer Production Group (CPG), and refinement of automated reports. Work continues into FY2001 to redefine request types so that they provide more intuitive categorization for the user, further streamlining the trouble ticket process. In addition, work will now commence on defining and implementing notification and escalation procedures for each section and group within the division.
Significant progress was made in FY2000 in defining and implementing contract maintenance procedures and processes for the inventory system. A couple of final procedures remain to be put in place, but overall the system is operating as planned and has succeeded in providing excellent reports and information to help track SCD's assets. OIS has already realized significant staff time savings during hardware maintenance contract renewals due specifically to the inventory system.
Ongoing work remains in populating the database and creating relationships between the data, as well as some small development work relating to escalations, notifications, and categorizations. These details will be finalized in FY2001. The addition of a Contracts Administrator this year will significantly increase the success and usefulness of this system.
In FY2000, IAG completed several web-based applications that allow SCD to automate and streamline processes throughout the division and organization.
An automated archival system was developed to retrieve vendor performance management requests (trouble tickets) from the vendors' servers and store them in our in-house Remedy database. A web-based application was then developed to access these performance management requests, display them in a user-friendly fashion, and allow SCD staff to add comments and details to the open tickets.
A report from this system was provided for the SPXXL users group meeting in FY2000. The system was the sole source for detailed vendor service call information, resulting in significant negotiated maintenance cost savings for SCD.
The system continues to function on a daily basis. There are a few code enhancements necessary to ensure flawless operation; these are scheduled for FY2001.
The RAS web-based form was enhanced to provide user database lookup functions upon submission. This system allows UCAR/NCAR community users to apply for Remote Access accounts via the web, and use the lookup features to automatically verify the prior establishment of a computer resource account.
A web-based reporting application was designed and implemented in FY2000 to provide easy viewing access to the inventory system for all SCD staff. This application allows each individual user to check for assets assigned to them, and it quickly notifies the database administrator of any inconsistencies or errors. The use of this system greatly reduced ISG staff time in tracking desktop systems.
There are plans to expand the functionality of this system in FY2001 to incorporate some auto-discovery tools.
In FY2000, a need arose for a project management system to track division projects. IAG developed a web-based project management system that utilizes the Remedy database. The system was designed to allow for the tracking of, and reporting on, three-tiered projects by section, group, or individual.
This project was successful and was fully deployed and moved into production in mid-FY2000.
The Infrastructure Applications Group (IAG) has been charged with researching, designing, developing, and implementing a knowledge/application management portal for SCD to streamline processes, make daily tasks easier, reduce redundancy, decrease document searching time, and increase information usage in the decision-making processes throughout the division.
The group initially focused on determining technical requirements for an off-the-shelf (OTS) partial solution; IAG has made significant progress researching and evaluating several vendors and products. The work from a student assistant contributed immensely. A recommendation is expected early in FY2001. Design, development, and implementation are projected to span through FY2001 and possibly into FY2002.
Information about our web portal research in FY2000 appears in the research section of this report.
Throughout FY2000, OIS identified several areas that could be improved to eliminate duplication of effort and increase staff efficiencies. The ultimate goal of streamlining processes is to enable SCD staff to better support the university community. Two areas that were identified are monitoring and user authentication.
Monitoring system
Currently, SCD has numerous systems for monitoring the networks, servers, and supercomputers. A significant portion of the maintenance is being handled by numerous staff throughout the division. Late in FY2000, a committee of these staff was formed to identify the monitoring needs and requirements. This work was completed, and in FY2001 a group will be looking at different packages to implement a global solution for the division. Several collaborative efforts are in the works with NCSA and possibly NSA.
User authentication and authorization
Another area identified for streamlining is the set of processes for setting up user accounts and handling authorization. Currently, a user can have ten or more different authentication methods and even more passwords to keep track of. A centralized password scheme was proposed and is being investigated. This would offer several advantages, including reducing the effort and the number of different tasks required for setting up user accounts. It would also centralize the password databases and make it possible to better secure them. A single password does create a single access point to everything, but it is widely assumed that, because of the number of passwords currently required, users frequently set them all to the same value anyway. This will be an ongoing effort in FY2001.