1998 ASR Home
Back
SCD ASR Index
Next
SCD Home

Machine dependencies

The Machine Dependencies Committee was formed in 1996 to determine the dependencies that exist in the SCD machine room in order to:
  1. Recover from a complete machine room shutdown or crash
  2. Certify and document the correct sequence of boot procedures for all machines and networks
  3. Determine the basic production environment for cases when complete recovery cannot be obtained
  4. Define the full production environment

The committee continues to define the boot time dependencies of every system in the room as systems are added to and removed from the machine room. This effort incorporates a review of network dependencies, the definition of inter-system dependencies, and analysis of how these dependencies change as systems are installed or removed.
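As an illustration of the analysis described above, the following is a minimal sketch of how the impact of removing a system could be computed from a dependency map. The map, the system names, and the function `affected_by_removal` are hypothetical and are not the committee's actual list or tooling.

```python
from collections import deque

# Hypothetical dependency map: each system lists the systems it needs.
# Names and edges are illustrative assumptions, not the committee's list.
DEPENDS_ON = {
    "network": ["power", "cooling"],
    "phoenix": ["power", "cooling", "network"],
    "mass_storage": ["phoenix"],
    "supercomputer": ["phoenix", "mass_storage"],
    "file_server": ["phoenix"],
}


def affected_by_removal(system, depends_on):
    """Return every system that directly or indirectly depends on `system`."""
    # Invert the map so we can walk from a system to its dependents.
    dependents = {}
    for name, prereqs in depends_on.items():
        for prereq in prereqs:
            dependents.setdefault(prereq, set()).add(name)

    # Breadth-first walk over the inverted map.
    affected = set()
    queue = deque([system])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

Calling `affected_by_removal("mass_storage", DEPENDS_ON)` on this sample map identifies the systems whose boot dependencies would need to be reviewed before the mass storage equipment is taken out of the room.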

Work completed in FY1998

The committee continued to update a "supercomputing-centric" boot time dependencies list and diagram. This diagram identifies predecessor/successor relationships by accounting for the prerequisites (or dependencies) of these aspects of the system:
  1. Facility and environmental infrastructure (power, chilled water, air conditioning, etc.)
  2. Networks and network infrastructure
  3. Phoenix systems (currently providing domain name service, license service, and user authentication service)
  4. Mass storage equipment
  5. Supercomputers
  6. DCE server (MSS metadata command service)
  7. File servers
  8. Special-purpose servers (job submission, dial-up, home directories, MIGS, etc.)
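The predecessor/successor relationships among the categories above can be sketched as a directed graph, from which a valid boot sequence follows by topological sorting. This is a sketch under stated assumptions: the only ordering taken from this report is that the phoenix systems start after the facility infrastructure and the network; every other edge, and the function `boot_order`, is hypothetical. It uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps to the set of categories that must be up before it.
# Apart from phoenix following facility and network, all edges are
# illustrative assumptions rather than the committee's actual diagram.
PREDECESSORS = {
    "facility": set(),
    "network": {"facility"},
    "phoenix": {"facility", "network"},
    "mass_storage": {"phoenix"},
    "supercomputers": {"phoenix", "mass_storage"},
    "dce_server": {"phoenix"},
    "file_servers": {"phoenix"},
    "special_purpose_servers": {"phoenix", "file_servers"},
}


def boot_order(predecessors):
    """Return one valid boot sequence; raise ValueError on a circular dependency."""
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as err:
        # The second element of args holds the detected cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err
```

A graph may admit several valid sequences; `boot_order` returns one of them. A circular dependency, which would make recovery from a complete shutdown impossible, is reported as an error rather than silently ordered.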

These dependencies are subcategorized as:

  1. Those required to power up the system and peripherals
  2. Those required to boot the system
  3. Those required to bring the system to a level where systems personnel can use it (i.e., restricted access)
  4. Those required for full, user-level production
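The four subcategories above behave like cumulative readiness levels: a system cannot be in restricted access unless it has also been powered up and booted. A minimal sketch of that idea follows. The `Level` names, the function `highest_level`, and the sample requirements are assumptions introduced for illustration only.

```python
from enum import IntEnum


class Level(IntEnum):
    """Readiness levels, in the order a system passes through them."""
    OFF = 0
    POWERED = 1      # system and peripherals powered up
    BOOTED = 2       # system booted
    RESTRICTED = 3   # usable by systems personnel
    PRODUCTION = 4   # full, user-level production


def highest_level(requirements, available):
    """Return the highest level reachable given the available services.

    `requirements` maps a Level to the set of services needed for it;
    `available` is the set of services currently up.
    """
    reached = Level.OFF
    for level in (Level.POWERED, Level.BOOTED, Level.RESTRICTED, Level.PRODUCTION):
        # Levels are cumulative: stop at the first unmet requirement.
        if not requirements.get(level, set()) <= available:
            break
        reached = level
    return reached


# Hypothetical requirements for one supercomputer.
SUPERCOMPUTER = {
    Level.POWERED: {"power", "chilled_water"},
    Level.BOOTED: {"network"},
    Level.RESTRICTED: {"dns", "authentication"},
    Level.PRODUCTION: {"mass_storage", "license_service"},
}
```

With this sample data, a supercomputer whose mass storage is unavailable would stop at `Level.RESTRICTED`, which corresponds to the basic production environment case where complete recovery cannot be obtained.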


Machine dependencies diagram at the end of FY1998.

Phoenix system

Phoenix systems were added to the Foothills Lab environment. The committee developed the phoenix system concept as a basic tool for recovering the machine room. The phoenix machines are the starting point for recovery once power, cooling, and the network infrastructure have been certified as functional. The services that most require continuous availability are those on which all other systems in the NCAR environment depend. The phoenix system is the set of computers, defined by the Machine Dependencies Committee, that hosts these services and needs to be organized in a high-availability configuration.

The concept of hot spare backup systems, which was already in place for user authentication services, was extended to include one other critical service: the Domain Name System (DNS), which is needed for machine address resolution. These two critical services for both the Mesa Lab and Foothills Lab were moved onto two new systems (one active and one hot spare) that were specifically purchased for this purpose. The two independent machines in the phoenix system run the most up-to-date hardware and software. Each machine has mirrored disk drives and redundant network connections. Their configuration prevents any single disk drive, network interface, or network segment failure from bringing down the phoenix system.
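The claim that no single disk drive, network interface, or network segment failure brings a machine down can be checked mechanically by failing each component in turn. The following is a minimal sketch, assuming a simplified model in which a machine works when at least one disk and at least one interface on a working segment survive. The `Machine` class, the component names, and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    disks: tuple   # names of the mirrored disk drives
    links: dict    # network interface name -> network segment name


def machine_up(machine, failed):
    """A machine works if one disk and one interface/segment pair survive."""
    disk_ok = any(disk not in failed for disk in machine.disks)
    net_ok = any(
        nic not in failed and segment not in failed
        for nic, segment in machine.links.items()
    )
    return disk_ok and net_ok


def single_points_of_failure(machines):
    """List (machine, component) pairs where one failure takes a machine down."""
    weak = []
    for machine in machines:
        # Assumes component names are unique across disks, NICs, and segments.
        components = (
            set(machine.disks) | set(machine.links) | set(machine.links.values())
        )
        for component in sorted(components):
            if not machine_up(machine, {component}):
                weak.append((machine.name, component))
    return weak
```

A machine with two mirrored disks and two interfaces on separate segments yields an empty list, whereas attaching both interfaces to the same segment would report that segment as a single point of failure.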

Procedures have been established to bring the hot spare into production quickly if the primary system fails. The phoenix systems are designed to be fully independent of all other computing systems, as shown in the "Boot time dependency diagram."
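The independence property can be expressed as a check on a dependency map: everything a phoenix system needs, directly or indirectly, should be facility or network infrastructure. The following is a minimal sketch of such a check. The function names, the map format, and the sample entries are assumptions, not the committee's actual data.

```python
def transitive_dependencies(system, depends_on):
    """Collect everything `system` needs, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(system, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, ()))
    return seen


def independence_violations(system, depends_on, infrastructure):
    """Return the dependencies of `system` that are not infrastructure."""
    return transitive_dependencies(system, depends_on) - set(infrastructure)


# Hypothetical sample data for illustration.
INFRASTRUCTURE = {"power", "cooling", "network"}
DEPENDS_ON = {
    "network": ["power", "cooling"],
    "phoenix": ["power", "cooling", "network"],
}
```

An empty result from `independence_violations("phoenix", DEPENDS_ON, INFRASTRUCTURE)` means the sample map is consistent with the design goal; any name in the result would identify a computing system that the phoenix machines must not rely on.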

The Distributed Services Group moved critical services such as software license service, some NFS service, workstation boot service, print service, and e-mail service onto multiple hot spare systems. This increased the reliability of these services because a hot spare can quickly be moved into service when a primary system fails.

Computer Production Group (CPG) contributions

CPG maintained the consistency and functionality of the "Boot time dependency diagram." Edits were made when machines were installed or deinstalled and when critical services were transferred from one machine to another. CPG used this information during scheduled facility power downs in October and April.
