Machine dependencies
The Machine Dependencies Committee was formed in 1996 to determine the
dependencies that exist in the SCD machine room. Its goals are to:
- Recover from a complete machine room shutdown or crash
- Certify and document the correct sequence of boot procedures for all
machines and networks
- Determine the basic production environment for cases when complete
recovery cannot be obtained
- Define the full production environment
The boot time dependencies of every system in the room are redefined
continually as systems are added to and removed from the machine room.
This effort incorporates a review of network dependencies, the
definition of inter-system dependencies, and analysis of how these
dependencies change as systems are installed or removed.
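The effect of removing a system can be illustrated with a small sketch. The dependency map and system names below are hypothetical and are not taken from the committee's list; the function simply walks the graph to find every system that directly or indirectly depends on the one being removed.

```python
from collections import deque

# Hypothetical dependency map: each system maps to the systems it needs first.
# The names are illustrative and are not taken from the committee's diagram.
DEPENDS_ON = {
    "network": {"power"},
    "dns": {"network"},
    "file_server": {"dns"},
    "supercomputer": {"dns", "file_server"},
}


def affected_by_removal(depends_on, removed):
    """Return every system that directly or indirectly depends on `removed`."""
    # Invert the map so we can walk from a system to its dependents.
    dependents = {}
    for system, prereqs in depends_on.items():
        for prereq in prereqs:
            dependents.setdefault(prereq, set()).add(system)

    # Breadth-first walk over the inverted map.
    affected = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

Under these assumptions, calling affected_by_removal(DEPENDS_ON, "dns") lists the systems whose boot procedures would need review if the DNS host were removed or replaced.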
Work completed in FY1998
The committee continued to update a "supercomputing-centric" boot time
dependencies list and diagram. This diagram identifies
predecessor/successor relationships by accounting for the prerequisites
(or dependencies) of these aspects of the system:
- Facility and environmental infrastructure (power, chilled water, air
conditioning, etc.)
- Networks and network infrastructure
- Phoenix systems (currently providing domain name service, license service,
and user authentication service)
- Mass storage equipment
- Supercomputers
- DCE server (MSS metadata command service)
- File servers
- Special-purpose servers (job submission, dial-up, home directories,
MIGS, etc.)
These dependencies are subcategorized as:
- Those required to power up the system and peripherals
- Those required to boot the system
- Those required to bring the system to a level where systems personnel
can use the system (i.e., restricted access)
- Those required for full, user-level production
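Predecessor/successor relationships of this kind can be turned into a boot sequence with a topological sort. The following is a minimal sketch, not the committee's actual procedure: the category names and most of the edges are simplified assumptions, and only the chain from facility to networks to phoenix systems follows the recovery order described in this report. It uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified prerequisites. Each key maps to the categories
# that must be up before it can be started.
BOOT_PREREQS = {
    "facility": set(),
    "networks": {"facility"},
    "phoenix": {"networks"},
    "mass_storage": {"phoenix"},                   # assumed edge
    "file_servers": {"phoenix"},                   # assumed edge
    "supercomputers": {"phoenix", "mass_storage"}, # assumed edges
}


def boot_order(prereqs):
    """Return one valid boot sequence; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(prereqs).static_order())


def is_valid_order(order, prereqs):
    """Check that every prerequisite appears before the system that needs it."""
    position = {name: index for index, name in enumerate(order)}
    return all(
        position[prereq] < position[system]
        for system, needed in prereqs.items()
        for prereq in needed
    )
```

Several valid sequences usually exist, so the sketch checks an order against the prerequisites rather than expecting one fixed result. A cycle in the map would indicate a dependency that makes recovery from a complete shutdown impossible without manual intervention.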
Machine dependencies diagram at eFY1998.
Phoenix system
Phoenix systems were added to the Foothills Lab environment.
The committee developed the phoenix system concept as a basic tool for
recovering the machine room. The phoenix machines are the starting point
for recovery once power, cooling, and the network infrastructure have been
certified as functional. The services that most require continuous
availability are those upon which all other systems in the NCAR environment
depend. The phoenix system is the set of computers, defined by the Machine
Dependencies Committee, that hosts these services and therefore needs to be
organized in a high-availability configuration.
The concept of hot spare backup systems, which was already in place for
user authentication services, was extended to include one other critical
service: the Domain Name System (DNS), which is needed for machine address
resolution. These two critical services for both the Mesa Lab and Foothills
Lab were moved onto two new systems (one active and one hot spare) that were
specifically purchased for this purpose. The two independent machines in the
phoenix system run the most up-to-date hardware and software. Each machine
has mirrored disk drives and redundant network connections. Their
configuration prevents any single disk drive, network interface, or network
segment failure from bringing down the phoenix system.
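The "no single failure" property can be checked mechanically. The sketch below assumes a hypothetical model of one phoenix machine in which each group holds interchangeable components and the machine stays up while every group has at least one working member; the component names are invented for illustration.

```python
from itertools import chain

# Hypothetical model of one phoenix machine: each group lists redundant parts,
# and the machine stays up as long as every group has one working part.
MACHINE = {
    "disk": {"disk_a", "disk_b"},                    # mirrored drives
    "network_interface": {"nic_a", "nic_b"},         # redundant interfaces
    "network_segment": {"segment_a", "segment_b"},   # redundant segments
}


def is_up(groups, failed):
    """Return True if every group still has a component that has not failed."""
    return all(parts - failed for parts in groups.values())


def single_points_of_failure(groups):
    """Return the components whose failure alone would bring the machine down."""
    components = set(chain.from_iterable(groups.values()))
    return sorted(c for c in components if not is_up(groups, {c}))
```

With the redundant configuration above, single_points_of_failure(MACHINE) returns an empty list. Removing one disk from the "disk" group would make the remaining disk show up as a single point of failure.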
Procedures have been established to bring the hot spare into production
quickly if the primary system fails. The phoenix systems
are designed to be fully independent of all other computing systems, as shown
in the "Boot time dependency diagram."
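The independence requirement can be expressed as a check on a dependency map: following every prerequisite of a phoenix machine should reach only facility and network infrastructure. This is a sketch under assumed names; the map, the machine names, and the split between infrastructure and computing systems are all hypothetical.

```python
# Hypothetical prerequisite map: each system maps to what it needs first.
PREREQS = {
    "networks": {"facility"},
    "phoenix_primary": {"networks"},
    "phoenix_spare": {"networks"},
    "file_server": {"phoenix_primary"},
}
PHOENIX = {"phoenix_primary", "phoenix_spare"}
INFRASTRUCTURE = {"facility", "networks"}


def all_prerequisites(prereqs, system):
    """Return the direct and indirect prerequisites of `system`."""
    seen = set()
    stack = [system]
    while stack:
        current = stack.pop()
        for prereq in prereqs.get(current, ()):
            if prereq not in seen:
                seen.add(prereq)
                stack.append(prereq)
    return seen


def independence_violations(prereqs, phoenix, infrastructure):
    """Return computing systems that any phoenix machine depends on."""
    violations = set()
    for machine in phoenix:
        needed = all_prerequisites(prereqs, machine)
        violations |= needed - infrastructure - phoenix
    return violations
```

An empty result means the phoenix machines can be started as soon as the infrastructure is certified. Any name in the result marks a dependency that would have to be removed before the machines could serve as the starting point for recovery.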
The Distributed Services Group moved critical services such as software
license service, some NFS service, workstation boot service, print
service, and e-mail services onto multiple hot spare systems. This
increased the reliability of these services because the hot spares
could quickly be moved into service when the primary system failed.
Computer Production Group (CPG) contributions
CPG maintained the consistency and functionality of the "Boot time dependency
diagram." Edits were made when machines were installed or deinstalled and
when critical services were transferred from one machine to another.
CPG used this information during scheduled facility power downs in October
and April.