High availability of distributed systems

As technology evolves, users expect rapid, significant improvements with increased reliability. What once was considered to be a "special case" is now generally commonplace. The users of NCAR's computer systems and networks demand more computing system reliability and stability than they have in the past. It is no longer acceptable for critical services such as electronic mail or Domain Name Service (DNS) to be interrupted, let alone unavailable for long periods of time. It is thus vital to protect such services with more robust system configurations that can sustain hardware and software failures without disrupting service to the users. A "high-availability" configuration of such systems ensures continued use of or access to a critical service despite system failures in either hardware or software.

The Scientific Computing Division has established a Machine Dependencies Committee that has reviewed system interdependencies at the Mesa Lab and identified possible single points of failure. The two most important services requiring continuous availability are DNS and user authentication, since all other systems in the NCAR environment depend on them. To ensure continuous availability of these two services, the committee recommended hosting them on a high-availability configuration, code-named the "Phoenix Project." The Phoenix Project was placed into operation during FY1997 and was expanded to the Foothills Lab in the spring of 1998. DNS and user authentication are among the set of operation-critical services provided by the "phoenix" systems. The security gateway server complex that was brought online in FY1998 was also configured for high availability.

The Distributed Systems Group (DSG) in the High Performance Systems section of SCD expanded the use of high-availability configurations within the SCD computing environment in FY1999. Its main contribution was a new Sun file server (fileserver), an Enterprise 5000 with an A3500 storage unit that uses redundant power supplies, hot-standby network cards (dynamic routing), and dynamic reconfiguration that allows the system to bypass failed hardware. Although this system is not fault-tolerant (that is, it does not provide the level of hardware and software that guarantees uptime), it has less downtime than the previous SGI file server. Newer systems being acquired by DSG for critical applications will be equipped with a hardware configuration similar to that of the new file server (e.g., redundant hardware components, dynamic routing) to minimize downtime caused by hardware failures.
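To illustrate why redundant components reduce downtime without guaranteeing uptime, the following is a minimal sketch of the standard availability calculation for parallel components. It assumes that component failures are independent, which real hardware only approximates. The function names and the sample figures are hypothetical and are not measurements from any SCD system.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; ignores leap years


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of a set of redundant components where one working copy suffices.

    Assumes failures are independent: the set is down only when every copy is down.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative value only: a single power supply assumed available 99% of the time.
    single = parallel_availability(0.99, 1)
    dual = parallel_availability(0.99, 2)
    print(f"single supply: {expected_downtime_hours(single):.1f} h/year down")
    print(f"dual supplies: {expected_downtime_hours(dual):.1f} h/year down")
```

Under these assumptions the redundant pair is still not guaranteed to stay up; its expected downtime is only much smaller, which matches the distinction drawn above between high availability and fault tolerance.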

Previous attempts to use high-availability software on DSG systems, such as Qualix HA+, were plagued with problems, and DSG decided not to use commercial software solutions at this time. The Phoenix Project, which uses a hot standby to back up the main production server, has had great success in eliminating downtime. The hot-standby approach requires manual intervention by a system administrator or operator to bring the backup system online, but to date it has not resulted in any major production problems. This approach will be adopted for the new Sun workstation administration configuration and the new web complex of servers, both due to come online in FY2000.
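The hot-standby arrangement with manual failover can be sketched as follows. This is an illustration under stated assumptions, not the Phoenix Project's actual procedure: the `StandbyMonitor` class, the `probe` callable, and the failure threshold of three are all hypothetical. The point it shows is that monitoring only raises an alert, and a person performs the switch to the backup system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StandbyMonitor:
    """Tracks primary health and flags when an operator should promote the standby."""

    probe: Callable[[], bool]    # hypothetical health check; True if the primary answers
    failure_threshold: int = 3   # consecutive failures before alerting (assumed value)
    consecutive_failures: int = 0
    active: str = "primary"
    alerts: List[str] = field(default_factory=list)

    def check(self) -> None:
        """Run one probe; alert the operator once when the threshold is reached."""
        if self.probe():
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if (
            self.consecutive_failures == self.failure_threshold
            and self.active == "primary"
        ):
            # No automatic failover: the monitor only notifies a person.
            self.alerts.append("primary unresponsive: operator should promote standby")

    def promote_standby(self) -> None:
        """Manual step: an operator calls this to bring the standby online."""
        self.active = "standby"
        self.consecutive_failures = 0


if __name__ == "__main__":
    # Simulated probe results: one success followed by three failures.
    results = iter([True, False, False, False])
    monitor = StandbyMonitor(probe=lambda: next(results))
    for _ in range(4):
        monitor.check()
    print(monitor.alerts)
    monitor.promote_standby()   # performed by the operator, not by the monitor
    print(monitor.active)
```

Keeping promotion as an explicit manual call trades a longer recovery time for simplicity, since the monitor can never switch servers on a false alarm.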

Further, the Office Systems Group's (OSG's) Wintel server configuration uses a high-availability configuration to ensure uninterrupted Wintel client service. The servers run Microsoft NT server software, which supports high-availability configurations in a production environment. The Microsoft fail-over software, unlike UNIX high-availability software, has been widely deployed throughout the computer industry and has proven itself to be extremely reliable.

SCD will continue to evaluate high-availability products to determine their suitability, efficacy, and cost-effectiveness in the SCD and NCAR computing environments. SCD will deploy those determined to be suitable.
