From: irwin@ucar.EDU (B. Lynn Irwin)
Date: Thu, 18 Sep 1997 22:22:00 GMT
To: vandyke@ncar.ucar.edu, siemsen@ncar.ucar.edu, fair@ncar.ucar.edu, grissom@ncar.ucar.edu, marla@ncar.ucar.edu, irwin@ncar.ucar.edu
Subject: network fragileness

Analysis of a Fragile Network
-----------------------------

I thought I'd write up some of the discussion Jim, Pete, and I had at lunch today about the fragile nature of our current network structure. Since the lunch conversations, I've talked with Chris and again with Jim. While what's being proposed here is a product of all of our thinking, it represents my own viewpoint since I'm proposing it.

We are very concerned that there is a reasonably high probability that the entire UCAR network could go down in such a way that it would be almost impossible to get it back up again. For example, there are several scenarios in which a network failure would prevent access to the host containing the backup Cisco configurations. Should a network device need to be replaced, it would be really tough to configure it under those circumstances.

The fundamental problem is that the UCAR network is currently structured so that there are several places where a single failure can take down the entire network. We have not provided a reliable out-of-band method for accessing devices for diagnosis, nor have we provided a reliable method for out-of-band restoration of damaged or lost configurations.

Our current situation is substantially different from the past, when network robustness was inherently distributed: it came directly from the many physical network segments and the large number of router ports it took to interconnect them. A failure of a router port, a router, or a network segment would usually leave enough of the rest of the network working that the network itself could be relied upon as a tool for diagnosing and repairing the failed parts.

Now the situation is quite the opposite. The current network depends on a very few routers and switches with a very few ports. Worse yet, the whole structure is highly dependent upon complex, centralized software that resides in a very few of these network components. There is now a whole new class of network failures that will bring down ALL of the network, leaving little or no network functionality as an aid in diagnosing the broken parts. Even worse, almost all hosts will be incapacitated as well, since almost all hosts are network dependent. So now we've lost all of the network and practically all of the hosts. Essentially, there is nothing to telnet to, there is nothing to telnet over, and there is nothing to telnet from.

Fortunately, there are several ways to provide reliable out-of-band access and downloading capabilities, but they all share one common attribute: they must have NO dependencies on any in-band network services. The issue is that we haven't implemented any of these out-of-band networks, and we need to. I'll list and discuss a few possibilities.

PCs With Terminal Emulation Software
------------------------------------

One of the simplest things to do would be to have at least one PC configured with terminal emulation software to attach to the serial ports of suspect devices. However, to be useful in a crisis situation, the following must be true:

1. The PC must be locatable instantly. This means it is not allowed to be checked out and can't be someone's personal computer. All parts, including the power cord, have to be there as well.

2. Terminal software must be installed and people must be familiar with its use.

3. The PC must contain an up-to-date copy of all configuration files. If the files are only on a stationary computer or a computer that is dependent on the in-band network, then these files will not be available to restore a damaged or missing configuration file.
   To be up to date, these files will need to be periodically downloaded from their central location. (It is also assumed that all such configuration files are kept in a well-known central location and are kept up to date there as well.) There must be some way to cut and paste the PC files into the terminal emulation software on the PC.

4. At least one dedicated, labeled serial cable for each type of device to be serviced must be kept permanently with the PC. These serial cables must not be easy to disassemble, so that pieces of them cannot readily be removed and used for other purposes. Such a set of cables is necessary because it is not prudent to spend a large amount of time trying to build a working serial cable while in crisis mode, e.g., the main ATM switch is down or the main LANE server is dead and we can't get the damn PC to talk to it because we can't get the right serial cable put together because we can't find the pieces.

The major drawback to the PC solution is that only one device at a time may be examined. Also, it will be a pain to keep the config files up to date.

Out of Band Serial Network
--------------------------

Another solution is to build an out-of-band serial network with NO dependencies on the in-band network. Such a network would consist of stand-alone "serial access servers" connected to the serial ports of all in-band network devices and to the serial ports of a few stand-alone hosts. These hosts could access any in-band device through the serial network. If the serial access servers require external network access for booting or security information, then it would be necessary to provide a stand-alone server on a special network private to the serial access servers. Such a configuration would be required at each major site. A modem or two would be useful as well. An up-to-date set of all configuration files would need to be kept on at least one of the stand-alone hosts at each site.
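
The requirement that each stand-alone copy of the configuration files match the central copy could be checked periodically with something like the following. This is only a minimal sketch in Python, assuming both the central copy and the stand-alone copy are plain directories of files; the function names and directory layout are hypothetical and not part of any existing tool here.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_config_copy(central_dir, local_dir):
    """Compare a stand-alone copy of the configs against the central copy.

    Both directory arguments are hypothetical paths chosen by the caller.
    Returns a dict mapping file name -> "missing" or "differs" for every
    central file that the local copy lacks or holds with other contents.
    """
    problems = {}
    for name in sorted(os.listdir(central_dir)):
        central_path = os.path.join(central_dir, name)
        if not os.path.isfile(central_path):
            continue  # skip subdirectories in this simple sketch
        local_path = os.path.join(local_dir, name)
        if not os.path.isfile(local_path):
            problems[name] = "missing"
        elif file_digest(local_path) != file_digest(central_path):
            problems[name] = "differs"
    return problems
```

A check like this has to run while the in-band network is healthy, since it reads the central copy; its only purpose is to make sure the stand-alone copy is already current when the network is not.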

Use PVCs to Build a LANE-independent, ATM-dependent Network
-----------------------------------------------------------

We should define a PVC-based subnetwork (probably a Classical IP network) with each 5000's ATM card, each ATM router, and each ATM switch configured onto that network, along with one or more ATM-attached stand-alone hosts containing a copy of all of the configuration files. This could be a relatively simple PVC network, with PVCs to each network device from only one or two stand-alone hosts. One host each at ML and FL might suffice. Note that the proposed Classical IP network would be separate from any production ones, which are likely to use ATM ARP servers.

Such a configuration is LANE-independent and gives very good concurrent access to all crucial network devices except during a complete switch failure. This solution isn't terribly difficult, scales well, and offers a high probability of access during most network failures, though it isn't completely independent of the in-band network.

Build a Physical Out-of-band IP Subnetwork
------------------------------------------

Another thing that could be very helpful would be to build a LANE-independent physical IP subnetwork among all 5000s on a site. This could be the same independent network used to link the serial access servers. Again, one or two stand-alone hosts containing configuration files could attach to this network.

In fact, this principle could be generalized: a physical (LANE-independent) IP subnetwork should run through all of a location's data-communications closets as a management network to which all Ethernet-capable devices in the closet could attach. This network could be constructed using repeaters and/or a combination of VLAN ports on the closet's local 5000s. In most ways, this is probably the best solution. (When I mentioned this solution, Chris said that this is what Cisco actually recommends.)

After discussion with Jim, it was clear that a simple fiber-based repeated Ethernet, rooted in the ML/FL computer rooms with a fiber link to a repeater in each closet that has Ethernet-capable equipment, could be built without too much expense.

Use Robust Configurations to Start With
---------------------------------------

There are a few other measures that would be useful for limiting the impact certain kinds of network failures would have on the network as a whole.

1. We need to be careful with the ATM addresses used for the LANE LECS, LES, and BUS servers. Currently, the default addresses for these servers depend on the MAC address of the supervisor engine on older-model 5000s, and on the MAC address of the chassis on newer-model 5000s. The problem is that the ATM server addresses in the LECS database and the LECS ATM address in the ATM switches must be changed manually when these MAC addresses change. Since a chassis swap-out is less probable (and usually more controlled) than a supervisor failure, the problem isn't as great for the newer 5000s. This means that we should run the LANE servers on the newer 5000s.

   Since we configure two instances of the LANE servers at each of ML and FL, we need four ATM cards running in newer chassis. At ML, the two servers already reside on two ATM cards in the new 5500, and shortly the FL 5500 will handle one instance of the ATM servers on a single ATM card. We should add a second ATM card to the FL 5500 for the second server instance. This also gives both FL and ML the same redundancy.

   In general, we should select LANE server ATM addresses that are not dependent upon components with a reasonable chance of failing, and furthermore we should never use a hardcoded method of distributing or determining the LANE ATM server addresses whenever a dynamic method is available.

2. We probably ought to do some more ATM LANE failover tests. (yuk)

Finally, we probably ought to do ALL of the things discussed here.
Each of the proposed out-of-band methods has its weaknesses and strengths, and in fact I think they complement each other.

We should pick the low-hanging fruit first: implement the PVC Classical IP management network, and also configure and set aside a PC with serial cables. The next easiest thing to do is probably to build the physical management subnetwork. The hardest thing to do is probably to build a stand-alone serial network, but this might not be necessary if all of the other options are implemented.

In all of these solutions, we need to make sure that some of the ATM-attached and some of the management-network-attached workstations are stand-alone and contain copies of the backup configuration files.
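
One way to see how the methods complement each other is to write down what each one depends on and ask which remain usable for a given failure. The following is only an illustrative sketch in Python: the component names are hypothetical labels, and the dependency sets are a rough reading of the discussion above, not a measured property of the network.

```python
# Hypothetical component labels. Each access method maps to the set of
# components it is ASSUMED to need, based on the discussion in this note.
ACCESS_METHODS = {
    "pc_with_serial_cables": set(),                      # nothing in-band
    "serial_access_network": {"serial_access_servers"},
    "pvc_classical_ip": {"atm_switches"},                # LANE-independent
    "physical_mgmt_ethernet": {"mgmt_repeaters"},        # LANE-independent
    "in_band_telnet": {"atm_switches", "lane_servers", "routers"},
}


def usable_methods(failed):
    """Return, sorted by name, the methods with no failed dependency."""
    failed = set(failed)
    return sorted(
        name for name, deps in ACCESS_METHODS.items()
        if not (deps & failed)
    )


if __name__ == "__main__":
    # A LANE server failure should take out only in-band access.
    print(usable_methods({"lane_servers"}))
    # A complete ATM switch failure also takes out the PVC network.
    print(usable_methods({"atm_switches"}))
```

Under these assumed dependencies, no single failed component removes every method, which is the argument for doing more than one of them.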