Troubleshooting IBM Etherchannel Connectivity

IBM's AIX has the capability to bond two gigabit ethernet interfaces into an etherchannel for improved reliability and capacity. While this is in theory a good thing, its current implementation is less than intuitive when debugging from the network side. This document explains the general operation of the feature as well as some specific problems which have cropped up.

There are two modes which are currently in use on the IBM supercomputers. For lack of better terms, they will be referred to as 'link' and 'hash'. The link mode uses one active interface and shuts down the backup interface. If the primary link fails, the backup link is activated to take its place. The hash mode activates both links, and sends packets down one link or the other based on a hash of the IP header. Both are sub-optimal in their visibility to the network.

The link mode is simplest to debug. In normal operation, only a single interface is being used. As a result, it can be debugged just like any normal single network connection. The downside is that because the backup interface is kept in a down state, it appears to be Inactive according to the port lists. In fact, it may accumulate hundreds of days of inactivity. Worse, if for some reason the backup port has a problem, whether it be cabling or configuration, the problem won't be discovered until the primary interface fails.

The hash mode is more robust, but harder to troubleshoot from a networking point of view. On the host side, the two ports are bonded into a single logical interface. However, on the switch side, the two ports are two independent connections. The result is that the CAM entry for the host's MAC address is not associated with both ports. Rather, it is associated with one port or the other at any given time, but never both. The switch moves the CAM entry back and forth based upon the data flow from the host.
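
The CAM behavior described above can be illustrated with a toy model. This is only a sketch of generic source-MAC learning, not the switch's actual implementation. The class name and the sequence of ports are hypothetical; the MAC address and port numbers are taken from the example later in this document.

```python
# Toy model of switch MAC learning (an illustration, not any vendor's code).
class CamTable:
    def __init__(self):
        # MAC address -> port on which that MAC was most recently seen as a source
        self.entries = {}

    def learn(self, src_mac, ingress_port):
        # A frame from the same MAC on a different port overwrites the old entry,
        # so the table never holds both ports at once.
        self.entries[src_mac] = ingress_port

    def lookup(self, mac):
        return self.entries.get(mac)


cam = CamTable()
host_mac = "00-02-55-9a-3b-cc"
history = []

# Hypothetical sequence: the host sends each frame out of whichever
# bonded link its hash selects, so the ingress port varies.
for port in ["9/15", "9/15", "10/25", "9/15", "10/25"]:
    cam.learn(host_mac, port)
    history.append(cam.lookup(host_mac))

print(history)
```

At every step the table holds exactly one port for the host, which is why a single 'show cam' lookup only ever reveals half of the channel.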

This mode is insidious from a debugging point of view. The normal debugging steps on a switch are to find the MAC address of interest and locate the single port it's connected to (either via 'show cam' or the port lists). Once the single port is located, link and errors can be verified via 'show port', and traffic flow can be verified via 'show mac' and the netstat graphs. However, in hash mode, the single located port may be 'good' in every sense of the word. It has link, it has no errors, and it is carrying a significant flow of traffic. Yet at the same time, the second host interface may be misconfigured and causing significant connectivity problems for the host.

If the second port is experiencing link problems, the redundancy works properly, since the host can detect this and will use only the single remaining good port. If the second port is misconfigured, for example because it is in the wrong VLAN, then the host will continue to send traffic down the link, where it will be lost because it lands in the wrong VLAN. The symptom from the point of view of the host is that some destinations are reachable and others are not. Packets which are sent down the good link based on the hash algorithm arrive at their destination. Packets which are sent down the bad link do not. The hash is based on the IP addresses involved, so some connections will work perfectly while others will fail entirely.
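
The "some destinations work, others don't" symptom can be sketched in a few lines. The hash used here (last byte of the destination address modulo the number of links) is an assumption chosen for illustration; this document does not specify the algorithm AIX actually uses. The interface names and IP addresses are also hypothetical.

```python
import ipaddress

def pick_link(dst_ip, links):
    # Assumed hash for illustration only: last byte of the destination
    # IP address modulo the number of bonded links.
    last_byte = int(ipaddress.ip_address(dst_ip)) & 0xFF
    return links[last_byte % len(links)]

links = ["ent0", "ent1"]   # hypothetical names for the two bonded interfaces
broken = {"ent1"}          # e.g. its switch port sits in the wrong VLAN

def reachable(dst_ip):
    # The host still has link on the broken port, so it keeps hashing
    # traffic onto it; those packets are simply lost.
    return pick_link(dst_ip, links) not in broken

for dst in ["10.0.0.10", "10.0.0.11", "10.0.0.12", "10.0.0.13"]:
    status = "ok" if reachable(dst) else "lost"
    print(dst, pick_link(dst, links), status)
```

Under this assumed hash, a given destination always maps to the same link, so a failing destination fails every time rather than intermittently.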

When a hash etherchannel is working properly, repeated invocations of 'show cam' for the target MAC address should occasionally return different results. For example:


ml-mr-c1-gs> show cam 00-02-55-9a-3b-cc
* = Static Entry. + = Permanent Entry. # = System Entry. R = Router Entry.
X = Port Security Entry $ = Dot1x Security Entry

VLAN  Dest MAC/Route Des    [CoS]  Destination Ports or VCs / [Protocol Type]
----  ------------------    -----  -------------------------------------------
215   00-02-55-9a-3b-cc             9/15 [ALL]
Total Matching CAM Entries Displayed  =1
ml-mr-c1-gs> show cam 00-02-55-9a-3b-cc
* = Static Entry. + = Permanent Entry. # = System Entry. R = Router Entry.
X = Port Security Entry $ = Dot1x Security Entry

VLAN  Dest MAC/Route Des    [CoS]  Destination Ports or VCs / [Protocol Type]
----  ------------------    -----  -------------------------------------------
215   00-02-55-9a-3b-cc             10/25 [ALL]
Total Matching CAM Entries Displayed  =1
ml-mr-c1-gs>

If the MAC address never moves, or shows up in two VLANs, that would be cause for further investigation. However, note that some machines legitimately have two ports in different VLANs with the same MAC address. This is most common on Sun machines, so that in and of itself cannot be considered a sign of problems. Also, most MAC addresses never move, so that is also not a certain sign of a problem. However, if you are debugging an AIX etherchannel connection which SSG has confirmed is in 'hash' mode, then these should be considered signs of a problem.
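
When many samples have been collected, the check can be done mechanically. The following is a minimal sketch that assumes the 'show cam' output has been pasted into a string and that entry rows look like the ones in the example above (VLAN, MAC, then module/port, with the CoS column empty). The function names are hypothetical, and the result is only a hint: as noted above, it matters only for a host confirmed to be in 'hash' mode.

```python
import re

# Matches an entry row such as:
#   215   00-02-55-9a-3b-cc             9/15 [ALL]
ROW = re.compile(r"^(\d+)\s+((?:[0-9a-f]{2}-){5}[0-9a-f]{2})\s+(\d+/\d+)\b", re.I)

def cam_entries(output, mac):
    """Return the set of (vlan, port) pairs seen for mac in pasted output."""
    found = set()
    for line in output.splitlines():
        m = ROW.match(line.strip())
        if m and m.group(2).lower() == mac.lower():
            found.add((int(m.group(1)), m.group(3)))
    return found

def assess(entries):
    """Classify the samples according to the signs described in the text."""
    vlans = {vlan for vlan, _ in entries}
    ports = {port for _, port in entries}
    if len(vlans) > 1:
        return "MAC seen in more than one VLAN"
    if len(ports) < 2:
        return "MAC never moved between ports"
    return "MAC moves between ports in one VLAN"

# Entry rows copied from the two lookups shown above.
sample = """\
215   00-02-55-9a-3b-cc             9/15 [ALL]
Total Matching CAM Entries Displayed  =1
215   00-02-55-9a-3b-cc             10/25 [ALL]
Total Matching CAM Entries Displayed  =1
"""

entries = cam_entries(sample, "00-02-55-9a-3b-cc")
print(sorted(entries), assess(entries))
```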

The best long-term solution to this problem is to determine whether the switch and host can be configured so that both sides know the two links are bonded. That way, it will be possible to identify both ports which need to be considered during the troubleshooting process. This will also give the host increased capacity from the switch to the host, since the switch will see the two links as a true etherchannel. Optimally, the channel can be negotiated between the switch and host using either LACP or PAgP. This should help ensure that if the channel exists on the switch side, all cabling and configuration is good. If the channel is hard-coded on the switch, the channel could appear to be 'up' even if the two links went to different hosts.

IBM Documents

Here are some links to IBM documentation on the subject.