DSG uses the popular freeware system monitoring program "Big Brother" (BB) to monitor certain systems. (More information about BB is available at www.bb4.org.) This page explains the items displayed by SCD's monitor, which you can view in your browser at http://sysmon.scd.ucar.edu while you are "inside" the outer UCAR security perimeter.
Note that each icon displaying status for a given item (usually green, yellow, or red) has information behind it. If you click on the colored icon, additional text information about the item will be presented as a second-level display. By clicking the "HISTORY" button at the bottom of the second-level display, you will see a third-level display presenting the history of that item for the past 24 hours.
Active monitoring of the Mail Relay nodes (mdir[1-2].ucar.edu and mscan[1-6].ucar.edu) is done by a Big Brother client installed on each of these machines. A list of mail relay machines appears near the top of the display at http://sysmon.scd.ucar.edu. These machines are grouped under the heading "UCAR Mail Gateway (Linux Cluster)".
Also appearing in the display is the virtual address "mdir.ucar.edu". Either mdir1 or mdir2 serves requests to this address by configuring the mdir address on a special interface. Only one of mdir1 or mdir2 can have this address configured at a time. The display entry for mdir.ucar.edu presents only the items "conn" and "smtp" because their state is determined by the Big Brother server on sysmon itself.
The headings "conn" and "smtp" are generic items that indicate that there is connectivity to the machine and that an SMTP agent is present on port 25. The generic item "cpu" provides the number of users, the number of processes, and the current load on the machine.
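As an illustration of what a check like "smtp" involves, the following is a minimal sketch in Python, not Big Brother's own test. It assumes that seeing the standard SMTP "220" greeting on port 25 is enough to call the agent present; the function names and the 10-second default timeout are hypothetical.

```python
import socket


def looks_like_smtp_greeting(banner: bytes) -> bool:
    """Return True if the server's first bytes start with the SMTP 220 greeting code."""
    return banner.startswith(b"220")


def smtp_agent_present(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Connect to host:port and look for an SMTP greeting.

    The timeout value is an assumption for this sketch. Any connection
    failure or timeout is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(512)
    except OSError:
        return False
    return looks_like_smtp_greeting(banner)
```

For example, smtp_agent_present("mdir.ucar.edu") would exercise the virtual address in the same spirit as the "smtp" item, from wherever the script is run.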
The items of most interest on these machines are the four non-generic (custom) items named db, load, mprocs, and postq. These four items give you the most insight into what is happening on the nodes of the relay system.
In addition to the information displayed behind each icon on the sysmon display, the four custom classes of information above include mail capability so that notification, and more detailed information in some cases, can be delivered to mailboxes on mail.ucar.edu. This mail capability includes a rudimentary history mechanism that keeps repeated mail messages from being sent once monitoring of one of the four types above has gone into an error state. However, if a machine is severely impacted by loss of a major service, e.g. if a machine has lost its name resolution daemon, this mail delivery feature cannot be expected to function in a timely manner. Therefore, keeping an eye on the web display from sysmon.scd.ucar.edu is the most important thing to do.
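The history mechanism is not described in detail here, but the idea of suppressing repeated mail can be sketched as follows. This is a hypothetical illustration in Python: the class name, the in-memory dictionary, and the rule "send mail only when an item enters a new non-green color" are all assumptions, not the actual implementation.

```python
class AlertHistory:
    """Remember the last color seen per (host, item) so repeats are not mailed."""

    def __init__(self):
        # Hypothetical in-memory store; a real script would need to persist this.
        self._last = {}

    def should_notify(self, host, item, color):
        key = (host, item)
        previous = self._last.get(key, "green")
        self._last[key] = color
        # Mail only on entering an error color that differs from the last one seen.
        return color != "green" and color != previous
```

With this rule, an item that stays red produces one message rather than one per monitoring cycle, and a new message is sent only after the item has recovered and failed again or has changed color.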
The "db" icon gives you information about the health of the databases that are necessary to run the relay: DNS, LDAP, and MYSQL. It makes a simple request of a database, whereas the process monitoring behind the "mprocs" icon tells you whether a daemon is actually missing from the system.
The "db" icon can go red on any of the mdir* or mscan* nodes if a simple DNS resolution request fails. If that occurs, "dns" will be appended to the text on the second-level page for the "db" item.
If the "db" icon goes red on any of the mscan[1-6] nodes, it can mean that a request for a well-known alias ("postmaster") could not be completed within 20 seconds. This probably means that the LDAP server daemon, "SLAPD", is hung or has died. If that occurs, "ldap" will be appended to the text on the second-level page for the "db" item. Note that SLAPD does not run on mdir[12].
The "db" icon can go red on only one of mdir1 or mdir2, whichever one is currently the director (i.e. is handling the mdir.ucar.edu virtual address) and thus is the active MYSQL server. (The other mdir node will always return green for mysql while it is standing by.) It goes red if a "mysqladmin ping" command cannot return a "server alive" message within 20 seconds, in which case "mysql" is appended to the text on the second-level page for the "db" item. This means that the "MYSQLD" daemon is hung, unless "mprocs" shows that it has disappeared from the process table.
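The pattern shared by the "db" checks (make a simple request, give up after a time limit, and append the name of whatever failed) can be sketched as below. This is a minimal Python sketch under stated assumptions, not the monitoring script itself: it judges each probe by its exit status rather than by the text of its reply, it applies the same 20-second limit to every probe, and the function names are hypothetical. The standby behavior of the non-director mdir node is not modeled.

```python
import subprocess

TIMEOUT_SECONDS = 20  # the limit quoted above for the ldap and mysql checks


def probe_ok(command, timeout=TIMEOUT_SECONDS):
    """Return True if command exits with status 0 before the timeout expires."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung daemon (timeout) and a missing binary both count as failure.
        return False
    return result.returncode == 0


def db_status(probes, timeout=TIMEOUT_SECONDS):
    """Run each named probe; return (color, text) with failed names appended."""
    failed = [name for name, command in probes.items()
              if not probe_ok(command, timeout)]
    color = "red" if failed else "green"
    return color, " ".join(["db"] + failed)
```

On the active director, a call such as db_status({"mysql": ["mysqladmin", "ping"]}) would return ("red", "db mysql") if the ping fails or does not finish in time. The commands used for the dns and ldap probes are not given in this page, so they are left to the caller.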
The "load" icon extends the information available from the "cpu" icon, which reports the number of users, the number of processes, and the current load on the system. The "load" icon looks at the system's load averages for the past 1-minute and 5-minute periods. Both load values must exceed specific limits before the icon goes into either a "yellow" or "red" state. Thus, "load" provides a better indication of whether a loading problem is persisting.
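The "both values must be over their limits" rule can be sketched as follows. The limits used here are hypothetical placeholders, since this page does not state the actual values, and the function names are likewise assumptions.

```python
import os

# Hypothetical limits as (1-minute, 5-minute) pairs; the real values are not given here.
YELLOW_LIMITS = (4.0, 3.0)
RED_LIMITS = (8.0, 6.0)


def load_color(load1, load5, yellow=YELLOW_LIMITS, red=RED_LIMITS):
    """Both averages must exceed their limits before the color changes."""
    if load1 > red[0] and load5 > red[1]:
        return "red"
    if load1 > yellow[0] and load5 > yellow[1]:
        return "yellow"
    return "green"


def current_load_color():
    """Classify this machine's current load (Unix only)."""
    load1, load5, _load15 = os.getloadavg()
    return load_color(load1, load5)
```

Because the 5-minute average must also be over its limit, a brief spike that raises only the 1-minute average leaves the item green.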
The "mprocs" icon notifies you if processes that are necessary for correct operation of the relay disappear from the system. The processes of interest are: heartbeat/httpd/mysqld (on mdir* only); amavisd/clamd/slapd/spamd (on mscan* only); and crond/master/named/ntpd/sshd (on both mdir* and mscan*). Note that, in the last group, there is some overlap between the processes monitored by the generic item "procs" and those monitored by the custom item "mprocs".
The name of any process that has disappeared from the process table is appended to the text on the second-level page for the "mprocs" item.
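A check of this kind can be sketched as below, using the process lists given above. This is a hypothetical Python illustration: the function names are assumptions, and it matches process names exactly against the output of "ps -e -o comm=", which may not be how the real script identifies a daemon (some daemons appear in the process table under a different or truncated name).

```python
import subprocess

# Process lists taken from the description above.
MDIR_ONLY = {"heartbeat", "httpd", "mysqld"}
MSCAN_ONLY = {"amavisd", "clamd", "slapd", "spamd"}
BOTH = {"crond", "master", "named", "ntpd", "sshd"}


def required_for(hostname):
    """Return the set of process names expected on the given node."""
    if hostname.startswith("mdir"):
        return BOTH | MDIR_ONLY
    if hostname.startswith("mscan"):
        return BOTH | MSCAN_ONLY
    return set(BOTH)


def running_process_names():
    """List command names from the process table (assumes a POSIX ps)."""
    out = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def mprocs_text(hostname, running):
    """Return the item text with the names of any missing processes appended."""
    missing = sorted(required_for(hostname) - set(running))
    return " ".join(["mprocs"] + missing)
```

For example, mprocs_text("mscan3", running_process_names()) run on mscan3 would yield "mprocs slapd" if only slapd had disappeared.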
Item "postq" is available only on the mscan* nodes. Its purpose is to monitor the length/content of the mail queue being served by the main Mail Transfer Agent, "postfix". To do this, it inspects the output of the "/usr/sbin/postqueue -p" command. If it finds more than 150 queue entries for delivery to ucar.edu addresses that are not in a "connection refused" or "connection timed out" state, it goes yellow. If there are more than 300 such queue entries, it goes red.
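The counting rule can be sketched as follows. This is a simplified Python illustration, not the actual script: it assumes the usual blank-line-separated listing printed by "postqueue -p", treats an entry as a ucar.edu entry if any recipient line ends in "@ucar.edu", and skips an entry if its text mentions either of the two ignored states. The function names are hypothetical.

```python
import re

YELLOW_LIMIT = 150  # more than this many entries: yellow
RED_LIMIT = 300     # more than this many entries: red
IGNORED_REASONS = ("connection refused", "connection timed out")


def count_stuck_entries(postqueue_output, domain="ucar.edu"):
    """Count queue entries for the domain that are not in an ignored state."""
    suffix = "@" + domain.lower()
    count = 0
    # Entries in the listing are separated by blank lines.
    for block in re.split(r"\n\s*\n", postqueue_output.strip()):
        # Drop the header and summary lines, which start with "-".
        lines = [ln for ln in block.splitlines()
                 if ln.strip() and not ln.startswith("-")]
        if not lines:
            continue
        # Recipients are the indented lines that are not parenthesized reasons;
        # the first line holds the queue ID and the sender.
        recipients = [ln.strip().lower() for ln in lines[1:]
                      if ln[:1].isspace() and not ln.strip().startswith("(")]
        to_domain = any(r.endswith(suffix) for r in recipients)
        ignored = any(reason in block.lower() for reason in IGNORED_REASONS)
        if to_domain and not ignored:
            count += 1
    return count


def postq_color(count):
    """Map the entry count to a color using the limits quoted above."""
    if count > RED_LIMIT:
        return "red"
    if count > YELLOW_LIMIT:
        return "yellow"
    return "green"
```

Feeding the output of "/usr/sbin/postqueue -p" to count_stuck_entries and passing the result to postq_color gives the color under these assumptions.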