Nagios is an open-source program for monitoring networks. In March 2006, NETS started using Nagios to monitor the UCAR and FRGP networks. Before then, HP OpenView was used. Nagios is simpler and cheaper than OpenView.
For information about how the NCAR NOC monitors UCAR divisional machines, see the NCAB Host Monitoring Policy.
Some of what's described here was learned by reading Nagios System and Network Monitoring by Wolfgang Barth. There are other books on Nagios; I chose this one because it's recent.
To access Nagios, go to http://nagman.scd.ucar.edu/nagios/
To access the backup Nagios server, go to http://fl-nagman.ucar.edu/nagios
Find out more detailed information about the relationship between nagman and fl-nagman.
On ~2007-07-01, the Nagios 2.4 tarball install was replaced with Nagios 2.6 from the Debian archive. As root:
apt-get install nagios2 nagios-plugins-basic nagios-plugins-standard nagios2-doc
Apache2 is automatically installed as a dependency of nagios2. Edit /etc/nagios2/apache2.conf, and uncomment the references to Nagios 1.x. This allows us to reach nagios at the same URL as before, rather than the Debian default of http://127.0.0.1/nagios2
Don't forget to reload the Apache config:
/etc/init.d/apache2 reload
Debian nagios2 is pre-configured to look for the htpasswd.users file in /etc/nagios2.
Add a user to the htpasswd.users file:
# htpasswd /etc/nagios2/htpasswd.users siemsen
New password:
Re-type new password:
Adding password for user siemsen
The bottom of http://nagman.scd.ucar.edu/nagios/docs/installing.html has a link to Configuring Nagios.
Of course, reading the "Configuring Nagios" HTML docs helps a lot. What follows are the first steps I took.
Do not uncomment the line that defines the check_nagios program. It's supposed to let the CGIs check that the Nagios daemon is running. After some Googling, it seems that the program wasn't updated for Nagios version 2, so it won't work. Also, see my check_nagios notes.
Set
authorized_for_system_information=*
authorized_for_configuration_information=*
authorized_for_system_commands=siemsen
authorized_for_all_services=*
authorized_for_all_hosts=*
authorized_for_all_service_commands=*
authorized_for_all_host_commands=*
If you set show_context_help to 1, Nagios will put a little question-mark on every CGI page. When the user clicks it, Nagios displays a window that says that context-sensitive help isn't available. Looks like this is a work in progress. To avoid annoying the user, I set it to 0.
First, in the main configuration file named nagios.cfg, above the inclusion of the checkcommands.cfg file, there are comments suggesting that I use the checkcommands.cfg file that came with the plugins. But I couldn't find such a file. There's a command.cfg in the plugins distribution, but its syntax isn't right for this. Dunno what I'm supposed to do...
The verify command complained that all the commands defined in minimal.cfg were already defined. They were, in checkcommands.cfg. I commented out the reference to checkcommands.cfg in nagios.cfg, then commented out the definitions of notify-by-email and host-notify-by-email in minimal.cfg, and the verify command passed!
To allow use of "external commands" from the CGI scripts (like "Disable notifications for this host", "Disable active checks of this host", etc.), you first have to set the following in nagios.cfg and then signal the daemon to re-read its configuration:
check_external_commands=1
killall -HUP /usr/sbin/nagios2
This isn't enough - you also have to set permissions on the (named pipe) file in /var/lib/nagios2/rw/. I mostly followed the directions in http://nagios.sourceforge.net/docs/2_0/commandfile.html, but had to add an extra command to set the protection on the file itself to get it to work:
# Run as root.
groupadd nagiocmd
# -a appends the group; without it, -G replaces the user's other supplementary groups.
usermod -a -G nagiocmd nagios
usermod -a -G nagiocmd www-data
chown nagios:nagiocmd /var/lib/nagios2/rw
# The extra step: set ownership on the named pipe itself.
chown nagios:nagiocmd /var/lib/nagios2/rw/nagios.cmd
chmod u+rwx /var/lib/nagios2/rw
chmod g+s /var/lib/nagios2/rw
chmod o-rwx /var/lib/nagios2/rw
ls -ald /var/lib/nagios2/rw
/etc/init.d/apache2 restart
/etc/init.d/nagios2 restart
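With the permissions in place, any process in the nagiocmd group can submit an external command by writing one line to the command file. The following is a minimal sketch, not something taken from our setup: the function name and the host name example-host are hypothetical, and it assumes the documented Nagios external command format of "[epoch-seconds] COMMAND;arguments".

```shell
# Sketch: submit an external command to Nagios by appending one line to the
# command file. The function name is hypothetical. The line format
# "[epoch-seconds] COMMAND;args" is the one Nagios documents for external commands.
submit_external_command() {
    local command_file="$1"   # e.g. /var/lib/nagios2/rw/nagios.cmd
    local command="$2"        # e.g. DISABLE_HOST_NOTIFICATIONS;example-host
    local now
    now=$(date +%s)
    printf '[%s] %s\n' "$now" "$command" >> "$command_file"
}
```

For example, submit_external_command /var/lib/nagios2/rw/nagios.cmd "DISABLE_HOST_NOTIFICATIONS;example-host" should have the same effect as clicking "Disable notifications for this host" in the web interface.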
By default, Nagios uses the plugin named check_ping to actually ping hosts. The plugin named check_icmp is supposed to be more efficient than check_ping, as described in the Nagios book on page 88.
The check_icmp distributed with Debian nagios2 works as a drop-in replacement, and we now use it.
/etc/nagios2/ncar.d is our custom configuration directory. It is referenced from nagios.cfg. All *.cfg files placed here will be included in the nagios configuration.
I copied minimal.cfg to /etc/nagios2/ncar.d/ncar.cfg and changed the reference to it in nagios.cfg. Now I'm working with just NCAR-specific configuration commands. From here on in this section, the changes all refer to the ncar.cfg file. I began the development loop described above.
As of July 2007, /etc/nagios2/ncar.d contains:
hosts/ - auto-configuration scripts
ncar-ap.cfg - WLAN access points
ncar-bgp.cfg - BGP service
ncar.cfg - NCAR general network devices
ncar-command.cfg - custom commands (how to execute plugins)
ncar-contacts.cfg - contacts
ncar-env.cfg - environmental monitors (aka weathergeese)
ncar-hostgroups.cfg - hostgroups defined
ncar-security.cfg - security hosts
ncar-servicegroups.cfg - servicegroups defined
ncar-services.cfg - services defined
ncar-ups.cfg - UPS's
ncar-wan.cfg - FRGP, UPoP, BiSON
servers-acd.cfg
servers-cgd.cfg
servers-comet.cfg
servers-eol.cfg
servers-fanda.cfg
servers-globe.cfg
servers-hao.cfg - (all servers-*.cfg files are autogenerated by hosts/populate-hostgroup.pl)
servers-joss.cfg
servers-mmm.cfg
servers-nets.cfg
servers-ral.cfg
servers-scd.cfg
servers-unidata.cfg
servers-vislab.cfg
This is a description of how to autogenerate server configs
When Nagios raises an alarm, the UCAR NOC relies on web pages to explain what to do about it. There is a web page for each host that Nagios monitors, stored in the /var/www/noc directory on nagman. We monitor over 400 hosts, so there are more than 400 web pages.
The web page for each host describes:
Maintaining 400 web pages became a royal pain, so we wrote software to help. A bash script named /etc/nagios2/hosts/generate.sh creates new configs and web pages for all the hosts that Nagios monitors. When we make changes, we run the script and then restart Nagios as described below.
The bash script runs a Perl program named generate-nagios-configs.pl multiple times, once for each hostgroup. You can run generate-nagios-configs.pl for a single hostgroup, but we usually run generate.sh to create all the hostgroups at once. Run "generate-nagios-configs.pl -?" to see the program's command-line syntax.
Unlike other programs maintained by NETS, generate.sh isn't run by a cron job. You must run it manually whenever you want to change the hosts that are monitored by Nagios or the web pages used by Nagios.
The program creates config files and web pages. It writes configs into a temporary directory. To make Nagios use the configs, you must manually copy them to the live Nagios config directory (/etc/nagios2/ncar.d). Then you run a Nagios command to verify that the configs are legal, and then you restart the Nagios process. This manual process protects Nagios from problems should the program generate bogus configs. The program also generates web pages, which it writes directly into the directory that holds the active Nagios web pages (/var/www/noc/).
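The manual steps above can be sketched as a small shell function. This is only an illustration, not a script that exists on nagman: the function name and the staging directory argument are hypothetical, while the default paths and commands are the ones used elsewhere in these notes. The verify and restart commands are parameters so the function can be tried without touching a live Nagios.

```shell
# Sketch of the manual deployment: copy generated configs into the live
# directory, verify them, and restart Nagios only if verification passes.
deploy_generated_configs() {
    local staging_dir="$1"                       # hypothetical: where the generator left the new configs
    local live_dir="${2:-/etc/nagios2/ncar.d}"   # live Nagios config directory
    local verify_cmd="${3:-/usr/sbin/nagios2 -v /etc/nagios2/nagios.cfg}"
    local restart_cmd="${4:-/etc/init.d/nagios2 restart}"

    cp "$staging_dir"/*.cfg "$live_dir"/ || return 1

    # Intentionally unquoted so the command strings split into words.
    if $verify_cmd > /dev/null; then
        $restart_cmd
    else
        echo "verification failed; not restarting Nagios" >&2
        return 1
    fi
}
```

If verification fails, the running daemon keeps its old in-memory configuration, but the bogus files are already in the live directory and must be fixed before the next restart.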
Each time the program is run, it reads a file containing a list of hosts that are to be monitored by Nagios. There is one such file for each hostgroup in Nagios. The files are named "host-*.txt", where the asterisk stands for the hostgroup name. For example, the file named host-upop.txt contains all the hosts in the UPOP hostgroup.
The program generates a web page for each host name found in the input files. Basically, each web page has the following sections:
The sections of the output web pages are automatically generated, but some can be overridden to provide more specific information. For example, there may be specific instructions for a given host. If there are no specific instructions for the host, the program looks for group-specific instructions. If it doesn't find those, it looks for a group-specific list of people to contact. This scheme makes it possible to use generic instructions for, say, all the access points, while allowing for very specific instructions for, say, sabae.
For a given host, if there is a "description" file, the program inserts its contents into the output file. For example, for the host named sabae listed in host-security.txt, the program looks for a file named /var/www/noc/sabae-description.shtml to fill in the description section of the web page that it writes. If there is no file named /var/www/noc/sabae-description.shtml, the program puts nothing in the description section of the output web page. To fill in the instructions section of the web page, the program looks for a file named /var/www/noc/sabae-instructions.shtml, and failing that, for a file named /var/www/noc/Security-instructions.shtml, and failing that, for a file named /var/www/noc/Security-support-personnel.shtml. A file named "*-support-personnel.shtml" must exist for every hostgroup.
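The lookup order for the instructions section can be sketched as follows. This is a shell illustration of the fallback logic only, not the Perl code in the generator; the function name is hypothetical, and the group name is passed in already capitalized (Security), as in the file names above.

```shell
# Sketch: print the first instructions file that exists, trying host-specific,
# then group-specific instructions, then the group's support-personnel list.
find_instructions_file() {
    local web_dir="$1"   # e.g. /var/www/noc
    local host="$2"      # e.g. sabae
    local group="$3"     # e.g. Security
    local candidate
    for candidate in \
        "${web_dir}/${host}-instructions.shtml" \
        "${web_dir}/${group}-instructions.shtml" \
        "${web_dir}/${group}-support-personnel.shtml"
    do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    # Reaching this point means the mandatory support-personnel file is missing.
    return 1
}
```

For example, find_instructions_file /var/www/noc sabae Security prints the sabae-specific file if it exists, and otherwise falls back to the Security group files.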
In the files that specify lists of hosts, there is one host name on each line. The program will use DNS to resolve the name into an IP address that is written to the Nagios config file. Following the host name is an optional severity for the host - 1, 2 or 3. If no severity is specified, the default is 3, the lowest severity. If there are two numbers following the host name, the first is the Solitary severity (when the host is down but its alternate is up) and the second is the Collective severity (when all the alternates are down at the same time). This scheme is meant to be used for cases where there are two paths to a host, like csu-router-bison-a and csu-router-bison-b. The program writes a boilerplate paragraph that explains the severity.
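To make the line format concrete, here is a sketch of how one line of a "host-*.txt" file could be split into a host name and its severities. It is not the generator's actual parsing code: the function name is hypothetical, the DNS lookup is left out, and reporting the Collective severity as equal to the Solitary one when only one number is given is an assumption made for the sketch.

```shell
# Sketch: parse "hostname [solitary-severity [collective-severity]]" and print
# "hostname solitary collective".
parse_host_line() {
    # Intentionally unquoted: split the line into whitespace-separated fields.
    set -- $1
    local host="$1"
    local solitary="${2:-3}"             # default is 3, the lowest severity
    local collective="${3:-$solitary}"   # sketch assumption: same as solitary if absent
    printf '%s %s %s\n' "$host" "$solitary" "$collective"
}
```

With made-up severities, parse_host_line "csu-router-bison-a 3 1" prints "csu-router-bison-a 3 1", and parse_host_line "sabae" prints "sabae 3 3".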
A slightly different case is that of "high-availability" hosts, like "mscan". For these hosts, there are several low-severity hosts that collectively make up "mscan". As long as one of mscan1, mscan2, mscan3 or mscan4 is up, then "mscan" is up. For this case, we monitor each of mscan, mscan1, mscan2, etc. Mscan is severity 1 and the others are severity 3. The program does nothing special for these hosts. We use simple severities in the "host-*.txt" files, with description files that have the same contents, explaining how the aggregate works.
To monitor a network node, you have to create a "host" entry.
In general, Nagios is designed to check "services" like web servers, DNS servers, mail servers, etc. When a service doesn't respond, Nagios does a "host check" to see if the host itself is up. Host checks use ping, because pings are the most likely thing to work, since they're implemented by the machine's kernel.
At UCAR, we used to use OpenView, which didn't have a concept of "services". It simply pinged hosts. Nagios isn't really built to do that. To make it behave that way, I use pings for both service checks and host checks. So Nagios does ping "service checks", and when pings fail, it does a ping "host check".
From: John Hernandez
Date: Fri Sep 14 2007 - 16:53:04 MDT
To: ne
Subject: Nagios BGP service checks
Hey NE,
I implemented a new form of BGP service checks for Nagios. This should eliminate the problems we had with routers being unresponsive when faced with many simultaneous SNMP queries, and the Nagios event logs filling up with junk.
The method uses a new (to us) concept in Nagios called "passive service checks." A cron job (running as root) runs /usr/lib/nagios/plugins/passive_check_bgp.sh every 60 seconds. The program collects information from the routers by connecting to each router once (roughly speaking). The program submits the results to Nagios. Nagios does not initiate any checks itself, hence the term "passive."
Nagios is configured to continuously scan its "command file" for these passive BGP service updates. If BGP data is not refreshed within 120 seconds, Nagios will issue a stale data warning (service turns yellow). Other than this, the functionality is basically identical to what we had before. The NOC shouldn't notice any difference other than fewer transient false positives.
-John
So for example, to change the number of routes that Nagios expects the FRGP router to receive from the UCD, log on to nagman and edit /etc/nagios/bgp.d/frgp-gw-1-bgp.conf and change the "UCD" line. Then wait for 60 seconds for the cron job to run again, and Nagios should reflect your changes.
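For reference, a passive result reaches Nagios as a single line written to the command file. The sketch below only shows that line format, using the documented PROCESS_SERVICE_CHECK_RESULT external command; it is not the contents of passive_check_bgp.sh. The function name is hypothetical, and the host name frgp-gw-1 and service name BGP are illustrative guesses based on the names used above.

```shell
# Sketch: format one passive service check result.
# return_code follows the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
format_passive_result() {
    local timestamp="$1"
    local host="$2"
    local service="$3"
    local return_code="$4"
    local output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$timestamp" "$host" "$service" "$return_code" "$output"
}
```

A submitting script would append the result to the command file, for example: format_passive_result "$(date +%s)" frgp-gw-1 BGP 0 "OK" >> /var/lib/nagios2/rw/nagios.cmd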
From: John Hernandez (jph@ucar.edu)
Date: Wed Oct 17 2007 - 15:53:40 MDT
To: ne@ucar.edu
Subject: Nagios "Chassis" service
Hey NE,
Nagios now checks the power supply status, fan status, and module status for all NCAR and FRGP Cisco 6500s. The service is called "Chassis". The plugin output includes the number of checks performed, chassis serial number, and if applicable, a list of any failed components.
-John
From: John Hernandez (jph@ucar.edu)
Date: Wed Oct 31 2007 - 16:39:02 MDT
To: ne@ucar.edu
Subject: Yep, more Nagios updates
Hey NE,
A couple of new Nagios tweaks to report:
- The "Interfaces" service has been expanded to cover the fiber-capable modules (those with GBIC/SFP/Xenpack ports, including supervisors) on campus core switches. This should allow us to detect when half of an etherchannel uplink is down. For now, the modules to be checked are specified in the Nagios configuration file. We also talked about having it ignore ports that are in spanning-tree portfast instead. Any opinions one way or the other?
- I created some "Servicegroup" definitions. You can now click on Servicegroup Summary in the left pane to see a summary of BGP, Catalyst Chassis, DNS, HTTP, Interfaces, and OSPF service checks.
- For Pete - to keep things more manageable, I broke out the services configuration into its own file, ncar-services.cfg
-John
To make Nagios intelligent about figuring out what's wrong, set the optional "parents" field in each host definition. You can use Nagios's "Status Map" to see a fairly useless drawing that shows what depends on what. I was initially confused about whether this should reflect Layer 2 dependencies or Layer 3 dependencies. The revelation is that Nagios doesn't care - it uses dependencies to be smart about what's down. It doesn't use them until a node is down; then it looks up the dependency tree to figure out what's really down. We implement dependencies at Layer 3. The generate-nagios-configs.pl script uses traceroute commands to figure out how to set the parents values.
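As a rough illustration of deriving a Layer 3 parent from traceroute, the sketch below picks the last responding hop before the destination from "traceroute -n" output. It is not the logic in generate-nagios-configs.pl, which isn't shown here; the function name is hypothetical, and it assumes the final hop line is the destination itself. Mapping the resulting IP address to a Nagios host name is left out.

```shell
# Sketch: read "traceroute -n" output on stdin and print the address of the
# next-to-last responding hop. Hop lines start with a hop number; hops that
# timed out ("*") are skipped. Prints nothing for a directly connected host.
parent_from_traceroute() {
    awk '$1 ~ /^[0-9]+$/ && $2 != "*" { prev = last; last = $2 }
         END { if (prev != "") print prev }'
}
```

Usage would look like: traceroute -n some-host | parent_from_traceroute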
We are currently configured to ping all hosts every 3 minutes.
The graphics capability in Nagios is rudimentary compared to OpenView. Actually, it's useless because there are too many things on the map. If I can fix this, then it'll be worth visiting http://www.nagios.org/download/ to get nicer icons.
cat >/var/www/nets/test.php
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>NETS Web</title>
</head>
<body bgcolor="white">
PHP is at version <?php phpinfo() ?>
</body>
</html>
^D
fl-nagman$ keytool -alias nexsm -genkey -keystore nexsm-keystore
Enter keystore password: ucarnagios
What is your first and last name?
  [Unknown]: Pete Siemsen
What is the name of your organizational unit?
  [Unknown]: Computational & Informational Systems Laboratary
What is the name of your organization?
  [Unknown]: National Center for Atmospheric Research
What is the name of your City or Locality?
  [Unknown]: Boulder
What is the name of your State or Province?
  [Unknown]: Colorado
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Pete Siemsen, OU=Computational & Informational Systems Laboratary, O=National Center for Atmospheric Research, L=Boulder, ST=Colorado, C=US correct?
  [no]: yes
Enter key password for <nexsm>
  (RETURN if same as keystore password):
fl-nagman$
The /etc/nagios2/ncar.d/ncar.cfg config file has a contact named nagios-admin. To make email sent to nagios-admin be forwarded to siemsen@ucar.edu, I added a line that says "nagios-admin: siemsen@ucar.edu" to /etc/aliases, and then ran the newaliases command.
To start the Nagios daemon from scratch:
(as root)
/etc/init.d/nagios2 start
With the daemon running, to view the user interface, browse to http://nagman.scd.ucar.edu/nagios/
To make the Nagios daemon re-read its configuration files, first verify the configuration, then send the daemon a HUP signal:
/usr/sbin/nagios2 -v /etc/nagios2/nagios.cfg
killall -HUP /usr/sbin/nagios2
The UCAR NOC uses "severity levels" to determine the importance of service problems. There are at least three places where the concept of "severity" is defined:
Luckily, these three definitions are similar!
The Nagios web UI plays a "horn" WAV file when a system goes down. This is specified in cgi.cfg and is the primary method CPG relies on to know when they need to respond.
We use the Nagios notification system to send a Win popup message to the main CPG workstation when a machine recovers.
Nagios rotates its own logs daily. The logs are located in /var/log/nagios2. The current log is nagios.log. Old logs go into the archives/ directory. These can be browsed from the web interface with the event log viewer.
To check an SNMP variable with Nagios, you can use check_snmp or you can write your own plugins. The latter may be better, because the name of the plugin provides some documentation, and the plugin itself can check the range of values. If you write your own, steal like mad from snmp4nagios.
as root
cd /usr/src
gunzip snmp4nagios-0.1.tar.gz
tar xf snmp4nagios-0.1.tar
rm snmp4nagios-0.1.tar
cd snmp4nagios-0.1
more INSTALL
mkdir -p /usr/local/nagios/perflog/rrd
mkdir -p /usr/local/nagios/perflog/img
mkdir -p /usr/local/nagios/libexec/snmp4nagios
At this point, you're ready to "make install", but it'll fail because it can't find rrd.h. You have to install rrdtool-devel before you can continue.
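As for what "the plugin itself can check the range of values" might look like in a home-grown plugin, here is a minimal sketch. It is not taken from snmp4nagios; the function name, label, and thresholds are hypothetical, and it handles integer values only. It follows the standard Nagios plugin convention of one line of output plus a status code of 0 (OK), 1 (WARNING), or 2 (CRITICAL).

```shell
# Sketch: compare an integer value against warning and critical thresholds,
# print a one-line status, and return the matching Nagios status code.
check_value_range() {
    local label="$1"   # what is being measured, e.g. temperature
    local value="$2"   # the value obtained from the SNMP query
    local warn="$3"    # WARNING at or above this value
    local crit="$4"    # CRITICAL at or above this value
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $label is $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $label is $value"
        return 1
    fi
    echo "OK - $label is $value"
    return 0
}
```

A real plugin would obtain the value with an SNMP query, call the function, and exit with its return code, so that Nagios sees the status.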
SNMP traps can be handled with a Nagios "passive service check". Need to investigate this.
Jeff installed the Net-SNMP package on nagman, and sent me some notes about it. He suggested that I consider these:
VERY GOOD: http://www.samag.com/documents/s=9559/sam0503g/. I made a copy of it at SNMP-traps.html.
See syslog-nagman.shtml and net-snmp-nagman.shtml
Some browsers "hang", which means they initially work fine, but after a while, when you try to leave a page, the browser doesn't go to the new page. Instead, it turns the cursor to an hourglass, and only when the 60-second timeout expires does the browser change pages. This happens with Firefox and IE under Windows XP on BRONCO, the NOC Nagios machine. I'm trying Opera on that machine as of 2006-03-03; I hope it helps.
http://www.unixreview.com/documents/s=9602/ur0503l/ur0503l.html
Older article: http://www.linuxjournal.com/article/6767
Maintaining config files by hand can be a pain. Monarch is an open-source program that edits Nagios config files. It's available on SourceForge at http://sourceforge.net/project/showfiles.php?group_id=130574