This guide describes how we use Nagios at NCAR. It is meant for the NCAR NOC staff, but also applies to NETS staff members who run Nagios.
Nagios can be run on several machines at once. In the NCAR NOC, we always keep a Nagios session running on Bronco, a Windows machine dedicated to just Nagios. Bronco's only job is to make noises when Nagios detects network problems. Other than controlling the noise volume with the knob on Bronco's speakers, please leave Bronco alone. Bronco's video output may be displayed on the projector screen, but we don't interact with Nagios through Bronco. Run your own copy of Nagios on one of the other machines in the NOC.
To start Nagios, open a web browser and go to http://nagman.scd.ucar.edu/nagios. Log in to Nagios as yourself. If you don't know your password, ask Pete Siemsen or John Hernandez. Unlike Openview, Nagios requires you to log in to your own account. This allows Nagios to keep track of who did what as various users interact with Nagios.
When you start Nagios, you'll see the Nagios home page, which says "Nagios" in big letters at the top of the screen. There are several menu items on the left-hand side of the screen, which appear on all Nagios screens. They are labelled "Home", "Documentation", "Tactical Overview", etc. Please click on Hostgroup Summary.
The Hostgroup Summary page is titled
As you browse around the Nagios pages, you may discover an irritating feature - when you're on some pages, your browser's "Back" button won't work as expected. Instead of taking you to the previous page, it may take you to the main page. In general, it's better to navigate around the Nagios pages using the buttons on the left-hand side of the Nagios pages and avoid the "Back" button.
Nagios notifies you about network problems by turning things red and making noises. Every 60 seconds, each Nagios session asks the Nagios server (nagman) if anything is wrong. If something is wrong, Bronco will make a noise, usually an irritating old-car-horn "a-oooo-gaaa" sound. Your machine may also make a noise, depending on whether your browser supports noises, whether your speakers are turned on, and whether your volume is turned up. If your machine makes a noise, it will happen a bit earlier or later than Bronco, since each Nagios session uses its own 60-second timer.
When there is nothing wrong, Bronco and all other Nagios sessions are quiet. When one or more things are wrong, Nagios will make a noise every 60 seconds.
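The timing difference between sessions can be illustrated with a small sketch. This is only a model of the behavior described above, not anything taken from Nagios itself: the function name `first_alert_time` and the start times in the example are made up for illustration, and all times are in seconds.

```python
REFRESH_SECONDS = 60  # refresh interval described in this guide


def first_alert_time(session_start, problem_time, interval=REFRESH_SECONDS):
    """Return the time of the first refresh at or after problem_time.

    session_start is when this session's own timer began (hypothetical value).
    """
    if problem_time <= session_start:
        return session_start
    elapsed = problem_time - session_start
    # Integer ceiling division: number of whole intervals needed to reach
    # or pass the moment the problem started.
    ticks = (elapsed + interval - 1) // interval
    return session_start + ticks * interval


# Hypothetical example: Bronco's timer started at t=0, yours at t=25,
# and a host went down at t=130.
bronco_alert = first_alert_time(session_start=0, problem_time=130)
your_alert = first_alert_time(session_start=25, problem_time=130)
print(bronco_alert, your_alert)  # 180 145
```

In this made-up case your machine would honk 35 seconds before Bronco, even though both are watching the same server.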
When Nagios makes a noise and you want to figure out why, go to Host Problems (over on the left). You'll see a line for every machine that has a problem. Next to the machine's name you'll see some icons. One icon looks like a man shoveling dirt, one looks like a cloud, one looks like a magnifying glass, and another is some Z's, like someone is sleeping. You don't have to remember exactly what each icon means - if you put the cursor over an icon, Nagios will display what it means. They are easy to learn.
For example, I thought the man shoveling dirt meant
Nagios only makes noise about problems that haven't been acknowledged in some way. In the list of Host Problems, the acknowledged ones have either the icon that looks like a man shoveling dirt or the icon that looks like some Z's. Look for ones that have neither of those icons - those are the ones that Nagios is making noise about. When you find one, click on the machine name to go to the host page for that host. You'll see some "Host State Information" and some "Host Commands". You should also see a red folder icon on the right side of the page titled "Extra Host Notes". Click on that icon to see the "Info" web page for the host. This is the same "Info" page that we used in the old Openview system - it should tell you what to do about the problem. Once you've called someone or sent email or otherwise handled the problem, you can acknowledge the problem as follows.
When Nagios makes a noise, you'll want to make it stop. There are two ways to do it short of turning down the volume on the speakers: Acknowledge the problem, or schedule a downtime for the host.
When you Acknowledge the problem, you're telling Nagios that you have done something about the problem and you want Nagios to be quiet about it. This works well for most network problems. When the host comes back up, Nagios will quietly turn the host green. If the host goes down again, Nagios will make noise again, and you'll have to acknowledge it again. We use Acknowledge a lot in the NCAR NOC - it's the most common way we interact with Nagios.
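The way an acknowledgement lasts only for the current outage can be sketched as a tiny state model. This is an illustration of the behavior described above, not Nagios code: the `Host` class and its method names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical model of one monitored host, for illustration only."""
    name: str
    up: bool = True
    acknowledged: bool = False

    def go_down(self):
        # A new outage always starts unacknowledged.
        self.up = False
        self.acknowledged = False

    def acknowledge(self):
        # Only a host that is currently down has a problem to acknowledge.
        if not self.up:
            self.acknowledged = True

    def recover(self):
        # Recovery clears the acknowledgement, so the next outage is noisy.
        self.up = True
        self.acknowledged = False

    def makes_noise(self):
        return not self.up and not self.acknowledged


host = Host("fileserver")
host.go_down()
print(host.makes_noise())  # True: down and not yet acknowledged
host.acknowledge()
print(host.makes_noise())  # False: acknowledged, so quiet
host.recover()
host.go_down()
print(host.makes_noise())  # True: a new outage needs a new acknowledgement
```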
To Acknowledge a host, navigate to the Host page for the host that's down. Under "Host Commands", click on "Acknowledge this host problem". Nagios will prompt you to type a comment. Nagios saves the comment so that later, others can see why you Acknowledged the problem. When Nagios saves the comment, it also saves the name of the person who made the comment. This is why it's important for you to log in to Nagios as yourself. To see the comments describing what has been done about a host, go to the Host page for the host and scroll down to the "Comments" section at the bottom of the page.
Acknowledging a problem works well for most problems, but not when a host goes down and up several times. In that case, you hear noise and have to acknowledge the problem each time the host goes down. To handle this kind of problem, instead of Acknowledging the problem, you "Schedule a downtime" for the host. You supply a start time and an end time. When the downtime expires, Nagios will start monitoring the host again. For example, suppose Nagios starts making noise because "fileserver" is down. You call DSG and learn that they plan to work on fileserver on and off until 6:00pm, and it'll be up and down a lot until then. You schedule a downtime starting an hour ago and ending at 6:00pm. Nagios will not make any more noise about fileserver until 6:00pm. Often, we schedule downtimes in advance, before Nagios notices any problems.
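The fileserver example can be worked through as a short sketch. It only models the idea of a downtime window; the function `in_downtime`, the calendar date, and the 2:30pm time at which the problem was noticed are assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def in_downtime(now, start, end):
    """Return True while the scheduled downtime is in effect."""
    return start <= now < end


# Hypothetical: Nagios starts making noise about fileserver at 2:30pm.
noticed = datetime(2024, 1, 15, 14, 30)

# Schedule the downtime from an hour ago until 6:00pm, as in the example above.
start = noticed - timedelta(hours=1)
end = noticed.replace(hour=18, minute=0)

print(in_downtime(datetime(2024, 1, 15, 15, 0), start, end))   # True: quiet
print(in_downtime(datetime(2024, 1, 15, 18, 30), start, end))  # False: monitored again
```

Starting the window an hour in the past means the outage that is already in progress falls inside the downtime, not just the ones that follow.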
To schedule a downtime, navigate to the Host page for the host that's down. Under "Host Commands", click on "Schedule downtime for this host".
To cancel a scheduled downtime, navigate to the "Downtimes" page. Find the downtime and select the trash-can on the far right.
Other Nagios menu items...
This shows all the comments that have been entered. It's a way to see what the NOC staff has been doing, for all machines, in time order. You might use it to answer the question "what did the NOC staff do with Nagios last shift?".
This shows all events that Nagios has noticed. It is something like Openview's Alarm Browser with the All Alarms category. This shows minor and major events, so it's probably more events than you want to know about.
This shows a subset of all the events. You can use the boxes at the upper right to set filters to see only a subset of events. Remember to click "Update".
This is something like the Alert History page, but in table form.
To learn more, please browse around the Nagios pages and try things out, but avoid the "Host Commands", which can change the way Nagios behaves. If you have questions, please ask Pete Siemsen or Kelly Meese. Of course, if you can suggest ways to improve our use of Nagios, please let us know.