Wednesday, May 16, 2007

nagios

Another application I would like to draw your attention to is Nagios. Nagios is an open source monitoring tool that has all the bits and pieces you need to keep tabs on your environment.

One of the first complex applications I made was a suite of monitoring programs I called Imp (never publicly released). For its time it was nice and all, but it was entirely command line driven and extending it wasn't the most friendly thing. Then one year we came across Nagios, found it was an exceptional replacement across the board, phased Imp out and moved to Nagios. I'm bringing this up to highlight an important lesson: you will come up with a brilliant tool, and then someone will come up with a better one. You need to retain objectivity and know when it's best to put your pride aside and move to the better solution.

So back to Nagios. The tool has some great logic built into it: by supporting both active and passive checks you can monitor anything, even hosts behind NAT. The main genius of Nagios is the ease of writing checks. They defined a very easy to use standard for your checks (a check prints a line of status text to STDOUT and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and they let NRPE/NSCA do all the network heavy lifting. So you have a script you can run on its own and it'll tell you whether what you are checking is working or not. All in all, it makes building and debugging new checks a snap.
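To give a feel for how little a check needs, here is a minimal sketch of a disk usage check in Python. The exit codes follow the plugin convention; everything else (the function names `evaluate` and `main`, the 80/90 percent thresholds, the wording of the status line) is my own choice for illustration, not something Nagios prescribes.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(percent_used, warn, crit):
    """Map a usage percentage to an (exit_code, status_line) pair."""
    if percent_used >= crit:
        code = CRITICAL
    elif percent_used >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {percent_used:.1f}% used"


def main(path="/", warn=80.0, crit=90.0):
    """Run the check, print the status line, and return the exit code."""
    # The path and thresholds are illustrative defaults (assumptions).
    try:
        usage = shutil.disk_usage(path)
        percent_used = 100.0 * usage.used / usage.total
        code, line = evaluate(percent_used, warn, crit)
    except OSError as exc:
        # Anything that stops the check from measuring is UNKNOWN, not OK.
        code, line = UNKNOWN, f"DISK UNKNOWN - {exc}"
    print(line)
    return code

# A real plugin script would end with: import sys; sys.exit(main())
```

Because the check is just a script, you can run it by hand and look at the printed line and the exit status, which is exactly what makes debugging so painless.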

This is a great example of elegant design. By making it so easy to write checks, they pushed us to adopt a policy: if any outage occurs that we didn't catch via monitoring, then you must write a check that will catch it next time. This policy has been in place for a number of years, and we almost never experience an outage of any kind without the monitoring service letting us know about it.

Another note I'd like to bring up is the value of using RRD heavily (with Cacti if you'd like) alongside Nagios. Graphs are great, but they tend to just grow in the corner where no one looks at them. Nagios gives you an easy mechanism to create alerts that will make an admin go look at those graphs, ideally long before anything becomes service impacting.
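As a rough sketch of tying the two together, the function below takes a handful of recent samples and turns them into a Nagios-style result. It assumes the samples have already been pulled out of the RRD by some other means (for example with `rrdtool fetch`); the function name `check_trend`, the `load` label and the thresholds are hypothetical. The part after the `|` is plugin performance data in the `label=value;warn;crit` form.

```python
def check_trend(samples, warn, crit, label="load"):
    """Return (exit_code, status_line) for the average of recent samples."""
    if not samples:
        # No data is a problem with the check itself, so report UNKNOWN (3).
        return 3, f"{label.upper()} UNKNOWN - no samples"
    avg = sum(samples) / len(samples)
    if avg >= crit:
        code, state = 2, "CRITICAL"
    elif avg >= warn:
        code, state = 1, "WARNING"
    else:
        code, state = 0, "OK"
    # Text after '|' is performance data: label=value;warn;crit
    line = f"{label.upper()} {state} - avg {avg:.2f} | {label}={avg:.2f};{warn};{crit}"
    return code, line
```

Setting the warning threshold well below the point where things actually break is what gets someone looking at the graph early.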
