Sunday, December 11, 2011

Monitoring Guidelines

  • Keep it simple, stupid.
    • Start with simple monitoring, such as PING to make sure that the host is up and standard checks like HTTP and SMTP to make sure that standard services are up. Going from no monitoring to basic monitoring is a huge step, and many organizations do not have the processes to handle more complex monitoring.
    • The next step is disk, CPU and memory on hosts. On network devices, it is port load, CPU load and network link status.
    • The third step is to dig into business-critical applications and services.
  • Small iterations. Do not try to build a top-of-the-line monitoring solution from day one, or you will never leave the startup phase.
  • Let the monitoring solution pull the status instead of having the status sent to the monitoring solution. This avoids complicated rules for handling the different types of information that are sent to the monitoring solution. So avoid sending SNMP traps.
  • The monitoring solution is NOT a trashcan for tons of uninteresting garbage. It is far too common that HW vendors recommend sending thousands of unnecessary SNMP traps to the monitoring solution when only a few are interesting. It is a nightmare to create the ruleset to figure out what is interesting, especially if there are dependencies where one message is interesting only if another message has been sent before. The documentation is a badly written MIB of a couple of hundred pages. In almost every case I’ve run into with this approach, the implementation never ends and test cases are hard to create. When the systems are in production, I can bet that a critical event will occur which has not been taken care of, and production will stop. Managers will be upset, vendors will blame each other and customers will be angry.
  • Make the status easily available:
    • via standard APIs; Perl and Bash are the most common.
    • SNMP via snmpget instead of SNMP traps.
    • Status stored in a database, so that the monitoring solution can run SQL queries to get the status.
    • Commands that can be run by the monitoring solution and the output parsed or, even better, exit codes that are used and documented.
  • Normally it is not a good idea to read a log file to understand the status of the software.
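
The pull approach in the list above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the setup of any particular monitoring product: the function names run_check and check_tcp are made up for this example, and the exit code mapping (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) is the convention commonly used by monitoring plugins, which the guidelines themselves do not prescribe.

```python
import socket
import subprocess

# Assumed convention (not taken from the guidelines above): exit codes 0-3,
# in the style commonly used by monitoring plugins.
STATUS_BY_EXIT_CODE = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, timeout=10.0):
    """Run a check command and map its exit code to a status name.

    Returns (status, first line of the command's standard output).
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not be started or did not finish in time.
        return "UNKNOWN", str(exc)
    status = STATUS_BY_EXIT_CODE.get(result.returncode, "UNKNOWN")
    lines = result.stdout.strip().splitlines()
    return status, (lines[0] if lines else "")


def check_tcp(host, port, timeout=3.0):
    """Basic 'is the service up' check: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK", f"{host}:{port} accepts connections"
    except OSError as exc:
        return "CRITICAL", f"{host}:{port} unreachable: {exc}"
```

A poller would call, for example, check_tcp("mail.example.com", 25) for a basic SMTP check (the host name is a placeholder) or run_check with the path to a vendor-supplied status command. Because the status comes from a documented exit code, no ruleset for interpreting unsolicited messages is needed.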
http://www.it-slav.net/blogs/2009/08/30/general-systems-and-network-management-guidelines-part-1/
