Those of us who come from the systems world have our quirks, but two are very common: we want to install and configure as much software as possible ourselves, and we never have time for anything. In monitoring, this translates into thousands of installations around the world of software such as Nagios, Zabbix, Icinga, PRTG, OP5, etc. that are underused, misconfigured and, sadly, in many cases failing to fulfill their mission, which is to detect problems before they affect users or cause greater damage.
Why does this happen? In many cases the sequence is as follows:
I would love to set up Zabbix, but I don't have time (we can be stuck at this step for many weeks).
One day the systems crash and my boss says to me, "How did we find out? Don't we have a monitoring platform?"
I make some time and finally install Zabbix. Following a couple of tutorials, I manage to configure my servers, and with luck even a switch or two, using the default templates.
I put the graphs that seem most interesting to me on my second screen. I go home and open a beer.
An alarm notification! It seems some server has a CPU spike; I'll check it tomorrow. While I watch an episode of "The IT Crowd", I get about 20 similar alarms. It must be something normal.
In the morning, when I arrive at the office, my inbox is full of alarm notifications from many of the servers.
After spending some time researching, I deduce that they are caused by the load of the nightly backups and the automated restarts of certain processes. I think I should look into whether there is a way to stop these alarms. Ah, but once again I don't have time.
One day the systems go down and my boss says to me, "How did we find out? Don't we have a monitoring platform?"
And I say to him, "Yes, look, if we check the folder where I filter all the notifications, here it is: the system warned us."
My boss: "Shouldn't something have turned red?" He looks at my second screen, which is already completely full of alarms. "Is that normal?"
"Yes, that's normal."
No, it's not normal
No, it's not normal. An alarm is something that must "wake someone up"; it must represent a problem. Several levels can be established, distinguishing who should be notified and whether the alarm needs immediate attention or should simply be recorded for periodic review. The solution, of course, is not to silence all the alarms. It involves customizing the templates to our reality, always with the focus on the impact on the service, and keeping the alarms under hygiene control: periodically reviewing which ones are proving useful and which ones usually turn out to be false alarms or are "flapping", so that they can be refined.
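As a rough illustration of what that periodic hygiene review could look like, here is a minimal Python sketch. It does not use the API of Zabbix or of any other tool: the `AlarmEvent` record, the `review_alarms` function, the `actionable` flag (filled in by hand during the review) and the thresholds are all hypothetical names and values chosen for the example. The idea is simply to summarize, per trigger, how many of its alarms needed no action and whether it changed state so often within a time window that it can be considered flapping.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AlarmEvent:
    trigger: str              # hypothetical trigger name, e.g. "cpu"
    timestamp: int            # seconds since epoch
    state: str                # "PROBLEM" or "OK"
    actionable: bool = False  # set during review: did someone have to act?


def review_alarms(events, flap_threshold=4, window_seconds=3600):
    """Summarize per trigger how noisy its alarms were.

    A trigger is marked as flapping when it changes state at least
    `flap_threshold` times within any span of `window_seconds`.
    Both defaults are arbitrary and meant to be tuned.
    """
    by_trigger = defaultdict(list)
    for event in sorted(events, key=lambda e: e.timestamp):
        by_trigger[event.trigger].append(event)

    report = {}
    for trigger, evs in by_trigger.items():
        problems = [e for e in evs if e.state == "PROBLEM"]

        # Timestamps at which the state differs from the previous event.
        changes = [
            cur.timestamp
            for prev, cur in zip(evs, evs[1:])
            if cur.state != prev.state
        ]

        # Sliding window: largest number of state changes in any window.
        max_changes = 0
        start = 0
        for end, ts in enumerate(changes):
            while ts - changes[start] > window_seconds:
                start += 1
            max_changes = max(max_changes, end - start + 1)

        false_alarms = sum(1 for e in problems if not e.actionable)
        report[trigger] = {
            "problems": len(problems),
            "false_alarm_ratio": false_alarms / len(problems) if problems else 0.0,
            "flapping": max_changes >= flap_threshold,
        }
    return report
```

Fed with the notifications from the story, a CPU trigger that goes on and off every few minutes during the nightly backups would come out with a high false alarm ratio and marked as flapping, which makes it a candidate for a tuned threshold or a maintenance window rather than for a mail filter.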
But in no case should it be normal for our system to tell us that we have alarms and for us to ignore them because they usually don't mean anything. To correct this, it is important to seek advice from experts, follow an appropriate monitoring methodology adapted to each organization's particular needs, and rely on the right tools, getting the most out of them so that monitoring becomes a faithful ally in the operation of systems and applications. Monitoring is not only about configuring the tools, but also about knowing how to scale with them and adapt their use to the technological reality of each company.