Do you monitor what should only work when everything else fails?
Backups and UPS, the forgotten ones
I think anyone who has worked with IT infrastructure for some time has run into at least one of these problems:
The backups stopped months ago and nobody noticed (because of a password change, a change of location, a full target disk, a failure of the NAS disks, etc.).
The UPS (Uninterruptible Power Supply) does not last an instant when the power goes out: after years without power cuts the batteries no longer hold a charge, or the load on the system has grown unchecked until the batteries drain almost immediately.
The machines and equipment that we keep in high availability, and that must take over when their "masters" fail, are switched off, out of date, have a full hard drive, etc.
Luckily, as always, the monitoring system comes to the rescue. We use our own platform, based on Zabbix 5.0 and Grafana 7, but what we describe here also works on other platforms, although it may be more complex to set up or less visual.
Monitoring of backups
There are many approaches to managing backups, from the typical script that runs from cron every night to platforms like Veeam Backup.
In the first case, and with most platforms, the usual thing is to end the script, or configure the tool, so that it sends an email with the result. If you have only a few servers or computers this is fine: the typical email you check with a cup of coffee in hand. But as soon as you have a certain volume, or you would rather save those 5 minutes every day, the best thing is a tool that emails you ONLY when something has failed. Besides, what happens when you don't get that email? Did the email fail, or did the backup not run? And who checks those emails when you're not there?
The approach is the same as for any other monitoring: warn only when there is a problem. We can have our monitoring tool read the emails or, if the script is our own, simply send a 1 or a 0 to the monitoring platform through a Zabbix trapper item or the API. On the platform side we can then define how long to wait before raising an alarm when neither a 1 nor a 0 has arrived within a given time, and so on. Total flexibility.
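As a minimal sketch of the "send a 1 or a 0" idea, the Python wrapper below runs a backup command and reports the outcome with the standard `zabbix_sender` command-line tool. The server name, host name, item key `backup.status` and the convention that 1 means success and 0 means failure are assumptions made for illustration; a trapper item with that key would have to exist on the Zabbix side.

```python
import subprocess
from typing import Callable, List, Sequence


def build_sender_command(server: str, host: str, key: str, value: int) -> List[str]:
    """Build the zabbix_sender call: -z server, -s host, -k item key, -o value."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", str(value)]


def run_backup_and_report(
    backup_cmd: Sequence[str],
    server: str,
    host: str,
    key: str = "backup.status",      # hypothetical trapper item key
    runner: Callable = subprocess.run,
) -> int:
    """Run the backup, then send 1 (success) or 0 (failure) to Zabbix."""
    try:
        result = runner(list(backup_cmd))
        status = 1 if result.returncode == 0 else 0
    except OSError:
        # The backup command could not even be started: report a failure.
        status = 0
    runner(build_sender_command(server, host, key, status))
    return status


if __name__ == "__main__":
    # Hypothetical script path and host names; replace them with your own.
    run_backup_and_report(["/usr/local/bin/nightly-backup.sh"],
                          server="zabbix.example.org", host="fileserver01")
```

On the Zabbix side, a trigger using the `nodata()` function on that item (for example with a period a little longer than the backup interval) would cover the case where nothing arrives at all, which is exactly the "silent" failure described above.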
Veeam Backup normally reports by email, but the truth is that reading emails from any monitoring platform, although feasible, often causes problems. Fortunately, Veeam can also send SNMP traps that report the result of every job, so we configure it to send those traps straight to our Zabbix (more information at: https://helpcenter.veeam.com/docs/one/alarms/snmp_traps.html?ver=100)
Raw traps are like this:
So we process them in Zabbix with one item per job (identified by #JOBNAME), as follows (we create the items through automatic discovery, but you can also define them one by one):
This way we can also view any problem in a "calendar" style, independently of the backup system or systems in use: the same view and the same alarms cover our own scripts, Veeam, etc.
Network Attached Storage (NAS) monitoring
It is very common for backup software or scripts to dump their copies onto NAS equipment from manufacturers such as Synology or QNAP, which is then responsible for copying them to an external USB hard drive or to cloud services such as Amazon Glacier. It is therefore vital to monitor that the NAS is working properly. Although the backup systems will warn us of full hard drives or an unavailable NAS, it is much better to anticipate these failures: to be warned of SMART errors on the disks, or of a foreseeable lack of capacity in time to buy more storage. Most of this data is exposed by the device through SNMP, and several community templates are available on the Internet.
You can also do fun things, such as watching for night-time spikes in network activity, as shown in the dashboard graph:
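To illustrate what "foreseeable lack of capacity" can mean in practice, here is a small Python sketch that fits a straight line to recent disk-usage samples and estimates how many days remain until the volume is full. The sample format (day number, used GB) and the function name are assumptions for the example; the real values would come from whatever the NAS reports over SNMP, and a monitoring platform may offer its own forecasting functions instead.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(
    samples: Sequence[Tuple[float, float]],  # (day number, used space in GB)
    capacity_gb: float,
) -> Optional[float]:
    """Least-squares linear projection of when usage reaches capacity.

    Returns the number of days after the most recent sample, or None when
    there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples taken on the same day
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # usage is flat or shrinking: no projected fill date
    intercept = mean_y - slope * mean_x
    full_day = (capacity_gb - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, full_day - last_day)


if __name__ == "__main__":
    # Illustrative figures only: 10 GB of growth per day on a 200 GB volume.
    history = [(0, 100.0), (1, 110.0), (2, 120.0)]
    print(days_until_full(history, capacity_gb=200.0))
```

A linear fit is the simplest possible choice; it ignores seasonality and sudden jumps, so the result is better treated as an early warning than as a precise date.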
Monitoring of Uninterruptible Power Supplies (UPS)
They come in all shapes, sizes and power ratings, but equipment from manufacturers such as RIELLO or APC is the last line of defence against a power outage. Not only does it keep the systems up during a brief cut (the most common kind), it also allows an orderly shutdown if the cut lasts longer than the batteries can cover. UPS units are therefore vital to guarantee the availability of both external and internal services, so that people can work, industrial machines can report their data, etc.
That is why it is very important to monitor their health, in addition to running controlled tests at least once a year. Most enterprise UPS units can be monitored through SNMP and, fortunately, there is a standard MIB (UPS-MIB) that most manufacturers respect. Thanks to it we can obtain fundamental data such as the load on the UPS, the estimated battery runtime in the event of a cut, alerts when the mains supply is down, and the power consumption of whatever is connected to it (useful cost data and often an indicator of problems).
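As a rough sketch of how the standard UPS-MIB values could be turned into alerts, the Python snippet below lists a few of the relevant OIDs and evaluates values that have already been polled. The thresholds, the dictionary keys and the function name are assumptions for illustration, and the actual SNMP polling (by Zabbix or any other tool) is deliberately left out.

```python
from typing import Dict, List, Optional

# Scalar and table objects from the standard UPS-MIB (RFC 1628).
UPS_MIB_OIDS = {
    "battery_status": "1.3.6.1.2.1.33.1.2.1.0",     # upsBatteryStatus
    "minutes_remaining": "1.3.6.1.2.1.33.1.2.3.0",  # upsEstimatedMinutesRemaining
    "charge_percent": "1.3.6.1.2.1.33.1.2.4.0",     # upsEstimatedChargeRemaining
    "output_source": "1.3.6.1.2.1.33.1.4.1.0",      # upsOutputSource
    "load_percent": "1.3.6.1.2.1.33.1.4.4.1.5.1",   # upsOutputPercentLoad, line 1
}

# Enumerations defined by the MIB.
BATTERY_STATUS = {1: "unknown", 2: "normal", 3: "low", 4: "depleted"}
OUTPUT_SOURCE_BATTERY = 5  # upsOutputSource value meaning "running on battery"


def evaluate_ups(
    values: Dict[str, Optional[int]],
    min_minutes: int = 10,        # assumed threshold, tune to your shutdown time
    max_load_percent: int = 80,   # assumed threshold
) -> List[str]:
    """Return human-readable problems found in already-polled UPS values."""
    problems: List[str] = []
    if values.get("output_source") == OUTPUT_SOURCE_BATTERY:
        problems.append("mains supply down: running on battery")
    status = values.get("battery_status")
    if status in (3, 4):
        problems.append(f"battery {BATTERY_STATUS[status]}")
    minutes = values.get("minutes_remaining")
    if minutes is not None and minutes < min_minutes:
        problems.append(f"only {minutes} min of estimated runtime")
    load = values.get("load_percent")
    if load is not None and load > max_load_percent:
        problems.append(f"load at {load}%")
    return problems


if __name__ == "__main__":
    # Illustrative values, not real measurements.
    print(evaluate_ups({"output_source": 5, "battery_status": 2,
                        "minutes_remaining": 6, "load_percent": 45}))
```

Checking the estimated runtime even while the mains supply is present is the point: it is how a worn-out battery shows up before the next power cut rather than during it.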
It is important to have tools and equipment in place for when things go wrong, that much is clear. But it is just as important to have the peace of mind that they will actually be there when the problems occur. For that, as we have seen, monitoring them is not enough: they also need controlled tests at regular intervals, tests in which, as always, our monitoring system will be the one to tell us whether the service is affected and for how long.