Muutech adds anomaly detection to its product, supported by ICEX InvestInSpain
Thanks to the "Foreign Companies Investment Program in R&D Activities 2019", of which Muutech has been a beneficiary, we have been able to expand the capabilities of Minerva, our product based on Zabbix and Grafana, to include automated detection of anomalies using artificial intelligence algorithms, specifically Deep Learning techniques. In this post we are going to tell you a little about how we did it and the results we obtained.
To increase the predictive maintenance capabilities of Muutech's platform, it must be able to detect, automatically, deviations or out-of-the-ordinary behavior in any of the metrics or indicators it collects, whether they are temperatures, vibrations or the number of users connected to our systems and servers. The system has to learn what the normal behavior of a given metric is, so it must be "trained" with historical data that is considered "acceptable". Afterwards, it processes new data in batches as they arrive and warns when the probability that these data fall outside the "normal" and "acceptable" range exceeds a user-configurable threshold. The system can also receive feedback from the user on whether or not it got the detection right. In other words, the user can tell the system that some patterns it flagged as out of the norm are in fact acceptable and simply were not captured in the training phase. With this feedback the system learns and improves, increasing the quality of the alarms. This is what is known as anomaly detection.
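The alert-and-feedback loop described above can be sketched in a few lines. This is only an illustration under our own assumptions: the class `AnomalyAlerter`, its methods and the example probabilities are hypothetical names and values, not the platform's actual interface, and the anomaly probabilities are assumed to come from some already trained model.

```python
from dataclasses import dataclass, field


@dataclass
class AnomalyAlerter:
    """Hypothetical helper: thresholds anomaly probabilities and stores feedback."""
    threshold: float = 0.8                        # user-configurable alert threshold
    accepted: list = field(default_factory=list)  # windows the user marked as normal

    def check_batch(self, batch):
        """Return the (window, probability) pairs whose probability exceeds the threshold."""
        return [(window, p) for window, p in batch if p > self.threshold]

    def mark_acceptable(self, window):
        """User feedback: the flagged window was fine, so keep it for the next training."""
        self.accepted.append(window)


# Made-up batch of scored windows (identifiers and probabilities are illustrative).
alerter = AnomalyAlerter(threshold=0.9)
batch = [("w1", 0.10), ("w2", 0.95), ("w3", 0.97)]
alerts = alerter.check_batch(batch)   # w2 and w3 exceed the threshold
alerter.mark_acceptable("w2")         # the user says w2 was actually acceptable
```

In a sketch like this, the windows collected in `accepted` would be added to the "acceptable" history the next time the model is trained, which is how the feedback would reduce false alarms over time.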
If we look at the graph below and go only by what it shows, human intelligence is enough to see that the marked anomalies are, a priori, correct. What is more or less immediate for a human in generic cases, a machine achieves through artificial intelligence, in this case Deep Learning.
Therefore, when it came to equipping our platform with an additional generic anomaly detection module, we started to work and research, with the help of a technology centre, on artificial intelligence and data processing techniques to address the problem.
We tested a couple of models: K-Means, based on clustering techniques, and an Autoencoder, based on neural networks. Both are unsupervised machine learning methods, since the data they work with is unlabeled, and both gave good results. Essentially, the system trains these models with windows of normal metric data and then, as new data arrive, compares the prediction made by the model with what has actually been measured. Simplifying a bit to make it understandable, if the difference between the prediction and reality, known as the reconstruction error, is very high, an anomaly is detected.
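To make the idea of reconstruction error more concrete, here is a minimal sketch using K-Means on sliding windows, written in plain Python. It is not the code that runs in the platform: the function names, the synthetic sine signal, the window size, the number of clusters and the rule for choosing the threshold (1.5 times the worst error seen in training) are all assumptions made for illustration.

```python
import math
import random


def sliding_windows(series, size):
    """Split a series into overlapping windows of `size` consecutive samples."""
    return [series[i:i + size] for i in range(len(series) - size + 1)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equally sized windows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_kmeans(windows, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm: learn k centroids from windows of normal data."""
    rng = random.Random(seed)
    centroids = [list(w) for w in rng.sample(windows, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for w in windows:
            nearest = min(range(k), key=lambda c: sq_dist(w, centroids[c]))
            clusters[nearest].append(w)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def reconstruction_error(window, centroids):
    """Distance from a window to its closest centroid (its 'reconstruction')."""
    return math.sqrt(min(sq_dist(window, c) for c in centroids))


# Synthetic "normal" history: a clean periodic signal (assumption, not real data).
normal = [math.sin(2 * math.pi * t / 24) for t in range(24 * 14)]
WINDOW = 12
train = sliding_windows(normal, WINDOW)
centroids = fit_kmeans(train, k=8)

# Assumed rule: alert above 1.5 times the worst error seen on the training data.
threshold = 1.5 * max(reconstruction_error(w, centroids) for w in train)

normal_window = normal[5:5 + WINDOW]   # a window the model has seen
odd_window = [0.0] * 6 + [6.0] * 6     # a jump to a level never seen in training

is_anomaly_normal = reconstruction_error(normal_window, centroids) > threshold
is_anomaly_odd = reconstruction_error(odd_window, centroids) > threshold
```

With an Autoencoder the scheme would be the same, except that the "reconstruction" is the output of the neural network instead of the nearest centroid.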
Obviously, the models work better or worse depending on the type of signal (whether it is stationary, periodic, etc.) and the time windows chosen. One of the points on which the project will continue is precisely that: knowing how to choose the most appropriate model for each situation, since what we want is a more or less generic system. Another key point for further research is the possibility of feeding the models with several different metrics, so that we can, for example, detect anomalies in industrial equipment from a combination of data such as temperature, vibration or quality of the parts produced.
After the first tests, we decided to try the system on something real, where we knew there was an anomaly and where we had enough data for training. To do this, we used the open data published for download by the City of Madrid, which includes historical measurements of sulfur dioxide (SO2), in this case those taken in the Plaza del Carmen, the most central measuring point. We fed this historical data into the Zabbix of our cloud platform:
Looking at the graph, we intuitively see an anomaly around March 27, 2019, where the levels drop significantly. To see whether our system detects it as well, we trained it with the previous three months and put it into analysis. The great advantage for the user of our platform is that the configuration of the models, the training, etc. is done from the platform itself as one more monitoring item, so no advanced knowledge is needed, nor any interaction with different platforms or constant exporting and importing of CSV files. This additional item tells the artificial intelligence analysis platform which signal to analyze, which period of time to use for training and any additional model parameters, and it stores the result of the analysis as an anomaly probability, on which an alarm can then be set as with any other type of metric.
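As a rough illustration of what "training with the previous three months" means in terms of data, the following sketch splits a history of dated samples into a training period and an analysis period. The function `split_by_date`, the 90-day training length, the analysis start date of 1 March 2019 and the sample values are our assumptions for the example; the real item configuration in the platform is not shown here.

```python
from datetime import date, timedelta


def split_by_date(samples, analysis_start, training_days=90):
    """Split (day, value) samples into a training period and an analysis period.

    Training covers the `training_days` days right before `analysis_start`;
    analysis covers everything from `analysis_start` onwards.
    """
    train_start = analysis_start - timedelta(days=training_days)
    train = [(d, v) for d, v in samples if train_start <= d < analysis_start]
    analysis = [(d, v) for d, v in samples if d >= analysis_start]
    return train, analysis


# Made-up daily values from 1 November 2018 to 31 March 2019 (not the real SO2 data).
first_day = date(2018, 11, 1)
samples = [(first_day + timedelta(days=i), 10.0) for i in range(151)]

train, analysis = split_by_date(samples, analysis_start=date(2019, 3, 1))
```

Here `train` would hold the samples used to fit the model and `analysis` the samples for which an anomaly probability is then computed.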
This is the result of the analysis, in this case applying Deep Learning with Autoencoder:
In red, the sulfur dioxide values measured in March 2019, and in green, the probability of an anomaly detected by the platform. We see that the system still needs refining, since it detects some anomalies in the first days of the month where in principle there are none, although these are isolated peaks. Even so, we clearly see that it detects recurrent anomalies from the 15th, and that these are practically constant from March 26, as we had seen visually. When we checked the data, we discovered that fines for accessing Madrid Central started to be applied on March 16, but it seems the effect was not noticeable until a few days later; we do not know whether this is due to the sensors, the behavior of the gas itself, or whether the measure was not really effective in the first days. This case reflects how the platform behaves in general: it will warn of an anomaly, but the interpretation is up to the user, who knows the process and the equipment and can rely on the rest of the data, visualized through the powerful Grafana-based platform.
Artificial intelligence helps us to work predictively and, above all, analyzes the data for us, as we have seen in the examples in this blog. This lets us rest easy knowing that the system is constantly watching and supervising without getting tired and will warn us of any problem, so that we can then act and make decisions with the support of the machines. In this way we can discover problems (stoppages, drops in quality and production) before they happen or before they have a real impact, helped by technology and supported by data.