In summer, it is very common for the available technical staff to shrink due to holidays, reduced hours, and so on, but our business does not stop. The problem with the technicians being out of the office, or hard to reach, is that nobody else in the company can tell whether a control panel like the one in the photo means everything is going well or whether we should worry about those peaks in the middle graph. For both the manager and the technician, it is better to avoid these doubts, which usually end with a call from the office to the beach. If you thought this article was about designing dashboards optimized against sun glare... that will have to wait for another time.
There is nothing wrong with having technical dashboards; on the contrary, they are a very useful tool for spotting relationships and problems in our systems and services. The problem arises when we only have technical dashboards, usually tied to our increasingly complex infrastructure: Kubernetes, cloud, legacy, serverless, multicloud, and so on.
Monitoring should actually be approached in the opposite direction to the usual one: starting from "above", closer to the business than to the rotation speed of hard disks. The first step towards effective monitoring, one that provides value from the very first metric, is twofold:
• On the one hand, agree with the business on the metrics they need to measure the service; in many cases this even means clarifying what the service itself is, which services have higher or lower priority, and what is expected of them. We are talking about numbers of users, registrations, failed registrations, shopping carts, sales, etc. Here it helps a great deal that managers do not usually understand technicalities, nor technicians business, so things, curiously, tend to come out clearer. At this point, many highly correlated metrics tend to emerge, such as "cloud infrastructure cost" and "number of autoscaled instances": they come from very different perspectives, but if we use the first indicator before or alongside the second, anyone will realise it is important that it does not rise too much. In short, this step helps put many indicators into context and ensures that, although the metrics differ, there is a common perspective aimed at efficiency and success (see the first sketch after this list).
• On the other hand, establish a few metrics that mean "the service is available": that our website loads in X milliseconds, that our APIs respond, that our users can log in, etc., building on what was done in the previous step. Zero logins, or a rise in failed registrations, can be a symptom that something is going wrong. From here, in later phases, we will pull the thread downwards, with new dashboards that help us locate the cause of a problem: the key is to detect problems, not just possible causes. This is fundamental to avoid constantly attending to alerts that do not really affect the service (see the second sketch below).
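To make the first point concrete, here is a minimal sketch, in Python, of what such an agreed metric catalogue might look like. All the metric names, sources, and "worry" conditions are invented for illustration; the real ones come out of the conversation with the business:

```python
from dataclasses import dataclass

# A hypothetical catalogue of the metrics agreed with the business.
# The point is that each metric has a shared, plain-language meaning.

@dataclass
class BusinessMetric:
    name: str     # what the business calls it
    source: str   # where the number comes from
    worry_if: str # condition everyone understands, techie or not

AGREED_METRICS = [
    BusinessMetric("active_users", "web analytics", "drops > 30% vs. last week"),
    BusinessMetric("failed_registrations", "app database", "rises above 5% of attempts"),
    BusinessMetric("open_carts", "app database", "falls to zero during business hours"),
    BusinessMetric("cloud_infra_cost_per_day", "cloud billing API", "grows faster than sales"),
]

for m in AGREED_METRICS:
    print(f"{m.name:26} <- {m.source:18} worry if: {m.worry_if}")
```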
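And for the second point, a minimal availability probe. The endpoints (https://example.com and its /api/login), the 800 ms threshold, and the synthetic credentials are all assumptions to be replaced by your own; the `requests` library does the HTTP calls:

```python
import requests  # pip install requests

MAX_LOAD_MS = 800  # placeholder: use the number agreed with the business

def homepage_loads() -> bool:
    """The website answers 200 within the agreed time."""
    try:
        resp = requests.get("https://example.com", timeout=5)
        return resp.ok and resp.elapsed.total_seconds() * 1000 <= MAX_LOAD_MS
    except requests.RequestException:
        return False

def users_can_login() -> bool:
    """A synthetic (test-only) user can authenticate."""
    try:
        resp = requests.post(
            "https://example.com/api/login",
            json={"user": "synthetic-check", "password": "not-a-real-secret"},
            timeout=5,
        )
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("homepage:", "OK" if homepage_loads() else "DEGRADED")
    print("login:   ", "OK" if users_can_login() else "DEGRADED")
```

Note that both checks answer a service question ("can users do X?"), not an infrastructure one ("is CPU high?").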
For both points, in a first phase we should try to choose a few truly key metrics and then move forward little by little, because if we try to collect too much data from the start, the product or service is likely to evolve faster than we can keep up with, or we will add so much complexity to the visualization that nobody ends up using it.
It is much harder to do it the other way around: first collecting every metric we can think of, even if we don't know what it means or how it impacts our service, plus default alarm templates and so on, and only then starting to prune. Most likely, visualizing all that information will come to look like a titanic task we never find time for, because we will be permanently busy attending to CPU alarms on development servers.
Once we have reached this point, our high-level dashboards should be very simple, reflecting the business and service status indicators we need: ones anyone can understand, and which really detect whether there is a technical problem or the problem was switching off the AdWords campaign in August. Tools such as Minerva, based on Zabbix and Grafana, help us define the availability of services based on different types of alarms.
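As a rough illustration of that idea (a generic sketch, not how Minerva itself is implemented), service availability can be derived by mapping only service-affecting alarms to each service. Alarm and service names here are hypothetical:

```python
# A service counts as unavailable only when a firing alarm is both
# mapped to it and marked as service-affecting.

SERVICE_AFFECTING: dict[str, set[str]] = {
    "web-shop": {"homepage_http_check", "login_check", "payments_api_check"},
    # CPU on a dev box is deliberately mapped to no service at all.
}

def service_state(service: str, firing_alarms: set[str]) -> str:
    relevant = SERVICE_AFFECTING.get(service, set()) & firing_alarms
    return ("DOWN: " + ", ".join(sorted(relevant))) if relevant else "AVAILABLE"

# A CPU alarm on a development server does not take the shop down:
print(service_state("web-shop", {"cpu_high_dev01"}))                  # AVAILABLE
print(service_state("web-shop", {"cpu_high_dev01", "login_check"}))   # DOWN: login_check
```

The design choice is the mapping itself: every alarm either affects a service someone cares about, or it should have to justify its own existence.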
There are a number of additional advantages to working in this way:
• Savings in monitoring and development infrastructure costs: collecting and storing metrics that only a database analyst with 10 years of experience can understand carries a very high hardware cost (including the number of screens to install in the office), as well as tool development and tuning costs. We guarantee that after that process you will also need a tool to monitor the monitoring platform itself.
• A better understanding of the business, with the technical side working together with it, something beneficial for both parties that also helps create a collaborative, trusting environment. A company in which only the systems people have real-time dashboards... is probably keeping them in a separate room or a basement. We have to put an end to this: if we won't stop saying that technology is key to the business, let's put it into practice!
• The whole team's thinking will become more service-oriented and SLA-oriented, with proper prioritization and alerting. It is not unusual to find cases where a user calls to say "the web is not working" and the technician answers "I see all the servers up and green", to which the user replies "Can you try it from your browser?", "OK, we'll check."
But don't worry: even if you currently collect more metrics per minute than your Apache or Nginx gets visits, it is a great exercise to think this through from zero: what conditions must be met to consider that my web service is up? The answer surely goes far beyond a ping, but stops short of the number of users connecting to your website from a city of more than 20,000 inhabitants with a beach. We think this is good homework for this summer 😊
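And, as a starting point for that homework, one deliberately simple way to write the answer down. The check names are hypothetical (for instance, the probes sketched earlier); deciding which conditions belong here is exactly the summer exercise:

```python
# Sketch: "my web service is up" as a conjunction of user-visible checks.

def web_service_is_up(
    responds_to_ping: bool,
    homepage_loads_in_time: bool,
    users_can_login: bool,
    checkout_works: bool,
) -> bool:
    # Ping alone is not enough...
    if not responds_to_ping:
        return False
    # ...what matters is what the user can actually do.
    return homepage_loads_in_time and users_can_login and checkout_works

print(web_service_is_up(True, True, True, False))  # False: green servers, broken checkout
```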