High availability on your monitoring servers

September 5, 2019
Monitoring systems are a fundamental part of any technology company today, so it is essential that they are always available for consultation. If they are not, a problem with a service can go unnoticed and end in a catastrophic failure for the company.

The traditional way to solve this is to detect that the system has crashed and launch a new server manually, with all the configuration work that entails. At Muutech we set out to automate this process. Because of how Zabbix works, it is not possible to have several active servers at the same time, so we needed a solution for servers working in active-passive mode: while one server has the service up and running, the other stands ready to take over when required.

The main objective is to create a network of servers in which the Zabbix service does not suffer availability problems. The first step is to identify the failures we need to handle:

  • A complete server crash.
  • Network crash of the server on which Zabbix is running.
  • Downtime of the offered service due to an error in its execution.
Each of these failures must be handled differently by each of the nodes in the network, and, of course, in a way that is completely transparent to the user. To achieve this we need to design an architecture capable of responding quickly and effectively to these errors, and to prepare the different machines for them.

This solution works for networks ranging from three nodes up to several hundred, so it is completely scalable. To simplify the explanation, we will use the simplest example, with three nodes, and then explain how it works with more.

Simple Solution

To achieve high availability, we opted for Apache Zookeeper, originally a subproject of Hadoop. It is a widely tested and proven technology, used by several large companies (YouTube, eBay, Yahoo, among others) and software products (PTC ThingWorx, Patroni, etc.), which gives us confidence that it can be very effective.

First of all, we need to install Zookeeper on all three nodes of our network and configure them to replicate information among themselves, to avoid data loss. We need at least three nodes in the network to obtain a quorum and, with it, a healthy cluster.

A quorum is the set of nodes that replicate the same data within the same cluster and the same application. All these nodes hold the same copy of the configuration file (explained below), since they must stay perfectly synchronized with each other.

A healthy cluster requires three or more servers in the quorum, because only then can a "majority" of nodes with synchronized data survive the loss of one node. To see this more clearly, consider the following example: with only two nodes, if the connection between them is lost, each node accumulates different data while disconnected. When they reconnect, which copy is the correct one to synchronize from? There is no simple way to resolve this conflict (the classic "split-brain" problem), which is why the minimum number of nodes for a healthy cluster is considered to be three.
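The majority rule behind this is simple arithmetic: the cluster keeps working only while a strict majority of its members can see each other. A minimal sketch (the function name is ours, for illustration only):

```python
def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """A Zookeeper ensemble keeps serving only while a strict
    majority of its members can talk to each other."""
    return reachable > ensemble_size // 2

# With 3 nodes, losing 1 still leaves a majority of 2:
print(has_quorum(3, 2))  # True
# With 2 nodes, losing 1 leaves no majority, so the cluster stops:
print(has_quorum(2, 1))  # False
```

This is also why ensembles usually have an odd size: going from 3 to 4 nodes raises the required majority from 2 to 3, so it does not let you tolerate any extra failures.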

This is achieved with the following steps:

1. We write the following in the configuration file (zoo.cfg) of each node:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=zoo1ip:port1:port2
server.2=zoo2ip:port1:port2
server.3=zoo3ip:port1:port2
Where:

  • tickTime: The length, in milliseconds, of one "tick", the basic time unit Zookeeper uses for its heartbeats.
  • initLimit: The number of ticks a node may take to connect and synchronize with the leader during the initial synchronization phase.
  • syncLimit: The maximum number of ticks that may pass between a ping and its answer; if it is exceeded, the connection is considered lost.
  • dataDir: Where Zookeeper stores the node's data.
  • clientPort: The port through which clients connect.
All the servers are listed in each node's configuration, including the node itself. In addition, we need a file called "myid" in the folder indicated by "dataDir". This file contains a number that is unique within the Zookeeper cluster (usually the one corresponding to the node's server.X entry).
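These two files can be generated per node with a few lines of code. The following Python sketch is ours, for illustration only; note that in a real installation zoo.cfg lives in Zookeeper's conf/ directory, while here we write it next to the data directory just to keep the demo self-contained:

```python
from pathlib import Path

def write_node_config(data_dir: Path, node_id: int, servers: dict) -> None:
    """Write zoo.cfg plus the 'myid' file for one ensemble member.

    `servers` maps node id -> "ip:port1:port2" (quorum and
    leader-election ports). The dict is identical on every node;
    only `node_id` changes per machine.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",
        "clientPort=2181",
    ]
    lines += [f"server.{i}={addr}" for i, addr in sorted(servers.items())]
    (data_dir / "zoo.cfg").write_text("\n".join(lines) + "\n")
    # 'myid' holds only this node's unique id, matching its server.X entry.
    (data_dir / "myid").write_text(f"{node_id}\n")
```

Run it once on each node with that node's own id and the shared server map.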

2. We now run the servers, either manually or through a service. A systemd service description is given below:

[Unit]
Description=Zookeeper Service
After=network.target

[Service]
Type=forking
WorkingDirectory=$ZOOKEEPER_PATH_OF_INSTALLATION
PIDFile=$ZOOKEEPER_PATH_OF_INSTALLATION/data/zookeeper_server.pid
SyslogIdentifier=zookeeper
User=zookeeper
Group=zookeeper
ExecStart=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh start
ExecStop=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh stop
TimeoutSec=20
SuccessExitStatus=130 143
Restart=on-failure

[Install]
WantedBy=multi-user.target
This way Zookeeper will start automatically, and systemd will relaunch it if it fails.

3. Now, on the servers where the Zabbix service is installed, we install our client, which, after a small configuration, takes care of keeping the node synchronized with the Zookeeper servers.

The resulting architecture in this case would look something like this:

[Figure: HA-architecture-1]
In it we can see how our Zookeeper client is in charge of maintaining the connection with the servers. This way, the passive server knows the status of the active server at all times, in case it needs to be activated.
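The usual mechanism for this is an ephemeral znode: the active server creates it, and Zookeeper deletes it automatically when that server's session dies, which is what alerts the passive side. The toy simulation below models that behaviour entirely in memory; the class and names are ours, and a real client would of course talk to the actual ensemble:

```python
class FakeZooKeeper:
    """In-memory stand-in for the ensemble: ephemeral znodes plus watches."""

    def __init__(self):
        self.znodes = {}   # path -> owning session
        self.watches = {}  # path -> callbacks fired when the znode is deleted

    def create_ephemeral(self, path, session):
        self.znodes[path] = session

    def watch_deletion(self, path, callback):
        self.watches.setdefault(path, []).append(callback)

    def session_expired(self, session):
        # The server crashed or lost its network: all of its ephemeral
        # znodes vanish, and any deletion watches on them fire.
        for path in [p for p, s in self.znodes.items() if s == session]:
            del self.znodes[path]
            for cb in self.watches.pop(path, []):
                cb()

events = []
zk = FakeZooKeeper()
# The active server takes the role by creating the ephemeral znode.
zk.create_ephemeral("/zabbix/active", session="server-A")
# The passive server watches it and promotes itself when it disappears.
zk.watch_deletion("/zabbix/active", lambda: events.append("server-B starts Zabbix"))
# Server A crashes -> its session expires -> failover happens automatically.
zk.session_expired("server-A")
print(events)  # ['server-B starts Zabbix']
```

The same deletion event covers all three failure cases: whether the service stops cleanly, the whole machine crashes, or the network drops, the ephemeral znode disappears and the passive server reacts.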

Now let's see how we would act in each of the failure cases mentioned above:

In the event of a service failure due to an error in its execution, the Linux service manager (systemd) first tries to relaunch it. If it does not succeed, our client stops the service and informs the Zookeeper server, which in turn informs the passive server that it has to be activated.

[Figure: HA-architecture-2]
In the event of a complete server crash, the Zookeeper server detects the loss of connection from the active server and considers the service to be down. It then informs the passive server that it has to be activated.

In the last case, a network crash on the server where Zabbix is running, our client checks against another server whether its own network interfaces have gone down. If so, it stops its service so that, when the network comes back, both servers are not active at the same time. The Zookeeper server and the passive server behave exactly as in the previous case, so there is always an active server.

[Figure: HA-architecture-3]
If there is more than one passive server, they all queue up to wait their "turn". If the active server goes down, one of the passive servers takes its place, while the remaining passive servers keep waiting and watching its status.

Conclusion

As you can see, this solution brings a substantial improvement in the availability of the monitoring platform, as the system reacts automatically (failover) to a variety of problems: network loss, server crashes, service downtime, etc.
 