High availability on your monitoring servers
September 12, 2019
Monitoring systems are a fundamental part of any technology company today, so it is essential that they are always available for consultation. If this is not achieved, any problem with the service can result in a catastrophic failure for our company.
The traditional way to solve this is to detect the system crash and launch a new server manually, with all the configuration work that this entails. At Muutech we set out to automate this process. Due to the way Zabbix works, it is not possible to have several active servers at the same time, so we had to look for a solution for servers working in active-passive mode: while one server has the service up and running, the other stands ready to launch when required.
The main objective is to create a network in which the Zabbix service does not suffer availability problems. The first step is to identify the possible problems to solve:
- A complete server crash.
- A network failure on the server where Zabbix is running.
- Downtime of the offered service due to an error in its execution.
This solution is effective for networks ranging from three nodes to several hundred of them, so it is completely scalable. To simplify the explanation, however, we will use the simplest example, three nodes, and then explain how it works with more.
Simple Solution
To achieve high availability, we opted for Zookeeper, a sub-project of Hadoop. It is a widely tested and proven technology, used by several large companies (YouTube, eBay and Yahoo, among others) and software products (PTC ThingWorx, Patroni, etc.), which gives us an idea of how effective it can be.

First of all, we need to install Zookeeper on the three nodes of our network and configure them so that they replicate information between them, to avoid data loss. We need at least three nodes in the network to obtain a quorum and, with it, a healthy cluster.
A quorum is the set of nodes that replicate with each other within the same cluster and the same application. All these nodes hold the same copy of the configuration file (explained below), since they must be perfectly synchronized with each other.
A healthy cluster is achieved when there are three or more servers within a quorum, because then, if the connection to one node is lost, a "majority" of nodes with synchronized data can still be maintained. To see it more clearly, consider the following example: we have two nodes and the connection between them is lost; when they reconnect, each holds different data. Which data would be the correct ones to synchronize? There is no simple way to solve this problem, and for this reason three is considered the minimum number of nodes for a healthy cluster.
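To make the majority rule concrete, here is a small illustrative calculation in Python (not part of any Zookeeper tooling, just the arithmetic behind the rule):

# A quorum needs a strict majority of the ensemble, so a cluster of
# n nodes tolerates (n - 1) // 2 failures while keeping a majority.
def majority(n: int) -> int:
    return n // 2 + 1

for n in (2, 3, 5, 7):
    print(f"{n} nodes: majority = {majority(n)}, "
          f"tolerated failures = {n - majority(n)}")

With two nodes the majority is two, so losing either node (or the link between them) breaks the quorum; with three nodes, one failure can be tolerated.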
The Zookeeper setup is achieved with the following steps:
1. We write the following in the Zookeeper configuration file (conf/zoo.cfg) of each node:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=zoo1ip:port1:port2
server.2=zoo2ip:port1:port2
server.3=zoo3ip:port1:port2

Where:
- tickTime: the basic time unit, in milliseconds, between each ping (heartbeat).
- initLimit: the number of ticks that the first synchronization phase may take when a node connects to the cluster.
- syncLimit: the maximum number of ticks that may pass between a ping and its answer; if it is exceeded, the connection is considered lost.
- dataDir: the directory where Zookeeper stores its node data.
- clientPort: the port through which clients connect.
- server.N: the address of each node, where port1 is used by followers to talk to the leader and port2 for leader election (commonly 2888 and 3888).
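One detail the configuration above depends on: each node must also know which server.N entry it is. Zookeeper reads this from a file named myid inside dataDir, containing only that node's number. For example:

# contents of /opt/zookeeper/data/myid on the node configured as server.1
1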
2. We will now run the servers, either manually or through a service. A systemd service definition is given below:
[Unit]
Description=Zookeeper Service

[Service]
Type=forking
WorkingDirectory=$ZOOKEEPER_PATH_OF_INSTALLATION
PIDFile=$ZOOKEEPER_PATH_OF_INSTALLATION/data/zookeeper_server.pid
SyslogIdentifier=zookeeper
User=zookeeper
Group=zookeeper
ExecStart=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh start
ExecStop=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh stop
Restart=on-failure
TimeoutSec=20
SuccessExitStatus=130 143

[Install]
WantedBy=multi-user.target

Note that zkServer.sh start forks into the background, so the unit uses Type=forking together with the PID file that the script writes. This way, once the service is enabled (systemctl enable zookeeper), Zookeeper will start automatically and be relaunched if it fails.
3. Now, on the servers where the Zabbix service is installed, we install our client which, with a small configuration, takes care of keeping them synchronized with the Zookeeper servers.
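Our client itself is proprietary, but to illustrate the idea, here is a minimal sketch of an active-passive failover client written in Python with the kazoo Zookeeper library; the hosts, znode path and service name are assumptions for the example:

import subprocess
import time

from kazoo.client import KazooClient

# Assumed addresses of the three Zookeeper nodes configured earlier.
ZK_HOSTS = "zoo1ip:2181,zoo2ip:2181,zoo3ip:2181"

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()

def service_is_healthy():
    # Ask systemd whether the (assumed) Zabbix unit is still active.
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "zabbix-server"])
    return result.returncode == 0

def run_zabbix():
    # Runs only on the node that wins the election: it becomes the
    # active server and starts Zabbix.
    subprocess.run(["systemctl", "start", "zabbix-server"])
    while service_is_healthy():
        time.sleep(5)
    # The service failed and systemd could not keep it up: stop it
    # cleanly and return, releasing leadership to a passive node.
    subprocess.run(["systemctl", "stop", "zabbix-server"])

# Every Zabbix node runs this; kazoo queues the contenders, so exactly
# one node is active while the others wait their turn.
election = zk.Election("/zabbix-ha/election", "this-node-name")
election.run(run_zabbix)

A real client would loop to re-enter the election after losing leadership and would use richer health checks, but the structure is the same.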
The resulting architecture would look something like this:
In it we see how our Zookeeper client is in charge of maintaining the connection with the servers. This way, the passive server knows the status of the active server at all times, in case it needs to be activated.
Now let's see how we would act in each of the failure cases mentioned above:
In the event of a service failure due to an error in its execution, the Linux service manager (systemd) first tries to relaunch it. If it does not succeed, our client stops the service and informs the Zookeeper server, which in turn informs the passive server that it has to activate.
In the event of a complete server crash, the Zookeeper server detects the loss of connection from the active server and therefore considers the service down. Because of this, it informs the passive server that it has to activate.
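This kind of detection typically relies on Zookeeper sessions: the active server registers itself with an ephemeral znode, which Zookeeper deletes automatically when that server's session expires, and the passive side is notified. A sketch with the same assumed kazoo client (zk) as above:

# On the active server: register with an ephemeral znode (assumed path).
zk.create("/zabbix-ha/active", b"server-a", ephemeral=True, makepath=True)

# On the passive server: a watch fires whenever that znode changes.
@zk.DataWatch("/zabbix-ha/active")
def on_active_change(data, stat):
    if stat is None:
        # The znode is gone: the active server's session expired.
        promote_to_active()  # hypothetical helper that starts Zabbix here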
In the last case, a network failure on the server where Zabbix is running, our client checks against another server whether its own network interfaces have gone down. If so, it stops its service so that, when the network comes back, both servers are not active at the same time. The Zookeeper server and the passive server behave exactly as in the previous case, so there is always one active server.
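A sketch of that split-brain guard, again under the assumptions of the earlier example: if none of the Zookeeper nodes can be reached, the node assumes its own network is down and stops Zabbix so that two active servers never coexist when the network returns:

import socket
import subprocess

# Assumed endpoints of the Zookeeper ensemble.
ZK_ENDPOINTS = [("zoo1ip", 2181), ("zoo2ip", 2181), ("zoo3ip", 2181)]

def network_is_up(timeout=2.0):
    # The network is considered up if a TCP connection to at least one
    # Zookeeper node succeeds.
    for host, port in ZK_ENDPOINTS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            pass
    return False

if not network_is_up():
    # Isolated: stop the local service to avoid a split brain.
    subprocess.run(["systemctl", "stop", "zabbix-server"])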
If there is more than one passive server, all of them queue up to wait their "turn". Therefore, if the active server goes down, one of the passive servers takes its place, while the remaining ones keep watching the status of the new active server.
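This kind of "turn" queue is commonly built from ephemeral sequential znodes: each server registers one, the lowest sequence number is the active server, and the rest wait in order. A minimal sketch, reusing the assumed kazoo client (zk):

# Each server registers an ephemeral, sequential znode in the queue.
path = zk.create("/zabbix-ha/queue/node-", b"server-b",
                 ephemeral=True, sequence=True, makepath=True)
me = path.rsplit("/", 1)[1]

# The queue order is simply the sorted list of sequence numbers.
queue = sorted(zk.get_children("/zabbix-ha/queue"))
position = queue.index(me)  # 0 means active, 1+ is the place in line

Because the znodes are ephemeral, a crashed server simply disappears from the queue and everyone behind it moves up one place.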
Conclusion
As you can easily see, this solution results in a substantial improvement in the availability of the monitoring platform, as the system is able to react automatically (failover) to various problems: network loss, server crashes, service downtime, etc.