Monitoring systems are a fundamental part of any technology company today, so it is essential that they are always available for consultation. Otherwise, any problem with the service can turn into a catastrophic failure for the company.
The traditional way to solve this would be to detect the system crash and launch a new server manually, with all the configuration work that entails. At Muutech we set out to automate this process. Because of how Zabbix works, it is not possible to have several active servers at the same time, so we had to look for a solution for servers running in active-passive mode: while one server has the service up and running, the other stands ready to take over when required.
The main objective is to build a network in which the Zabbix service does not suffer availability problems. The first step is to identify the problems to solve:
- A complete server crash.
- Network crash of the server on which Zabbix is running.
- Downtime of the offered service due to an error in its execution.
Each of these failures must be handled differently from each of the nodes in the network and, of course, in a way that is completely transparent to the user. To achieve this, we need to design an architecture capable of responding quickly and effectively to these errors, and to prepare the different machines accordingly.
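To make these failure modes concrete, here is a minimal sketch (in Python, with a hypothetical host and port) of the kind of reachability check a passive node could run against the active Zabbix server. It only illustrates the idea, not our actual client:

```python
import socket

def service_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    A failed connection cannot distinguish between the three failure
    modes above (server crash, network crash, service down); it only
    tells the passive node that a takeover may be needed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical check against the active Zabbix server's trapper port:
# service_reachable("zabbix-active.example.com", 10051)
```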
This solution works for networks ranging from three nodes to several hundred, so it is completely scalable. To simplify the explanation, we will use the simplest example, three nodes, and then explain how it would work with more.
Simple Solution
To achieve high availability, we opted for Zookeeper, a subproject of Hadoop. It is a widely tested and proven technology, used by several large companies (YouTube, eBay, Yahoo, among others) and software products (PTC ThingWorx, Patroni, etc.), which gives us confidence that it can be very effective.
First of all, we need to install Zookeeper on all three nodes of our network and configure them so that they replicate information between them, avoiding data loss. We need at least three nodes in the network to obtain a quorum and, with it, a healthy cluster.
A quorum is the set of nodes that replicate within the same cluster for the same application. All these nodes keep the same copy of the configuration file (explained below), since they must stay perfectly synchronized with each other.
A healthy cluster requires three or more servers in the quorum: that way, if the connection to one node is lost, a majority of nodes with synchronized data remains. To see this more clearly, consider the following example: we have two nodes and the connection between them is lost; when they reconnect, each holds different data. Which data is the correct one to synchronize? There is no simple way to resolve this conflict, which is why the minimum number of nodes for a healthy cluster is considered to be three.
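The majority rule behind this comes down to one line of arithmetic; this small sketch just illustrates why three nodes is the minimum useful cluster size:

```python
def majority(n):
    """Smallest number of nodes that constitutes a majority of n."""
    return n // 2 + 1

# With 3 nodes, losing one still leaves a majority of 2 in agreement:
assert majority(3) == 2
# With 2 nodes, a majority is both of them, so losing either one
# leaves no side that can safely claim to hold the correct data:
assert majority(2) == 2
# Note that 4 nodes tolerate no more failures than 3 (majority is 3):
assert majority(4) == 3
```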
This is achieved with the following steps:
1. We write the following in the configuration file (zoo.cfg) of each node:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=zoo1ip:port1:port2
server.2=zoo2ip:port1:port2
server.3=zoo3ip:port1:port2
Where:
- tickTime: The basic time unit, in milliseconds; a ping (heartbeat) is sent every tick.
- initLimit: The number of ticks allowed for the first synchronization phase to complete.
- syncLimit: The maximum number of ticks that can pass between a ping and its answer; if it is exceeded, the connection is considered lost.
- dataDir: The directory where Zookeeper stores its node data.
- clientPort: The port through which clients connect.
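With the values above, the limits expressed in ticks translate into the following wall-clock windows (simple arithmetic, not Zookeeper code):

```python
TICK_MS = 2000      # tickTime from the config above
INIT_LIMIT = 10     # ticks allowed for the initial sync phase
SYNC_LIMIT = 5      # max ticks between a ping and its answer

init_window_ms = TICK_MS * INIT_LIMIT
sync_window_ms = TICK_MS * SYNC_LIMIT

assert init_window_ms == 20000  # 20 s to complete initial synchronization
assert sync_window_ms == 10000  # 10 s of silence before the connection is lost
```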
The server list is written on each node, including the node itself. In addition, we need a file called "myid" in the folder indicated by "dataDir". This file contains a number that is unique within the Zookeeper cluster (usually the X corresponding to that node's server.X entry).
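As a sketch, the per-node files from step 1 could be generated with a small script. The host names are illustrative stand-ins for zooXip, and the peer ports 2888/3888 (Zookeeper's conventional defaults) stand in for port1/port2; for simplicity both files are written to the same directory, whereas in a real installation zoo.cfg lives in Zookeeper's conf/ folder:

```python
import os

# id -> host; illustrative names standing in for zoo1ip..zoo3ip
SERVERS = {1: "zoo1ip", 2: "zoo2ip", 3: "zoo3ip"}

def render_zoo_cfg(data_dir):
    """Render the zoo.cfg contents shown in step 1."""
    lines = [
        "tickTime=2000",
        "initLimit=10",
        "syncLimit=5",
        f"dataDir={data_dir}",
        "clientPort=2181",
    ]
    # 2888/3888 are Zookeeper's conventional peer/election ports
    lines += [f"server.{i}={host}:2888:3888" for i, host in sorted(SERVERS.items())]
    return "\n".join(lines) + "\n"

def write_node_files(node_id, data_dir):
    """Write zoo.cfg and this node's myid file under data_dir."""
    os.makedirs(data_dir, exist_ok=True)
    with open(os.path.join(data_dir, "zoo.cfg"), "w") as f:
        f.write(render_zoo_cfg(data_dir))
    with open(os.path.join(data_dir, "myid"), "w") as f:
        f.write(f"{node_id}\n")  # must match this node's server.X entry
```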
2. We will now proceed to run the servers, either manually or through a service. The description of the service is given below:
[Unit]
Description=Zookeeper Service
After=network.target
[Service]
Type=forking
WorkingDirectory=$ZOOKEEPER_PATH_OF_INSTALLATION
PIDFile=$ZOOKEEPER_PATH_OF_INSTALLATION/data/zookeeper_server.pid
SyslogIdentifier=zookeeper
User=zookeeper
Group=zookeeper
ExecStart=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh start
ExecStop=$ZOOKEEPER_PATH_OF_INSTALLATION/bin/zkServer.sh stop
Restart=on-failure
TimeoutSec=20
SuccessExitStatus=130 143
[Install]
WantedBy=multi-user.target
This way Zookeeper will start automatically.
3. Now, on the servers where the Zabbix service is installed, we install our client, which, with a small configuration, takes charge of maintaining synchronization with the Zookeeper servers.
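Under the hood, a standard Zookeeper recipe for this active-passive behavior is leader election with ephemeral sequential znodes: each client registers a node and the lowest sequence number runs the service. This is a simplified, self-contained sketch of that rule, not our client's actual code:

```python
def active_server(candidates):
    """Zookeeper-style election: the candidate holding the znode with
    the lowest sequence number runs the Zabbix service."""
    return min(candidates, key=lambda znode: int(znode.rsplit("-", 1)[1]))

# Each node registered an ephemeral sequential znode (names are illustrative):
znodes = ["node-0000000003", "node-0000000001", "node-0000000002"]
assert active_server(znodes) == "node-0000000001"

# If the active node dies, its ephemeral znode vanishes and the next
# lowest sequence number takes over the service automatically:
znodes.remove("node-0000000001")
assert active_server(znodes) == "node-0000000002"
```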
In this case, the resulting architecture would look something like this: