How to Achieve High Availability in ES – Cross Cluster Replication, Load Balancer Approach and Snapshots (/Backup and Restore)
Last Updated : August 2020
High availability in Elasticsearch, as well as data recovery, are two of the most crucial needs for running mission critical ES clusters. Key data is stored in various clusters across ES infrastructure and it’s imperative to be able to retrieve data from a certain cluster should it go down for any reason, while preserving service continuity. There are three common methods for ensuring that data remains at high availability, and each method has its own advantages and disadvantages.
1. Snapshots – Backup and Restore
Firstly, the snapshot method ensures that the data on a chosen cluster is replicated and saved in a different location. In this case, the data is backed up to an external file system, like S3, GCS or any other backend repository that has an official plugin. To ensure high availability with Snapshots, users can designate a periodical backup of a cluster, and the backed up data can then get restored in a secondary cluster.
The primary disadvantage of the Snapshot method is that the data is not backed up and restored in real-time. The built-in delay can cause users to lose valuable data collected in between Snapshots. If, for example, a user designated a Snapshot and Restore process to occur every 5 minutes, the data being backed up is always 5 minutes behind. If a cluster fails 4 minutes after the last Snapshot was taken, 4 minutes of data will be completely lost.
The advantage of this method is that users gain a cloud backup of the data. Should both the primary and secondary clusters crash, users can still access their data in the external file system (minus whatever data was collected in the time since the last Snapshot) and restore it in a new cluster.
Snapshot and Restore Pro and Cons
|Ensures HA and DR - to a certain extent||Data is not backed up in real time - the delay could cause loss of data|
|Is part of the OSS - free and open source||Data is not available for search in real time in the secondary cluster|
|Cloud backup for data|
2.Cross Cluster Replication (CCR)
The Elasticsearch Cross Cluster Replication feature built into ES can be employed to ensure data recovery (DR) and maintain high availability (HA). In CCR, the indices in clusters are replicated in order to preserve the data in them. The replicated cluster is called the remote or cluster, while the cluster with the backup data is known as the local cluster.
There are various challenges involved in using Cross Cluster Replication for DR and HA. These include:
A. Direct connection between the two clusters
Pulling data adds additional load
The replication process is performed on a pull basis, with the local cluster pulling the information from the remote cluster. If we call Cluster A the remote/primary cluster and Cluster B the local/secondary cluster, Cluster B is the one pulling information from Cluster A. This means that both clusters have to perform under increased load. The synchronization of the two adds an additional load to both and impacts disk and network functionality. In order to pull the information, Cluster B has to constantly check with Cluster A if there is anything to pull.
Complicated and manual operation needed to change leaders/followers
In Cross Cluster Replication, you’ll need to configure followers for every index. If Cluster A (the primary) crashes, the same data is replicated in its entirety on Cluster B (the secondary) and remains accessible while Cluster A is fixed. This is how the CCR feature ensures high availability in the case of a cluster failure. While Cluster A is being fixed, the replication process will stop and Cluster B will wait for A to become active again.
Another challenge with Cross Cluster Replication is that it does not allow users to easily designate leaders and followers and later change them according to your needs. To change which index is the leader and which is the follower requires a manual and very complicated operation. To switch a follower index to a leader index, you need to pause replication for the index, close it, unfollow the leader index, and reopen it as a normal ES index.
B. Dependency between clusters
Index settings and number of shards
Index settings must be the same for leaders and followers. The follower index has to have the same number of shards as its leader index, you cannot set the follower index shard count to a higher number than the leader. Therefore, it’s also not possible to replicate multiple leader indices into a single follower index.
The remote clusters are dependent on the local clusters, and keep old history for them, which is one of the downsides of “soft delete” in Lucene. Though useful for updates and deletes, it adds additional complexity that ties the primary clusters with the secondary clusters. After a period of time, the markers will expire and the leader shards will be able to merge away the history.
Setting up new leaders
When setting up a new primary cluster, remote recovery must occur and be completed before the leader index can be used. This is a network and disk intensive process, which A) prevents the primary cluster from working until recovery is over, and B) impacts performance on the primary cluster. This is another example of the complex dependency between the clusters.
In order to replicate to the follower, the primary cluster should be healthy. In cases of cluster service down, there would be no writes to the primary and no replications. This is not an active-active solution.
Older ES versions and running CCR between different versions
If the remote cluster and local cluster don’t belong to the same or compatible versions of ES, the replication process cannot be performed. Also, it is important to note that CCR only works starting ES v6.7.
This also further complicates ES version upgrade, as both indexing and CCR need to be halted before beginning the process.
CCR Pro and Cons
|Ensures high availability||Connection between clusters causes additional load|
|Ensures data recovery||High cluster dependency|
|Occurs in real time - no delay||Does not support active-active mode well|
|Included in the X-Pack||Failover is one-way and difficult to revert|
|Configuration of followers for every index|
|Limited number of shards and index settings in follower clusters|
|Compatibility issues between ES versions|
|Need to purchase the entire X-Pack in order to get CCR - CCR is not open source/free|
|Available only starting ES v6.7|
3. Multi-Cluster Load Balancer
In the Multi-Cluster Load Balancer approach, data is directed through an ES Balancer to various clusters in the most efficient way to provide constant and real time high availability. The Balancer enables both the separation and replication of key data across multiple clusters in real time. It also ensures that the clusters themselves are not overloaded with tasks other than their main purpose. The direction of data to each cluster requires the cluster to index the information, but nothing more.
In both the CCR and the Snapshot methods, errors in the leader affect both clusters, and failover is more complex in CCR. The Multi-Cluster Load Balancer method maintains separate management for each cluster handling data with errors and therefore each error can be taken care of on a case-by-case basis.
When using the Multi-Cluster Balancer, the destination clusters don’t need to communicate between themselves, which spares them the added burden of communication and synchronization. This also allows better separation of Elasticsearch topology and reduces cluster communication. For the same reason, this architecture also solves the issue of Elasticsearch version compatibility and supports high availability across all clusters regardless of version. Opster’s Balancer is backward compatible, until version 1.x. The Balancer only requires configuration once per tenant.
Though the Multi-Cluster Load Balancer requires the addition of another component into the ES system that needs to be monitored and maintained, this can be seen as an advantage. Elasticsearch’s core functionality has always ensured amazing search capabilities and near-real time indexing of fresh data. By moving peripheral tasks to other components, you allow ES to perform its core functionality better. The outsourcing of data recovery and high availability, which are not part of the core tasks, can enable ES’s core to operate at optimal performance and simultaneously provide users with the solution and support they need for high availability.
Multi-Cluster Load Balancer -Pros and Cons
|Ensures High Availability||Necessitates adding an outside component to ES architecture|
|Ensures Data Recovery||Indexing to the secondary cluster might cause higher CPU than syncing already indexed files (but it will reduce primary load on disk and network in exchange)|
|Maximum utilization of every cluster|
|Supports active-active mode|
|No communication between clusters - reduces load|
|No cluster dependency|
|Backwards compatibility back to ES v1.x|
|Separate management for every error|