OpenSearch Rolling Restart: How to Perform Rolling Restarts

By Opster Team

Updated: Jun 28, 2023 | 4 min read

Introduction 

Restarting nodes in a distributed system like OpenSearch is always challenging: the cluster is usually serving live traffic, so you have to keep it up and running while reconfiguring your nodes. Safely changing your configuration while maintaining cluster availability comes with certain challenges, and in this guide we’ll cover what they are and how to overcome them.

What is a rolling restart?

Rolling restarts are relevant for distributed applications. Most distributed applications are built to avoid downtime, which means we need a way to load configuration changes while all nodes are up and running.

A rolling restart allows you to reload new configurations node by node, without losing the high availability of the distributed system, in this case OpenSearch.

The difference between full cluster restarts and rolling restarts

A full cluster restart is when we stop all the nodes at once. In rolling restarts, we stop the nodes one-by-one and change the configurations that we want while avoiding cluster downtime.

Restart mechanism in OpenSearch clusters

The first thing to keep in mind is that whenever a single OpenSearch node stops running, whether on purpose or by accident (a crash), OpenSearch waits one minute by default and then starts recovering that node’s data from the other nodes. In other words, if the stopped node doesn’t return within that time frame, the data recovery process begins. OpenSearch is designed to have no single point of failure (SPOF), and this behavior is what maintains high availability in OpenSearch clusters.
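Before starting a rolling restart, it’s worth confirming that the cluster is healthy and that all shards are assigned. A simple baseline check (assuming you can reach the cluster’s REST API) is:

GET _cluster/health

A Green status with zero unassigned shards means you are starting from a safe state.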

How to restart nodes without any issues 

*Throughout the steps ahead, keep in mind that whenever we change configuration on the cluster level, we should always be sure to change it back to the default afterwards to maintain high availability of the cluster. 

You can prepare your cluster for a restart and reload new configuration without downtime by following these steps:

  1. Adjust “node_left” timeout

  2. Disable shard allocation

  3. Change the configuration as needed

  4. Restart the OpenSearch node

  5. Reset the settings to default

  6. Repeat the steps

Step 1: Adjust “node_left” timeout 

This setting defaults to 1 minute, which means that if an OpenSearch node leaves the cluster for any reason (network issues, a manual restart, etc.) for more than 1 minute, the cluster considers the node lost and begins recovering its data from the other available nodes, incurring unneeded data transfer costs.

You can avoid this type of issue by raising the node_left timeout on your indices. Changing it from 1 minute to 5 minutes will usually give you ample time to restart your node: the cluster will wait 5 minutes for the node to rejoin, giving you time to restart as needed.

You can use this command to change the node_left option: 

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}

*Note: you can use any supported time unit that you need (for example s, m, or h).
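If you want to double-check that the new timeout was applied, you can read the setting back. This is just a quick sanity check using the get index settings API:

GET _all/_settings/index.unassigned.node_left.delayed_timeout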

Step 2: Disable shard allocation

Shard allocation is another setting that you have to adjust before starting any restart process in your cluster.

If you do not disable shard allocation and your node does not rejoin the cluster within the timeout chosen above, shard allocation is the process that starts recovering the lost node’s data onto other nodes. Disabling it simply switches off that next step in the process, so that no unwanted shard movement occurs while you restart.

After configuration changes have been applied and nodes have been restarted, don’t forget to change all settings back to their original state so as not to ruin your data availability.

*Note: while a node is down, your OpenSearch cluster status might change from Green to Yellow or Red because of unassigned shards.

You can use this command to disable allocation:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}
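You can confirm that allocation is now disabled by reading the cluster settings back; the response should include the value you just set:

GET _cluster/settings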

OpenSearch Status

These are the cluster statuses you may encounter: 

Unassigned primary shards = Red Status
Unassigned replica shards = Yellow Status
All shards assigned = Green Status
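If the status turns Yellow or Red during the restart, you can list the unassigned shards and the reason they are unassigned. A possible check using the _cat API (the column list shown here is optional):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason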

Step 3: Change the configuration as needed

In this step you can change any option that you need in the opensearch.yml file, such as adding repositories or other options. Be careful with configuration changes to make sure they won’t cause the node to fail to start.
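For example, a change that registers a snapshot repository path might look like this in opensearch.yml (the path below is a hypothetical value, use whatever location fits your environment):

path.repo: ["/mnt/snapshots"]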

You can check the OpenSearch log every time you restart the service to make sure everything is sound: 

/var/log/opensearch/opensearch.log
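For example, you can follow the log while the node starts back up (assuming the default package installation paths; adjust the path if your installation differs):

tail -f /var/log/opensearch/opensearch.log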

Step 4: Restart the OpenSearch node

After completing the steps above, you can restart the OpenSearch node to reload your new configuration. 

systemctl restart opensearch
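Once the service is up, it’s a good idea to confirm that the node has rejoined the cluster before continuing. The restarted node should appear in the node list:

GET _cat/nodes?v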

Step 5: Reset the settings to default

Whenever we change configuration on the cluster level, we should always make sure to change it back to the default in order to keep the OpenSearch cluster highly available. After restarting the OpenSearch node successfully, you can reset all configurations back to the default. 

Reset the configurations as below: 

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": null
  }
}
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "all"
  }
}
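Before moving on to the next node, you can wait for the cluster to return to Green. The wait_for_status parameter blocks until the requested status is reached or the timeout expires:

GET _cluster/health?wait_for_status=green&timeout=60s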

Step 6: Repeat the steps

Repeat as needed to change the configuration of other nodes.
