Elasticsearch OpenSearch Yellow Status

By Opster Team

Updated: Jun 27, 2023 | 4 min read

Overview

Yellow status indicates that one or more of the replica shards on the OpenSearch cluster are not allocated to a node. No need to panic! There are several reasons why a yellow status can be perfectly normal, and in many cases OpenSearch will recover to green by itself, so the worst thing you can do is start tweaking things without knowing exactly what the cause is. While the status is yellow, search and indexing operations remain available.

How to resolve

Why does an OpenSearch cluster indicate a yellow status?

There are several reasons why your OpenSearch cluster could indicate a yellow status:
1. You only have 1 node
2. You have restarted a node
3. Node crashes
4. Networking issues
5. Disk space issues
6. Node allocation awareness
7. Shard has exceeded the maximum number of retries

1. You only have 1 node

(Or the number of replicas >= the number of nodes)

OpenSearch will never assign a replica to the same node as the primary shard, so if you only have one node it is perfectly normal and expected for your cluster to indicate yellow. If you would rather see green, change the number of replicas on each index to 0.

PUT /my-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}

Similarly, if the number of replicas is equal to or greater than the number of nodes, then it will not be possible to allocate one or more of the shards for the same reason.
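
A quick way to confirm this situation is to compare each index's replica count with the number of data nodes. A minimal sketch using the cat APIs (the h= column selection shown here may vary slightly between versions):

GET _cat/nodes?v&h=name,node.role

GET _cat/indices?v&h=index,health,pri,rep

Any index whose rep value is greater than or equal to the number of data nodes will stay yellow until you either add nodes or lower number_of_replicas.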

2. You have restarted a node

If you have temporarily restarted a node, then normally no action is necessary, as OpenSearch will recover the shards automatically and recover to a green status. You can monitor this process by calling:

GET _cluster/health 

You will see that the number of unassigned shards progressively decreases until green status is reached.
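
If you want to see which shards are still being recovered, rather than just the overall counts, the cat shards API can list each shard with its state; a sketch (the unassigned.reason column assumes a reasonably recent version):

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason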

However if you see that this process is occurring repeatedly, then some other issue is causing the cluster to become unstable and requires investigation.

3. Node crashes

If nodes become overwhelmed or stop operating for any reason, the first symptom will probably be that the cluster turns yellow or red as shards fall out of sync. Nodes can disconnect after long GC pauses, which are typically caused by “out of memory” errors or high memory demand from heavy searches.
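
To check whether memory pressure or long GC pauses are the likely cause, you can look at the per-node JVM statistics, for example:

GET _nodes/stats/jvm

In the response, a consistently high heap_used_percent and rapidly growing old-generation collection times are typical signs that a node is struggling before it drops out of the cluster.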

4. Networking issues

If nodes are not able to reach each other reliably, then the nodes will lose contact with one another and shards will get out of sync resulting in a red or yellow status. You may be able to detect this situation by finding repeated messages in the logs about nodes leaving or rejoining the cluster.
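
A quick sanity check is to list the nodes that the cluster currently sees and confirm that none of the expected nodes are missing:

GET _cat/nodes?v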

5. Disk space issues

Insufficient disk space may prevent OpenSearch from allocating a shard to a node. Typically this will happen when disk utilization goes above the setting below:

cluster.routing.allocation.disk.watermark.low

Here the solution requires deleting indices, increasing disk size, or adding a new node to the cluster.  Of course you can also temporarily increase the watermark to keep things running while you decide what to do, but just putting off the decision until later is not the best course of action.

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.info.update.interval": "1m"
  }
}
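
Before changing anything, it is worth checking how much disk each node is actually using; the cat allocation API gives a per-node summary, including a disk.percent column:

GET _cat/allocation?v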

You can also get:

cannot allocate because allocation is not permitted to any of the nodes

Typically this happens when a node's disk utilization goes above the flood stage watermark, creating a write block on the cluster. As above, you must delete data or add a new node. You can buy time with:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.info.update.interval": "1m"
  }
}
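
Once disk space has been freed or capacity added, recent versions remove the write block automatically when usage drops back below the flood stage; on older versions you may need to clear the block manually on the affected indices. A sketch, where my-index is just a placeholder:

PUT /my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}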

6. Node allocation awareness

Sometimes there may be specific issues with the allocation rules that have been created on the cluster which prevent the cluster from allocating shards. For example, it is possible to create rules that require that a shard’s replicas be spread over a specific set of nodes (“allocation awareness”), such as AWS availability zones or different host machines in a Kubernetes setup. On occasion, these rules may conflict with other rules (such as disk space) and prevent shards from being allocated.
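
For reference, forced awareness rules usually look something like the sketch below, which tells the cluster to spread shard copies across a node attribute called zone (“zone”, “zone-a” and “zone-b” are just example names):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-a,zone-b"
  }
}

With forced awareness in place, if all the nodes of one zone are unavailable, the replicas destined for that zone remain unassigned and the cluster stays yellow.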

Find the cause of non-allocation

You can use the cluster allocation explain API:

GET /_cluster/allocation/explain

By running the above command, you will get an explanation of the allocation status of the first unallocated shard found.

{
  "index" : "my_index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-01-04T18:53:59.498Z",
    "details" : "node_left[G92ZwuuaRY-9n8_tc-IzEg]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "allocation_delayed",
  "allocate_explanation" : "cannot allocate because the cluster is still waiting 59.8s for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",
  "configured_delay" : "1m",                      
  "configured_delay_in_millis" : 60000,
  "remaining_delay" : "59.8s",                    
  "remaining_delay_in_millis" : 59824,
  "node_allocation_decisions" : [
    {
      "node_id" : "pmnHu_ooQWCPEFobZGbpWw",
      "node_name" : "node_t2",
      "transport_address" : "127.0.0.1:9402",
      "node_decision" : "yes"
    },
    {
      "node_id" : "3sULLVJrRneSg0EfBB-2Ew",
      "node_name" : "node_t0",
      "transport_address" : "127.0.0.1:9400",
      "node_decision" : "no",
      "store" : {                                 
        "matching_size" : "4.2kb",
        "matching_size_in_bytes" : 4325
      },
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[my_index][0], node[3sULLVJrRneSg0EfBB-2Ew], [P], s[STARTED], a[id=eV9P8BN1QPqRc3B4PLx6cg]]"
        }
      ]
    }
  ]
}

The above API returns:

“unassigned_info” => The reason why the shard became unassigned.

“node_allocation_decisions” => A list of explanations for each node, indicating whether it could potentially receive the shard.

“deciders” => The decision and the explanation behind that decision.
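
If you need to explain a specific shard rather than the first unassigned one found, the same API accepts a request body naming the shard (my_index here matches the example output above and is just a placeholder):

GET _cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": false
}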

7. Shard has exceeded the maximum number of retries

If the error contains: “shard has exceeded the maximum number of retries [5] on failed allocation attempts – manually call [/_cluster/reroute?retry_failed=true] to retry”, you need to trigger an allocation retry by running:

curl -XPOST localhost:9200/_cluster/reroute?retry_failed=true
