Elasticsearch Status Yellow

Opster Team

March 2021


Overview

Yellow status means that all primary shards are allocated, but one or more replica shards on the Elasticsearch cluster are not allocated to a node. No need to panic! A yellow status can be perfectly normal for several reasons, and in many cases Elasticsearch will recover to green by itself, so the worst thing you can do is start tweaking settings without knowing exactly what the cause is. While the status is yellow, search and index operations are still available.

How to resolve

There are several reasons why your Elasticsearch cluster could indicate a yellow status.

1. You only have 1 node

(Or number of replicas >= number of nodes)

Elasticsearch will never assign a replica to the same node as its primary shard, so if you only have one node it is perfectly normal and expected for your cluster to indicate yellow. If you would rather see green, set the number of replicas on each index to 0.

PUT /my-index/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}

Similarly, if the number of replicas is equal to or greater than the number of nodes, at least one copy of each shard cannot be allocated, for the same reason.
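The arithmetic behind this rule can be sketched in a few lines of Python. This is only an illustration: the function name `unassigned_replicas_per_shard` is hypothetical, and the sketch assumes that no other allocation rule (disk space, awareness, filtering) gets in the way.

```python
def unassigned_replicas_per_shard(data_nodes: int, number_of_replicas: int) -> int:
    """Replica copies of one shard that can never be placed.

    Every shard has 1 primary plus `number_of_replicas` replicas, and no two
    copies of the same shard may share a node.
    """
    if data_nodes < 1:
        raise ValueError("need at least one data node")
    copies_wanted = 1 + number_of_replicas
    copies_placeable = min(copies_wanted, data_nodes)
    return copies_wanted - copies_placeable


for nodes, replicas in [(1, 1), (3, 2), (3, 3)]:
    missing = unassigned_replicas_per_shard(nodes, replicas)
    outcome = "stays yellow" if missing else "can reach green"
    print(f"{nodes} node(s), {replicas} replica(s): "
          f"{missing} unassigned per shard, {outcome}")
```

With one node and one replica, the single replica of every shard has nowhere to go, which is exactly the single-node case described above.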

2. You have restarted a node

If you have restarted a node, then normally no action is necessary, as Elasticsearch will reallocate the shards automatically and return to green status. You can monitor this process by calling:

GET _cluster/health

You will see the number of unassigned shards fall progressively until green status is reached.

However, if this happens repeatedly, some other issue is making the cluster unstable and needs investigation.
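If you prefer to watch the recovery from a script, the following is a minimal sketch that polls the health endpoint. It assumes an unsecured cluster reachable at http://localhost:9200; the helper names `summarize_health` and `wait_for_green`, the polling interval, and the poll limit are all illustrative choices.

```python
import json
import time
import urllib.request


def summarize_health(health: dict) -> str:
    """Condense a _cluster/health response into one line."""
    return (f"status={health['status']} "
            f"unassigned={health['unassigned_shards']} "
            f"initializing={health['initializing_shards']}")


def wait_for_green(base_url: str = "http://localhost:9200",
                   interval_s: float = 10.0,
                   max_polls: int = 30) -> bool:
    """Poll cluster health until it is green or max_polls is reached."""
    for _ in range(max_polls):
        with urllib.request.urlopen(f"{base_url}/_cluster/health") as resp:
            health = json.load(resp)
        print(summarize_health(health))
        if health["status"] == "green":
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_for_green()` prints one line per poll and returns False if the cluster has not turned green within the poll limit, which is a hint that one of the causes below applies.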

3. Node crashes

If nodes become overwhelmed or stop operating for any reason, the first symptom will probably be that the cluster status becomes yellow or red as shards fail to sync. Nodes can disconnect because of long garbage collection (GC) pauses, which are typically caused by high memory demand from heavy searches, or fail outright with “out of memory” errors.

4. Networking issues

If nodes cannot reach each other reliably, they will lose contact with one another and shards will get out of sync, resulting in a red or yellow status. You may be able to detect this situation by looking for repeated log messages about nodes leaving or rejoining the cluster.

5. Disk space issues

Insufficient disk space may prevent Elasticsearch from allocating a shard to a node. Typically this happens when disk utilization on the node rises above the low watermark, which is controlled by the setting below (85% by default):

cluster.routing.allocation.disk.watermark.low

The lasting solution is to delete indices, increase disk size, or add a new node to the cluster. You can also raise the watermark temporarily to keep things running while you decide what to do, but simply postponing the decision is not the best course of action. Choose a value above the one currently in effect (the 85% shown below is the Elasticsearch default).

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.info.update.interval": "1m"
  }
}

The allocation explanation may also contain:

cannot allocate because allocation is not permitted to any of the nodes

This can happen when disk utilization on a node goes above the flood stage watermark (95% by default), at which point Elasticsearch also places a read-only block on every index that has a shard on that node. As above, you must delete data or add a new node. You can buy time with:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%",
    "cluster.info.update.interval": "1m"
  }
}
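To make the thresholds concrete, here is a small sketch that classifies a node's disk usage against the watermarks. It assumes the Elasticsearch defaults of 85% (low), 90% (high) and 95% (flood stage); the high watermark is not discussed above and is included only for completeness, and the function name `watermark_stage` is hypothetical.

```python
def watermark_stage(used_percent: float,
                    low: float = 85.0,
                    high: float = 90.0,
                    flood_stage: float = 95.0) -> str:
    """Return the highest disk watermark that the usage figure exceeds.

    "ok"          -> shards can still be allocated to the node
    "low"         -> no new replica shards are allocated to the node
    "high"        -> Elasticsearch tries to move shards off the node
    "flood_stage" -> indices with a shard on the node become read-only
    """
    if used_percent > flood_stage:
        return "flood_stage"
    if used_percent > high:
        return "high"
    if used_percent > low:
        return "low"
    return "ok"


for used in (80.0, 87.5, 92.0, 96.0):
    print(f"{used}% used -> {watermark_stage(used)}")
```

If you have changed the watermarks with the settings calls above, pass the new values as arguments so the classification matches your cluster.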

6. Node allocation awareness

Sometimes the allocation rules that have been created on the cluster prevent it from allocating shards. For example, it is possible to create rules that require a shard’s copies to be spread over a specific set of nodes (“allocation awareness”), such as AWS availability zones or different host machines in a Kubernetes setup. On occasion, these rules may conflict with other rules (such as disk space) and prevent shards from being allocated.

Find the cause of non-allocation

You can use the cluster allocation explain API:

GET /_cluster/allocation/explain

Run without a request body, this command returns an explanation of the allocation status of the first unassigned shard it finds.

{
  "index" : "my_index",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2017-01-04T18:53:59.498Z",
    "details" : "node_left[G92ZwuuaRY-9n8_tc-IzEg]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "allocation_delayed",
  "allocate_explanation" : "cannot allocate because the cluster is still waiting 59.8s for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",
  "configured_delay" : "1m",
  "configured_delay_in_millis" : 60000,
  "remaining_delay" : "59.8s",
  "remaining_delay_in_millis" : 59824,
  "node_allocation_decisions" : [
    {
      "node_id" : "pmnHu_ooQWCPEFobZGbpWw",
      "node_name" : "node_t2",
      "transport_address" : "127.0.0.1:9402",
      "node_decision" : "yes"
    },
    {
      "node_id" : "3sULLVJrRneSg0EfBB-2Ew",
      "node_name" : "node_t0",
      "transport_address" : "127.0.0.1:9400",
      "node_decision" : "no",
      "store" : {
        "matching_size" : "4.2kb",
        "matching_size_in_bytes" : 4325
      },
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[my_index][0], node[3sULLVJrRneSg0EfBB-2Ew], [P], s[STARTED], a[id=eV9P8BN1QPqRc3B4PLx6cg]]"
        }
      ]
    }
  ]
}

The response includes:

“unassigned_info” => The reason why the shard became unassigned.

“node_allocation_decisions” => A list with one entry per node, explaining whether that node could receive the shard.

“deciders” => For each node, the deciders behind that decision, each with its own decision and explanation.
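On a cluster with many nodes the response gets long, so it can help to pull out only the deciders that said NO. The sketch below does this for a response that has already been parsed into a Python dict; the function name `blocking_deciders` is hypothetical, and the sample is a trimmed copy of the response shown above with the explanation text shortened.

```python
def blocking_deciders(explain: dict) -> dict:
    """Map node name -> [(decider, explanation), ...] for every NO decision."""
    blocked = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [
            (d["decider"], d["explanation"])
            for d in node.get("deciders", [])
            if d["decision"] == "NO"
        ]
        if reasons:
            blocked[node["node_name"]] = reasons
    return blocked


# Trimmed from the example response above.
sample = {
    "index": "my_index",
    "shard": 0,
    "primary": False,
    "node_allocation_decisions": [
        {"node_name": "node_t2", "node_decision": "yes"},
        {
            "node_name": "node_t0",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "the shard cannot be allocated to the same "
                                   "node on which a copy of the shard already exists",
                }
            ],
        },
    ],
}

for node_name, reasons in blocking_deciders(sample).items():
    for decider, explanation in reasons:
        print(f"{node_name}: {decider}: {explanation}")
```

Here only node_t0 is reported, blocked by the same_shard decider, which matches the single-node rule from the first section.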

7. Shard has exceeded the maximum number of retries

If the explanation contains “shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry”, fix the underlying cause and then trigger an allocation retry with:

curl -XPOST "localhost:9200/_cluster/reroute?retry_failed=true"
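The same retry can be issued from Python using only the standard library. This is a sketch that assumes an unsecured cluster at http://localhost:9200; the helper name `build_retry_request` is hypothetical, and the request is only built here, not sent.

```python
import urllib.request


def build_retry_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build a POST to _cluster/reroute that retries failed allocations."""
    url = f"{base_url.rstrip('/')}/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


req = build_retry_request()
print(req.get_method(), req.full_url)
# To send it against a live cluster:
#     urllib.request.urlopen(req)
```

Keeping the request construction separate from the call makes it easy to check the URL before anything is sent to the cluster.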

