Elasticsearch High Disk Watermark

Opster Team

March 2021


Overview

Elasticsearch applies several disk “watermark” thresholds to each node. As a node's disk fills up, the first threshold to be crossed is the “low disk watermark”, after which Elasticsearch generally stops allocating new shards to that node. The second threshold is the “high disk watermark”: once a node passes it, Elasticsearch tries to relocate shards away from that node to other nodes in the cluster.

How to resolve this issue

Passing the high disk watermark is a warning: act promptly, before the next threshold, flood_stage, is reached. At flood stage Elasticsearch blocks writes to every index that has a shard on the affected node. Here are possible actions you can take to resolve the issue:

  • Delete old indices
  • Remove documents from existing indices
  • Reduce the number of replicas (on older indices)
  • Increase disk space on all nodes
  • Add new nodes to the cluster

Although you may be reluctant to delete data, in a logging system it is often better to delete old indices (which you can restore later if a snapshot exists) than to lose new data. However, this decision depends on the architecture of your system and the queueing mechanisms you have available.
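
As a sketch of the first and third options above, the requests below delete one old index and drop the replicas on a group of older indices. The index names logs-2021.01.01 and logs-2021.02.* are hypothetical; substitute your own, and bear in mind that a deleted index can only be recovered from a snapshot.

```console
# Delete a single old index (hypothetical name)
DELETE /logs-2021.01.01

# Keep only the primary shards on older indices (hypothetical pattern)
PUT /logs-2021.02.*/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
```

Setting number_of_replicas to 0 removes redundancy for those indices, so it suits data that can be restored or re-indexed if a node is lost.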

Check the disk space on each node

You can see the space you have available on each node by running:

GET _nodes/stats/fs
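
For a more compact view, the cat allocation API reports disk usage per node in a single table. A minimal example; the columns to look at are disk.percent and disk.avail:

```console
GET _cat/allocation?v
```

Nodes whose disk.percent is above the high watermark (90% by default) are the ones Elasticsearch will try to move shards away from.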

Check if the cluster is rebalancing

If the high disk watermark has been passed, Elasticsearch should start relocating shards from that node to other nodes that are still below the low watermark. You can check whether any relocation is in progress by calling the cluster health API; a non-zero relocating_shards value in the response means shards are being moved:

GET _cluster/health
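
To see the individual shard movements rather than just a count, the cat recovery API can list the recoveries currently in progress, which include relocations. A minimal example:

```console
GET _cat/recovery?v&active_only=true
```

An empty table means no shard is being moved or recovered at the moment.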

If you expect the cluster to be relocating shards to other nodes but it is not, other allocation rules are probably preventing it. The most likely causes are:

  • The other nodes are already above the low disk watermark
  • There are cluster allocation rules which govern the distribution of shards between nodes and conflict with the rebalancing requirements (e.g. zone-based allocation awareness).
  • There are already too many rebalancing operations in progress
  • The other nodes already hold a copy (primary or replica) of each shard that could be moved. Elasticsearch never places two copies of the same shard on one node.
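
To find out which of these causes applies, the cluster allocation explain API reports why a given shard can or cannot move. A minimal sketch, assuming an index named my-index (hypothetical) whose primary shard 0 sits on the full node:

```console
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
```

In the response, each entry under node_allocation_decisions lists the deciders that rejected a target node, for example disk_threshold, awareness or same_shard.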

Check the cluster settings

You can see the settings you have applied with this command:

GET _cluster/settings

If they are not appropriate, you can modify them with a command such as the following:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.info.update.interval": "1m"
  }
}

Note: Thresholds can be specified either as percentages or as byte values, but percentages are more flexible and easier to maintain (for example when nodes have different disk sizes, as in hot/warm deployments).
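
For comparison, here is a sketch of the same thresholds expressed as byte values. Byte values refer to the free space remaining rather than the space used, so the numbers decrease from low to flood_stage. The sizes are illustrative rather than recommendations, and percentage and byte values should not be mixed across these settings.

```console
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb"
  }
}
```

Replacing "transient" with "persistent" keeps the settings across a full cluster restart.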

How to prevent

Several mechanisms help keep disk usage under control, either by removing stale data automatically or by making it quicker to add capacity:

  1. Apply ILM (Index Lifecycle Management)

    With ILM, Elasticsearch can automatically delete an index once it reaches a given age.

  2. Use date-based indices

    If your application uses date-based indices, it is easy to delete old indices using a script, ILM or a tool such as Elasticsearch Curator.

  3. Use snapshots to store data offline

    It may be appropriate to store snapshotted data offline and restore it in the event that the archived data needs to be reviewed or studied.

  4. Automate or simplify the process of adding new data nodes

    Use automation tools such as Terraform to automate the addition of new nodes to the cluster. If this is not possible, at the very least ensure you have a clearly documented process for creating new nodes, adding TLS certificates and configuration, and bringing them into the Elasticsearch cluster in a short and predictable time frame.
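
As an illustration of the first mechanism, a minimal ILM policy might roll an index over and delete it some time later. The policy name logs-cleanup and the 7d, 50gb and 30d values are assumptions chosen for the example; the policy still has to be attached to your indices, typically through an index template.

```console
PUT _ilm/policy/logs-cleanup
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "7d",
            "max_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

When rollover is used, min_age in the delete phase is measured from the time of rollover rather than from index creation.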
