Introduction

Disk management is important in any database, and Elasticsearch is no exception. If you don’t have enough disk space available, Elasticsearch will stop allocating shards to the node. This will eventually prevent you from being able to write data to the cluster, with the potential risk of data loss in your application. On the other hand, if you have too much disk space, then you are paying for more resources than you need.

Background on watermarks

Elasticsearch uses several “watermark” thresholds to track the available disk space on each node. As the disk fills up on a node, the first threshold to be crossed is the “low disk watermark” (85% disk usage by default), after which Elasticsearch stops allocating new shards to that node. The second threshold is the “high disk watermark” (90% by default), at which point Elasticsearch tries to relocate shards away from the node. Finally, the “disk flood stage” watermark (95% by default) is reached. Once this threshold is passed, the cluster blocks writes to ALL indices that have at least one shard (primary or replica) on the node that has passed the watermark. Reads (searches) are still possible.
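To make the three thresholds concrete, the following minimal sketch classifies a node's disk usage against them. It assumes the default percentage values (85%, 90% and 95%). The function name `watermark_state` is hypothetical, and your cluster may use different values set through the `cluster.routing.allocation.disk.watermark.*` settings.

```python
# Assumed defaults, expressed as percentage of disk used:
#   cluster.routing.allocation.disk.watermark.low         -> 85
#   cluster.routing.allocation.disk.watermark.high        -> 90
#   cluster.routing.allocation.disk.watermark.flood_stage -> 95
LOW, HIGH, FLOOD = 85.0, 90.0, 95.0


def watermark_state(used_percent, low=LOW, high=HIGH, flood=FLOOD):
    """Return which watermark (if any) a node has exceeded."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent > flood:
        return "flood_stage"  # writes blocked on indices with a shard here
    if used_percent > high:
        return "high"         # shards are relocated away from the node
    if used_percent > low:
        return "low"          # no new shards are allocated to the node
    return "ok"


for usage in (70.0, 87.5, 92.0, 97.0):
    print(usage, watermark_state(usage))
```

This is a simplification: it treats the thresholds purely as percentages and ignores any absolute-size or headroom settings your version may support.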

How to prevent and handle cases when the disk is too full (overutilization)

There are various methods for handling cases when your Elasticsearch disk is too full:

1. Delete old data – Usually, data should not be kept indefinitely. One way to prevent and resolve a full disk is to ensure that when data reaches a certain age, it is reliably archived and deleted. Index lifecycle management (ILM) can automate this.
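As an illustration of age-based deletion with ILM, the sketch below builds the JSON body for a `PUT _ilm/policy/<policy_name>` request. The function name, the 30-day retention and the daily rollover are assumptions chosen for the example, not recommendations.

```python
import json


def delete_after_policy(retention_days, rollover_age="1d"):
    """Build an ILM policy body: roll over in the hot phase, then delete.

    With rollover enabled, the delete phase's min_age is counted from the
    time the index was rolled over, not from index creation.
    """
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_age": rollover_age}}
                },
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Hypothetical example: keep data for 30 days after rollover.
print(json.dumps(delete_after_policy(30), indent=2))
```

The policy only takes effect on indices that reference it, typically through an index template.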

2. Add storage capacity – If you cannot delete the data, you might want to add more data nodes or increase the disk sizes in order to retain all the data without negatively affecting performance. When adding storage capacity to the cluster, consider whether you need storage capacity alone, or storage capacity together with RAM and CPU resources in proportion (see the section “The relationship between Disk Size, RAM and CPU” below).

How to add storage capacity to your Elasticsearch cluster:

Increase the number of data nodes – Remember that the new nodes should be the same size as existing nodes and run the same Elasticsearch version.
Increase the size of existing nodes – In cloud-based environments, it is usually easy to increase disk size and RAM/CPU on existing nodes.
Increase only the disk size – In cloud-based environments, it is often relatively easy to increase disk size.

3. Snapshot and restore – If it is acceptable for old data to be retrieved from backups on request (ideally through an automated process), you can snapshot old indices, delete them, and temporarily restore data from the snapshots when it is requested.
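The snapshot, delete and restore sequence can be sketched as a list of REST requests. The code below only builds the requests and does not send them. The helper names, the repository name `archive_repo` and the index name `logs-2023.01` are hypothetical, and the snapshot repository is assumed to be registered already.

```python
def archive_requests(repository, snapshot, index):
    """Return (method, path, body) tuples in the order they should be sent.

    Only send the DELETE once the snapshot request has reported success.
    """
    body = {"indices": index, "include_global_state": False}
    return [
        ("PUT", f"/_snapshot/{repository}/{snapshot}?wait_for_completion=true", body),
        ("DELETE", f"/{index}", None),
    ]


def restore_request(repository, snapshot, index):
    """Return the request that restores one index from the snapshot."""
    body = {"indices": index, "include_global_state": False}
    return ("POST", f"/_snapshot/{repository}/{snapshot}/_restore", body)


for request in archive_requests("archive_repo", "logs-2023.01-snap", "logs-2023.01"):
    print(request)
print(restore_request("archive_repo", "logs-2023.01-snap", "logs-2023.01"))
```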

4. Reduce replicas per shard – Another way to reduce stored data is to reduce the number of replicas of each shard. For high availability, you would normally want at least one replica per shard, but as data grows older, you might be able to work without replicas. This usually works if the data is also stored persistently in another system, or if you have a backup to restore from if needed.
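The effect of replicas on disk usage is simple arithmetic: every replica is a full copy of the primary data. The sketch below shows the estimate alongside the body for a `PUT /<index>/_settings` request. The function names and the 100 GB figure are illustrative assumptions.

```python
def replica_settings(replicas):
    """Body for PUT /<index>/_settings to change the replica count."""
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    return {"index": {"number_of_replicas": replicas}}


def total_store_gb(primary_gb, replicas):
    """Approximate disk used by an index: primaries plus all replica copies."""
    return primary_gb * (1 + replicas)


# Hypothetical index with 100 GB of primary data.
before = total_store_gb(100, replicas=1)
after = total_store_gb(100, replicas=0)
print(replica_settings(0))
print(f"saved: {before - after} GB")
```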

5. Create alerts – In order to prevent disks from filling up in the future and act proactively, you should create alerts based on disk usage that will notify you when the disk starts filling up.
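A minimal sketch of such an alert check is shown below. It assumes the input is the parsed output of `GET _cat/allocation?format=json`, with `node` and `disk.percent` fields. The 80% threshold, the function name and the sample rows are hypothetical, so verify the field names against your own cluster's output.

```python
def nodes_over_threshold(allocation_rows, threshold_percent=80.0):
    """Return (node, percent) pairs at or above the threshold, fullest first."""
    alerts = []
    for row in allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            # Rows without disk figures (e.g. unassigned shards) are skipped.
            continue
        if float(percent) >= threshold_percent:
            alerts.append((row["node"], float(percent)))
    return sorted(alerts, key=lambda item: item[1], reverse=True)


# Hypothetical rows for illustration only.
sample = [
    {"node": "node-1", "disk.percent": "62"},
    {"node": "node-2", "disk.percent": "83"},
    {"node": "UNASSIGNED", "disk.percent": None},
]
print(nodes_over_threshold(sample))
```

Setting the alert threshold below the low disk watermark leaves time to act before shard allocation is affected.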

AutoOps proactively monitors, alerts and recommends how to resolve these cases before they become an incident.

How to prevent and handle cases when the disk capacity is underutilized

If your disk capacity is underutilized, there are various options to reduce the storage volume on your cluster.

How to reduce the storage volume on an Elasticsearch cluster

The main methods for reducing the storage volume of a cluster are described below.

1. Reduce the number of data nodes –

If you want to reduce data storage, and also reduce RAM and CPU resources in the same proportion, then this is the easiest strategy. Decommissioning unnecessary nodes is likely to provide the greatest cost savings.

Before decommissioning the node, you should:

Ensure that the node to be decommissioned is not needed as a master-eligible node. You should always have at least three nodes with the master role.
Migrate the data shards away from the node to be decommissioned.
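One common way to migrate shards away from a node is an allocation filter in the cluster settings. The sketch below only builds the request body and inspects `_cat/shards` output; it does not contact a cluster. The helper names, the node name `node-3` and the sample rows are hypothetical.

```python
def exclude_node_settings(node_name):
    """Body for PUT _cluster/settings that moves shards off `node_name`."""
    return {
        "persistent": {
            "cluster.routing.allocation.exclude._name": node_name
        }
    }


def node_is_drained(shard_rows, node_name):
    """True when no row from GET _cat/shards?format=json is on `node_name`.

    Assumption: a relocating shard reports its node as
    "<source> -> <ip> <id> <target>", so only the first token is compared.
    """
    for row in shard_rows:
        node = (row.get("node") or "").split(" ")[0]
        if node == node_name:
            return False
    return True


# Hypothetical _cat/shards rows for illustration only.
rows = [
    {"index": "logs-1", "shard": "0", "node": "node-1"},
    {"index": "logs-1", "shard": "1", "node": "node-3 -> 10.0.0.2 abc node-2"},
]
print(exclude_node_settings("node-3"))
print(node_is_drained(rows, "node-3"))
```

Once the node holds no shards it can be shut down, and the exclusion setting should then be reset to `null` so that it does not affect future nodes with the same name.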

2. Replace existing nodes with smaller nodes –

If you cannot further reduce the number of nodes (usually 3 would be a minimum configuration), then you may want to downsize existing nodes. Remember that it is advisable to ensure that all data nodes have the same RAM and disk size, since Elasticsearch balances shards on the basis of the number of shards per node.

The process would be:

Add new, smaller nodes to the cluster
Migrate the shards away from the nodes to be decommissioned
Shut down the old nodes

3. Reduce disk size on nodes –

If you ONLY want to reduce disk size on the nodes without changing the cluster’s overall RAM or CPU, then you can reduce the disk size for each node. Reducing disk size on an Elasticsearch node is not a trivial process.

The easiest way to do so would usually be to:

Migrate shards from the node
Stop the node
Mount a new data volume to the node with appropriate size
Copy all data from old disk volume to new volume
Detach old volume
Start node and migrate shards back to node

This requires that you have sufficient capacity on the other nodes to temporarily store the extra shards from the node during this process. In many cases the cost of managing this process may exceed the potential savings in disk usage. For this reason, it may be simpler to replace the node altogether with a new node with the desired disk size (see “Replace existing nodes with smaller nodes” above).
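Before draining a node, a rough capacity check along these lines can help. This is a sketch under stated assumptions: it assumes the drained data spreads evenly across the remaining nodes and uses the default low disk watermark (85%) as the limit. The function name and the figures are hypothetical.

```python
def can_absorb(nodes, leaving, limit=0.85):
    """Check whether the remaining nodes can hold the leaving node's data.

    nodes:   {node_name: (used_gb, total_gb)}
    leaving: name of the node to be drained
    limit:   maximum acceptable overall disk usage after the move
    """
    remaining = {name: disk for name, disk in nodes.items() if name != leaving}
    if not remaining:
        return False
    used_after = sum(used for used, _ in remaining.values()) + nodes[leaving][0]
    capacity = sum(total for _, total in remaining.values())
    return used_after / capacity < limit


# Hypothetical three-node cluster with 1 TB disks.
cluster = {
    "node-1": (400, 1000),
    "node-2": (400, 1000),
    "node-3": (400, 1000),
}
print(can_absorb(cluster, "node-3"))
```

Because this is an aggregate check, a single large shard or uneven shard sizes can still push an individual node over a watermark, so treat a positive result as necessary rather than sufficient.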

When paying for unnecessary resources, cost can obviously be reduced by optimizing your resource utilization. Get specific recommendations for how to optimize your own system by running Cost Insight.

The relationship between Disk Size, RAM and CPU

The ideal ratio of disk capacity to RAM in your cluster will depend on your particular use case. For this reason, when considering changes to your storage capacity, you should also consider whether your current Disk/RAM/CPU ratios are suitably balanced and, as a consequence, whether you need to add or reduce RAM and CPU in the same proportion.

RAM and CPU requirements depend on the volume of indexing activity, the number and type of queries, and also the amount of data that is being searched and aggregated. This is often in proportion to the amount of data being stored on the cluster, and therefore should also be related to disk size.

The ratio between the disk capacity and the RAM can change based on the use case. See a few examples here:

Use case	Index Activity	Retention	Search Activity	Disk Capacity	RAM
Enterprise search app	Moderate log ingestion	Long	Light	2TB	32GB
App monitoring	Intensive log ingestion	Short	Light	1TB	32GB
E-commerce	Light data indexing	Indefinite	Heavy	500GB	32GB
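The disk-to-RAM ratios implied by the examples above can be worked out as follows. This sketch assumes 1 TB = 1024 GB, and the function name is hypothetical.

```python
def disk_to_ram_ratio(disk_gb, ram_gb):
    """Return how many GB of disk there are per GB of RAM."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return disk_gb / ram_gb


# (disk in GB, RAM in GB), taken from the examples in the table.
examples = {
    "Enterprise search app": (2 * 1024, 32),
    "App monitoring": (1 * 1024, 32),
    "E-commerce": (500, 32),
}

for use_case, (disk_gb, ram_gb) in examples.items():
    print(f"{use_case}: {disk_to_ram_ratio(disk_gb, ram_gb):g}:1")
```

The heavier the search activity, the less disk each GB of RAM is expected to serve.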

Remember that modifying the configuration of node machines must be done with care, since it may involve node downtime and you need to ensure that shards do not start to migrate to your other already over-stretched nodes.