Elasticsearch Oversharding

By Opster Team

Updated: Mar 10, 2024 | 3 min read

Overview

Oversharding is a state in which a cluster has too many shards, each of which is consequently too small. While there is no minimum shard size in Elasticsearch, having a large number of shards on a cluster requires extra resources, since the cluster must maintain metadata on the state of every shard in the cluster.

While there is no absolute limit, as a guideline the ideal shard size is between a few GB and a few tens of GB. You can learn more in the official Elasticsearch scalability documentation.

This issue should be considered in combination with the opposite problem, “Shards Too Big”.

To learn about search latency problems related to shards that are too small, go to Opster’s search latency guide.

How to resolve the issue

If your shards are too small, then you have three options:

Eliminate empty indices

Although empty indices don’t use up disk space, Elasticsearch still has to keep their shard locations and mappings in memory, so it is good practice to regularly detect and delete empty indices on your Elasticsearch cluster. You can use a script to run:

GET _cat/indices?h=index,creation.date,docs.count,health&format=json

This will provide you with a list of indices in JSON format, which you can use to check size, age and usage in order to delete empty indices accordingly.
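
If you want empty indices to appear first in the list, the cat APIs also accept a sort parameter. Once you have confirmed that an index is empty (a docs.count of 0) and is no longer needed, you can delete it by name; the index name below is a hypothetical placeholder:

GET _cat/indices?h=index,creation.date,docs.count&s=docs.count:asc&format=json

DELETE my-empty-index-2023.01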

Delete or close indices with old or unnecessary data

DELETE myindex-2014*

POST myindex-2014*/_close

Closing certain indices is often a good quick option to get your cluster running smoothly again while you decide on the best long-term solution. Closing an index releases the memory resources it uses while keeping its data on disk, and is easily and quickly reversed (POST myindex/_open).

Re-index into bigger indices

The following would reindex multiple daily or monthly indices into one single index for the year. Make sure you don’t cause the opposite problem (shards too large) by adjusting the number of shards on the new index according to the total data volume, so that no shard on the new index exceeds 50GB.

POST _reindex
{
  "source": {
    "index": "logs-2017*"
  },
  "dest": {
    "index": "logs-2017"
  }
}
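
Rather than letting the reindex auto-create the destination index with default settings, you may want to create it explicitly first so that you control the shard count. The shard count of 2 below is an assumption to be sized against your actual yearly data volume, not a recommendation. Note that if you pre-create the destination index, its name must not be matched by the source wildcard (for example, use a source pattern of logs-2017.* if your daily indices are named logs-2017.MM.dd), otherwise the reindex will refuse to write into an index it is reading from:

PUT logs-2017
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}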

Notes to keep in mind

Watch out for field explosions and mappings

When combining smaller indices into a single larger index, remember that the mapping of the new index will contain all of the fields from all of the constituent indices. For this reason, make sure that the indices you are combining have largely similar mappings.
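
A quick way to check is to pull the mappings of all the indices you plan to combine and look for conflicting or unexpected fields (the index pattern below follows the earlier example):

GET logs-2017*/_mapping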

Watch out for duplicate data during the reindex process

If your application uses wildcards in the index name (e.g. logs-*), then results may be duplicated during the reindex process (please refer to our reindexing tips), since you will have two copies of the data until the process completes and the old indices are deleted. This can be avoided through the use of aliases.
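
As a minimal sketch of the alias approach: have the application search via an alias (the alias name logs and the daily index naming below are illustrative assumptions), and once the reindex completes, swap the alias from the old indices to the new one in a single atomic _aliases call, so searches never see both copies at once:

POST _aliases
{
  "actions": [
    { "remove": { "index": "logs-2017.*", "alias": "logs" } },
    { "add": { "index": "logs-2017", "alias": "logs" } }
  ]
}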

Watch out for modifications and updates during the reindex process

If data is susceptible to updates during the reindex process, then a procedure needs to be defined to capture them and ensure that the new indices reflect all modifications made during the transition period.

Watch out for disk space during the reindex process

Until the old indices are deleted, the cluster will hold two copies of the reindexed data, so make sure you have sufficient disk headroom. If nodes approach the disk-based allocation watermarks during the reindex, they can be adjusted temporarily at the cluster level:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.info.update.interval": "1m"
  }
}

How to avoid the issue

There are various mechanisms to optimize shard size, such as:

Apply ILM (index lifecycle management)

Using ILM, you can get Elasticsearch to automatically create a new index when your current index reaches a given maximum size or age. ILM is suitable for documents which require no updates, but it is not a viable option when documents require frequent updating, since to update a document you would need to know which of the previous indices contains it.
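
As a minimal sketch, a rollover-based ILM policy could look like the following; the policy name and the 50GB/30-day thresholds are illustrative assumptions, not recommendations for your cluster:

PUT _ilm/policy/logs-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

The policy then needs to be referenced from the index template of the indices you want rolled over (via the index.lifecycle.name setting).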

Change application strategy

You can design your application to ensure that indices, and therefore shards, are created at a sensible size:

Date based approach

myindex-yyyy.MM.dd

or myindex-yyyy.MM

If you are suffering from oversharding, then you will most likely need to move from daily indices to monthly or yearly indices.
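
As a simple illustration of the monthly approach, the application computes the index name from the event’s timestamp at write time (the index name and fields below are hypothetical):

PUT myindex-2024.03/_doc/1
{
  "message": "example log line",
  "@timestamp": "2024-03-10T12:00:00Z"
}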

Attribute based approach

myindex-<clientID>

This has the advantage of allowing you to know exactly which index to search/update if you know the client ID. 

The disadvantage is that different clients may produce very different volumes of documents (and hence very different shard sizes). This can be mitigated to some extent by adjusting the number of shards per client, but since each client index carries at least one shard, a large number of clients can land you back at the original oversharding problem.
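
For example, a low-volume client could be given a single primary shard while a heavier client gets more; the client name and settings below are hypothetical:

PUT myindex-client42
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}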

ID range based approach

This may be possible where documents carry an ID from another system or application (such as a SQL ID or a client ID) which enables us to distribute documents evenly, e.g.:

myindex-<last_digit_of_ID>

The above would give us 10 indices whose data will probably be evenly distributed, and which are also deterministic (we know exactly which index we need to update).
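
As a hypothetical illustration, a document with ID 12347 ends in 7, so the application writes it to (and later updates it in) myindex-7:

PUT myindex-7/_doc/12347
{
  "client_id": 12347,
  "status": "active"
}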
