Elasticsearch Shards Too Large

Last Update: July 2020

What Does it Mean?

It is a best practice that the size of a single Elasticsearch shard should not exceed 50GB.

This limit is not directly enforced by Elasticsearch. However, if you exceed it you may find that Elasticsearch is unable to relocate or recover index shards (with the consequence of possible data loss), or you may hit the Lucene hard limit of approximately 2³¹ documents per shard (each Elasticsearch shard is a Lucene index).

How to Resolve it

If your shards are too large, you have three options:

1. Delete records from the index

If appropriate for your application, you may consider permanently deleting records from your index (for example old logs or other unnecessary records).

POST /my-index/_delete_by_query
{
    "query": {
        "range": {
            "@timestamp": {
                "lte": "now-5d"
            }
        }
    }
}

2. Re-index into several small indices

The following would reindex the data into one index per event_type:

POST _reindex
{
  "source": {
    "index": "logs-all-events"
  },
  "dest": {
    "index": "logs-2-"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'logs-2-' + (ctx._source.event_type)"
  }
}

3. Re-index into another single index but increase the number of shards:

Note that number_of_shards is a static setting: it cannot be changed on an existing index via the _settings API, so it must be specified when the new index is created.

PUT /my_new_index
{
    "settings": {
        "index": {
            "number_of_shards": 2
        }
    }
}
POST _reindex
{
  "source": {
    "index": "my_old_index"
  },
  "dest": {
    "index": "my_new_index"
  }
}

Care Points for Reindexing

Beware of duplicate data during the reindex process

If your application queries using wildcards in the index name (e.g. logs-*), then search results may be duplicated while reindexing is in progress, since there will be two copies of the data until the process completes and the old index is deleted. This can be avoided through the use of aliases.
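As a sketch of the alias approach (alias and index names here are illustrative, reusing the names from the example above): the application searches only via the alias, and once reindexing completes the alias is switched atomically from the old index to the new one, so no query ever sees both copies.

POST /_aliases
{
  "actions": [
    { "remove": { "index": "logs-all-events", "alias": "logs" } },
    { "add": { "index": "logs-2-*", "alias": "logs" } }
  ]
}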

Beware of modifications and updates during the reindex process

If documents may be updated during the reindex process, then you need a procedure to capture those updates and ensure that the new indices reflect all modifications made during the transition period. For more information, please refer to Opster's tips on reindexing.
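One possible procedure, sketched below under the assumption that documents carry a last-modified timestamp field (here hypothetically called last_updated): record the time the first reindex started, then run a second "catch-up" _reindex pass over only the documents modified since then. Setting version_type to external prevents the catch-up pass from overwriting documents that are already newer in the destination.

POST _reindex
{
  "source": {
    "index": "my_old_index",
    "query": {
      "range": {
        "last_updated": {
          "gte": "2020-07-01T00:00:00Z"
        }
      }
    }
  },
  "dest": {
    "index": "my_new_index",
    "version_type": "external"
  }
}

The gte value would be the recorded start time of the initial reindex, and the pass can be repeated until the remaining delta is small enough to switch over.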

How to Avoid it

There are various mechanisms to optimize shard size.

Apply ILM (Index Lifecycle Management)

Using ILM you can have Elasticsearch automatically roll over to a new index when the current index reaches a given maximum size, document count, or age. ILM is suitable for documents which require no updates, but it is not a viable option when documents require frequent updating, since you would need to know which of the previous index volumes contains the document you want to update.
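As an illustration, a minimal ILM policy (the policy name is hypothetical) that rolls over the write index when it reaches 50GB or 30 days, whichever comes first — with a single primary shard per index, this keeps each shard under the recommended size:

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}

Note that rollover requires writing through an alias (or data stream) whose write index the policy can replace.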

Application Strategy

You can design your application to ensure that shards are created of a sensible size:

Date based approach: 

myindex-yyyy.MM.dd

or myindex-yyyy.MM

This has the advantage that it is easy to delete old data simply by deleting older indices.

Attribute based approach

myindex-<clientID>

This has the advantage of allowing you to know exactly which index to search/update if you know the client ID. 

The disadvantage is that different clients may produce very different volumes of documents (and hence different shard sizes). This can be mitigated to some extent by adjusting the number of shards per client, but you may create the opposite problem (oversharding) if you have a large number of clients.  

ID range based approach

This may be possible where documents carry an ID coming from another system or application (such as an SQL ID, or client ID) which enables us to distribute documents evenly.

eg. myindex-<last_digit_of_ID>

The above would give us 10 indices which will probably be evenly distributed, and also deterministic (we know which index we need to update).
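For instance, following the hypothetical myindex-<last_digit_of_ID> pattern above, a document with ID 1234 would be written to (and later updated in) myindex-4, which the application can derive directly from the ID:

PUT /myindex-4/_doc/1234
{
  "client_id": 1234,
  "message": "example document"
}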


About Opster

Opster takes a different approach to Elasticsearch operations - Opster proactively troubleshoots, optimizes, automates and assists in what's needed to run Elasticsearch smoothly in production.

