Search slow logs can be very handy when troubleshooting Elasticsearch performance issues. There are two main operations in Elasticsearch (search and indexing), and each is logged separately.
This troubleshooting guide targets search-heavy systems where search TPS (transactions per second) is much higher than indexing TPS, such as e-commerce sites or Medium- and Quora-like platforms.
Slow queries are often caused by:
- Poorly written or expensive search queries.
- Poorly configured Elasticsearch clusters or indices.
- Saturated CPU, Memory, Disk and network resources on the cluster.
- Periodic background processes, such as snapshots or segment merges, that consume cluster resources (CPU, memory, disk), leaving fewer resources available for search queries and slowing them down.
- Segment merging reduces the number of segments in order to improve search latency; however, merges themselves can be expensive to perform, especially in low-IO environments.
As mentioned above, there are several potential reasons for slow queries, but in search-heavy systems the main causes are usually expensive search queries or a poorly configured Elasticsearch cluster or index. Effective use of search slow logs can dramatically reduce debugging/troubleshooting time.
Troubleshooting guide on how to use search slow queries effectively:
- Always define proper log thresholds for search slow queries in your application, with different log levels for faster debugging. For example, more than 20ms is good for TRACE logging, while more than 250ms should be logged as WARN. These thresholds suit real-time systems like e-commerce search and should be tuned according to your application's SLA.
- There are two phases of search: the query phase and the fetch phase (more details can be found in Elasticsearch Search Explained). It's important to understand how these phases work and to set a proper threshold for each one.
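For example, thresholds for both phases can be set dynamically per index with the index slow log settings (the index name `my-index` is a placeholder; the values below mirror the 20ms/250ms example above):

```
PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "250ms",
  "index.search.slowlog.threshold.query.trace": "20ms",
  "index.search.slowlog.threshold.fetch.warn": "100ms",
  "index.search.slowlog.threshold.fetch.trace": "10ms"
}
```

Any query or fetch phase that exceeds a threshold on a shard is then written to that node's search slow log at the corresponding level.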
- Slow logs are always specific to a shard, and this is where most people go wrong: they read a slow-query log entry without understanding the full picture. More information on how Elasticsearch shards affect performance can be found in the Search Latency troubleshooting guide.
- A slow log entry includes the search type of the query. The default is query_then_fetch, but it can be dfs_query_then_fetch, which produces more accurate scores at the cost of performance. Hence it's important to look out for this.
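As a sketch, the DFS variant is requested per search via the `search_type` URL parameter (the index and query below are placeholders):

```
GET /my-index/_search?search_type=dfs_query_then_fetch
{
  "query": { "match": { "title": "elasticsearch" } }
}
```

If such requests show up in the slow logs, the extra pre-query term-statistics round trip may be part of the latency.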
- Adding up the response times of all slow-log entries across the relevant shards does not give the overall search time, as it excludes the time spent gathering results from all shards and fetching the top results (the fetch phase).
- To solve the issue mentioned in the previous point, always have a trace log in your application that tracks the `took` param of the Elasticsearch response. This is the total time taken by a single Elasticsearch query across all its relevant shards, including sending requests to the shards and gathering and combining their results, which makes `took` the correct indicator of the total time taken by a query.
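A minimal sketch of such application-side tracking, assuming the search response is already parsed into a dict (the function name, logger name, and 250ms threshold are illustrative, not from any library):

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("es-search")

# Hypothetical WARN threshold in ms; tune to your application's SLA.
TOOK_WARN_MS = 250

def log_took(response: dict, query_name: str) -> int:
    """Log the 'took' field of an Elasticsearch search response.

    Unlike a per-shard slow-log entry, 'took' covers the whole request:
    fan-out to all relevant shards, the query phase on each shard, and
    gathering/merging the top results.
    """
    took_ms = response["took"]
    if took_ms > TOOK_WARN_MS:
        log.warning("query %s took %dms (> %dms)", query_name, took_ms, TOOK_WARN_MS)
    else:
        log.debug("query %s took %dms", query_name, took_ms)
    return took_ms

# Example with a stubbed response body:
sample = {"took": 312, "timed_out": False, "hits": {"total": {"value": 10}}}
log_took(sample, "product-search")
```

Comparing this end-to-end `took` value with the per-shard slow-log times shows how much latency comes from coordination and fetching rather than from any single shard.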
- If you are dealing with a multi-tenant Elasticsearch cluster that hosts multiple indices, checking the slow logs of just one problematic index won't be sufficient, as slow logs on that index are sometimes caused by heavy searches on other indices. It's therefore better to look at slow searches across the entire cluster (or at least in the big indices) when issues start.
- Some examples of heavy search queries are regex queries, prefix queries, heavy aggregations, match_all queries, huge `size` values, and deep-pagination queries. Filter the search slow logs for these queries and see how they are performing.
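This filtering can be sketched with a few pattern matches over slow-log lines. The patterns and sample log snippets below are assumptions for illustration; the exact slow-log format varies by Elasticsearch version, so adjust the patterns to your logs:

```python
import re

# Hypothetical patterns marking known-expensive query types in the
# source[...] part of a search slow-log line.
HEAVY_PATTERNS = [
    r'"regexp"', r'"prefix"', r'"wildcard"', r'"match_all"',
    r'"size"\s*:\s*[1-9]\d{3,}',   # size in the thousands
    r'"from"\s*:\s*[1-9]\d{3,}',   # deep pagination
]
HEAVY_RE = re.compile("|".join(HEAVY_PATTERNS))

def heavy_slowlog_lines(lines):
    """Return only the slow-log lines whose query source looks heavy."""
    return [line for line in lines if HEAVY_RE.search(line)]

# Example with two fabricated slow-log source snippets:
logs = [
    'source[{"query":{"match_all":{}},"size":5000}]',
    'source[{"query":{"match":{"title":"shoes"}},"size":10}]',
]
print(heavy_slowlog_lines(logs))  # keeps only the match_all/size:5000 line
```

Running such a filter over the slow logs of all big indices quickly surfaces the expensive query shapes worth rewriting first.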