Elasticsearch Slow Query Troubleshooting Guide

Opster Team

March 2021


In addition to reading this guide, we recommend you run the Elasticsearch Health Check-Up. It will detect issues and improve your Elasticsearch performance by analyzing your shard sizes, threadpools, memory, snapshots, disk watermarks and more.

The Elasticsearch Check-Up is free and requires no installation.


If you’re suffering from search latency issues or poor search performance, you should run Opster’s free Search Log Analyzer to optimize your searches.

With Opster’s Analyzer, you can easily locate slow searches and understand what caused them to add additional load to your system. You’ll receive customized recommendations on how to reduce search latency and improve your search performance. The tool is free and takes just 2 minutes to run.

Run the Search Log Analyzer to receive recommendations like this:

These are the 3 slowest searches in your system: 18738, 38578, 88364

Description

These searches are causing queues to build, increasing search response times and provoking timeouts. The slowest search took 4.399 seconds, and 14% of your searches were slower than average.


Recommendations

Based on your specific Elasticsearch deployment, you should change the following configurations in order to improve search speed...

curl -XPUT -H "Content-Type: application/json" [customized recommendation]

Overview

This guide explains how to use slow logs to detect and troubleshoot issues related to slow queries.

To read more about slow logs and how to use them, see this guide on how to activate and use Elasticsearch slow logs.
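As a minimal sketch of what activation looks like, search slow log thresholds can be set dynamically through the index settings API. The index name, host, and threshold values below are illustrative and should be adapted to your own deployment:

```shell
# Enable search slow logs on a hypothetical index "my-index",
# with separate thresholds for the query and fetch phases.
curl -XPUT "localhost:9200/my-index/_settings" -H "Content-Type: application/json" -d'
{
  "index.search.slowlog.threshold.query.warn": "250ms",
  "index.search.slowlog.threshold.query.trace": "20ms",
  "index.search.slowlog.threshold.fetch.warn": "100ms",
  "index.search.slowlog.threshold.fetch.trace": "10ms"
}'
```

Because these are dynamic settings, they take effect immediately and can be adjusted per index without a restart.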

Slow queries are often caused by:

  1. Poorly written or expensive search queries.
  2. Poorly configured Elasticsearch clusters or indices.
  3. Saturated CPU, memory, disk, or network resources on the cluster.
  4. Periodic background processes, such as snapshots or segment merging, that consume cluster resources (CPU, memory, disk), leaving fewer resources available for search queries and causing them to perform slowly.
  5. Segment merging, which reduces the number of segments to improve search latency, but can itself be expensive to perform, especially in low-IO environments.
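Resource saturation (cause 3 above) often shows up as search threadpool queues building on specific nodes, which can be checked with the `_cat/thread_pool` API. A minimal sketch of spotting queue buildup from that output, assuming columns requested via `?h=node_name,name,active,queue,rejected` (the sample values below are made up):

```python
# Parse the text table returned by GET _cat/thread_pool/search?v and flag
# nodes whose search queue is building up. SAMPLE is fabricated output
# for illustration; in practice you would feed in the real API response.
SAMPLE = """\
node_name name   active queue rejected
node-1    search 13     0     0
node-2    search 13     847   112
node-3    search 2      0     0
"""

def saturated_nodes(cat_output: str, queue_threshold: int = 100):
    """Return node names whose search threadpool queue exceeds the threshold."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    idx = {col: i for i, col in enumerate(header)}
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        if int(fields[idx["queue"]]) > queue_threshold:
            flagged.append(fields[idx["node_name"]])
    return flagged

print(saturated_nodes(SAMPLE))  # → ['node-2']
```

A node with a persistently deep queue (and growing `rejected` count) is a sign that slow queries are piling up faster than that node can serve them.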

As mentioned above, there are several potential reasons for slow queries, but in search-heavy systems the main causes are usually expensive search queries or a poorly configured Elasticsearch cluster or index. Effective use of search slow logs can dramatically reduce debugging and troubleshooting time.

Troubleshooting guide on how to use slow logs effectively

  1. Always define proper log thresholds for search slow queries in your application, with different log levels for faster debugging. For example, more than 20ms is a good threshold for TRACE logging, while more than 250ms should be logged as WARN. These thresholds suit real-time systems like e-commerce search and should be tuned to your application's SLA.
  2. A search has two phases: the query phase and the fetch phase. More details can be found in this guide on how Elasticsearch search works. It’s important to understand how these phases work and to set a proper threshold for each one.
  3. Slow logs are always specific to a shard, and this is where most people go wrong: they look at the slow query log without understanding the full picture. More information on the important role shards play in performance can be found in the search latency troubleshooting guide.
  4. A slow log entry includes the phase it belongs to. The default is query-then-fetch, but it can be DFS-query-then-fetch, which produces better search scores at the cost of performance, so it’s important to look out for these.
  5. Adding up the response times of all slow queries across the relevant shards doesn’t give the overall search time, because it doesn’t include the time spent gathering results from all shards and fetching the top results (the fetch phase).
  6. To solve the issue mentioned in the previous point, always have a trace log in your application that tracks the “took” parameter of the Elasticsearch response. This is the total time taken by a single query across all of its relevant shards, including sending requests to the shards and gathering and combining their results, which makes it the correct indicator of the total time taken by a query.
  7. If you are dealing with a multi-tenant Elasticsearch cluster hosting multiple indices, checking the slow logs of just one problematic index won’t be sufficient, as slow logs on that index are sometimes caused by heavy searches on other indices. It’s therefore better to look at slow searches across the entire cluster (or at least across the big indices) when issues start.
  8. Some examples of heavy search queries are regex queries, prefix queries, heavy aggregations, match_all queries, queries with a very large size parameter, and deep pagination queries. Filter the search slow logs for these queries and see how they perform.
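Since each slow log entry is per shard (point 3 above), a small script that extracts the shard and `took_millis` from the log lines makes it easier to rank the slowest entries before cross-checking them against the response-level “took” your application records. A minimal sketch, assuming the classic plain-text slow log layout; the fields shown are typical, but the exact format varies by Elasticsearch version, and the sample lines are fabricated:

```python
import re

# Extract (index, shard, took_millis) from plain-text search slow log lines
# and list the slowest shard-level entries first. SLOWLOG holds made-up
# sample lines following the general slow log layout.
SLOWLOG = [
    '[2021-03-01T12:00:01,234][WARN ][i.s.s.query] [node-1] [my-index][0] '
    'took[4.3s], took_millis[4399], search_type[QUERY_THEN_FETCH], source[{"query":{"match_all":{}}}]',
    '[2021-03-01T12:00:02,001][WARN ][i.s.s.query] [node-1] [my-index][1] '
    'took[310ms], took_millis[310], search_type[QUERY_THEN_FETCH], source[{"query":{"match_all":{}}}]',
]

# [index][shard] followed later on the line by took_millis[...]
PATTERN = re.compile(r'\[(?P<index>[\w.-]+)\]\[(?P<shard>\d+)\] .*took_millis\[(?P<ms>\d+)\]')

def slowest(lines, top=3):
    """Return (index, shard, took_millis) tuples sorted slowest-first."""
    entries = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            entries.append((m.group("index"), int(m.group("shard")), int(m.group("ms"))))
    return sorted(entries, key=lambda e: e[2], reverse=True)[:top]

print(slowest(SLOWLOG))  # slowest shard-level entries first
```

Remember that these per-shard times will not sum to the response-level “took” (points 5 and 6 above); use them to find which shards are slow, and the “took” parameter for the true end-to-end time.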


Run the Check-Up to get a customized report like this:

Analyze your cluster