Last updated: November 2022
Developer forums are riddled with questions about OpenSearch errors and exceptions. Although never a pleasant topic, errors and exceptions can serve as a powerful tool, illuminating deeper issues in your OpenSearch infrastructure that need to be fixed. Getting acquainted with some of the prevalent failures will not only save you time and effort, but also help ensure the overall health of your OpenSearch cluster.
At Opster, we have analyzed a wide range of OpenSearch problems to understand what caused them. In this blog post, we’ll explain why some OpenSearch errors and exceptions occur and how to avoid them, and review some general best practices that can help you identify, minimize, and handle these issues with greater efficiency.
Let’s start by taking a look at some of the recurring errors and exceptions that most OpenSearch users are bound to encounter at one point or another.
1. Mapper Parsing Exception
OpenSearch relies on mappings, also known as schema definitions, to handle each field according to its data type. In OpenSearch, a mapping defines the fields in a document and specifies their corresponding data types, such as date, long, and text.
In cases where an indexed document contains a new field without a defined data type, OpenSearch uses dynamic mapping to infer the field’s type, converting later values to that type when necessary. If OpenSearch fails to perform this conversion, it will throw the “mapper_parsing_exception failed to parse” exception. Too many of these exceptions can decrease indexing throughput, causing delays in viewing fresh data.
To avoid this issue, you can define an explicit mapping when you create an index. Alternatively, you can add new fields to an existing mapping with the /_mapping endpoint. Note that while you can add to an existing mapping, you cannot change the type of a field that is already mapped, because this would cause the data that is already indexed to be unsearchable. To make that kind of change properly, you need to create a new index with the corrected mapping and reindex the data into it.
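As a rough illustration of the distinction between adding fields and changing them, the sketch below compares an existing mapping with a proposed one. The helper name `plan_mapping_update` and the sample field names are hypothetical, and the dictionaries only mimic the `properties` section of a mapping; nothing here talks to a cluster.

```python
from typing import Dict, Tuple


def plan_mapping_update(
    existing: Dict[str, dict], proposed: Dict[str, dict]
) -> Tuple[Dict[str, dict], Dict[str, Tuple[str, str]]]:
    """Split a proposed mapping into safe additions and type conflicts.

    Both arguments mimic the "properties" section of an index mapping,
    e.g. {"created_at": {"type": "date"}}.
    """
    additions = {}   # new fields: can be sent to the /_mapping endpoint
    conflicts = {}   # changed types: need a new index plus a reindex
    for field, definition in proposed.items():
        if field not in existing:
            additions[field] = definition
        elif existing[field].get("type") != definition.get("type"):
            conflicts[field] = (existing[field].get("type"), definition.get("type"))
    return additions, conflicts


existing = {"created_at": {"type": "date"}, "views": {"type": "long"}}
proposed = {
    "created_at": {"type": "date"},
    "views": {"type": "keyword"},   # type change: not allowed in place
    "title": {"type": "text"},      # brand-new field: fine to add
}
additions, conflicts = plan_mapping_update(existing, proposed)
print(additions)
print(conflicts)
```

Anything that lands in `conflicts` is a candidate for the reindex route described above, while `additions` could be applied to the live index.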
2. Bulk Indexing Errors
It’s often more efficient to index large datasets in bulk. For example, instead of using 1,000 index operations, you can execute one bulk operation to index 1,000 documents. This can be done through the bulk API. However, this process is prone to errors and requires you to carefully check for possible problems, such as mismatched data types and nulls.
When it comes to the bulk API, you need to be extra vigilant: the request as a whole can succeed even though some of its individual index operations failed, so hundreds of positive responses may hide a handful of errors. In addition to setting up your bulk request with all the proper conditions ahead of time, check the top-level errors flag in the response and go through the list of per-item results to make sure that all of your data was indexed as expected.
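Checking each item can be sketched as follows. This assumes the usual bulk response shape (a top-level `errors` flag plus an `items` list with one entry per operation). The function name `failed_bulk_items` and the sample response are made up for illustration.

```python
from typing import Dict, List


def failed_bulk_items(response: dict) -> List[Dict[str, object]]:
    """Return one entry per failed operation in a bulk API response."""
    if not response.get("errors"):
        return []  # top-level flag reports that no operation failed
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item has a single key naming the action (index, create, ...).
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append({
                "position": position,
                "action": action,
                "id": result.get("_id"),
                "status": result.get("status"),
                "type": result["error"].get("type"),
                "reason": result["error"].get("reason"),
            })
    return failures


# Hand-written sample response: one success, one mapping failure.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "logs", "_id": "1", "status": 201}},
        {"index": {"_index": "logs", "_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception",
                             "reason": "failed to parse field [views]"}}},
    ],
}
failures = failed_bulk_items(sample_response)
for failure in failures:
    print(failure)
```

Keeping the item position makes it possible to map a failure back to the document in the original request, which is useful when documents are indexed without explicit IDs.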
3. Search Timeout Errors: ConnectionTimeout, ReadTimeoutError, RequestTimeout, and More
If a response isn’t received within the specified search time period, the request fails and returns an error message. This is called a search timeout. Search timeouts are common and can occur for many reasons, such as large datasets or memory-intensive queries.
To reduce search timeouts, you can increase the OpenSearch request timeout (the default is 30 seconds), reduce the number of documents returned per request, reduce the time range, tweak your memory settings, and optimize your query, indices, and shards. You can also enable slow search logs in order to monitor search run time, scan for heavy searches, and more.
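Several of these levers can be expressed directly in the request. The sketch below builds a search body with a bounded result size, a bounded time range, and a search-level `timeout`. It also lists slow log thresholds that could be applied to an index. The helper names, the `@timestamp` field, and all threshold values are assumptions for illustration. Note that the `timeout` in the search body is separate from the client-side request timeout mentioned above.

```python
from typing import Any, Dict


def build_bounded_search(field: str, start: str, end: str,
                         size: int = 100, timeout: str = "10s") -> Dict[str, Any]:
    """Build a search body that limits result size, time range, and run time."""
    return {
        "timeout": timeout,   # time budget for the search itself
        "size": size,         # fewer documents per request
        "query": {"bool": {"filter": [{"range": {field: {"gte": start, "lte": end}}}]}},
    }


def is_partial(response: Dict[str, Any]) -> bool:
    """True when the response reports that the search timed out."""
    return bool(response.get("timed_out", False))


# Hypothetical thresholds for the search slow log of one index.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

body = build_bounded_search("@timestamp", "now-1h", "now", size=50)
print(body)
print(is_partial({"timed_out": True, "hits": {"hits": []}}))
```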
4. All Shards Failed
When searching in OpenSearch, you may encounter an “all shards failed” error message. This happens when a read request fails to get a response from a shard. The request is then sent to a shard copy. After multiple request failures, there may be no available shard copies left. This can happen when the data is not yet searchable because the cluster or node is still in an initial start process, or when the shard is missing or in recovery mode and the cluster is red.
Many issues can cause this: a node may have disconnected or only just rejoined the cluster; the shards being queried may be in recovery and, therefore, not available; the disk may have been corrupted; a search may have been poorly written (for example, referring to a field with the wrong field type); or a configuration error may be causing an operation to fail.
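Because the causes vary so widely, it helps to look at the per-shard reasons rather than the headline message. The following sketch groups the failures in an error response by reason type. It assumes the error body carries a `failed_shards` list whose entries have a nested `reason` object. The function name and the sample payload are illustrative only.

```python
from collections import Counter
from typing import Dict


def summarize_shard_failures(error_response: dict) -> Dict[str, int]:
    """Count failed shards per underlying reason type."""
    failed = error_response.get("error", {}).get("failed_shards", [])
    counts = Counter()
    for shard in failed:
        reason = shard.get("reason", {})
        counts[reason.get("type", "unknown")] += 1
    return dict(counts)


# Hand-written sample of an "all shards failed" error body.
sample_error = {
    "error": {
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "failed_shards": [
            {"shard": 0, "index": "logs", "node": "node-a",
             "reason": {"type": "query_shard_exception",
                        "reason": "failed to create query"}},
            {"shard": 1, "index": "logs", "node": "node-b",
             "reason": {"type": "query_shard_exception",
                        "reason": "failed to create query"}},
        ],
    },
    "status": 400,
}
summary = summarize_shard_failures(sample_error)
print(summary)
```

A summary dominated by a query-related reason points at the search itself, whereas reasons related to unavailable shards point at cluster state.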
5. Process Memory Locking Failed: “memory locking requested for OpenSearch process but memory is not locked”
For your node to remain healthy, you must ensure that none of the JVM memory is ever swapped out to disk. You can do this by setting bootstrap.memory_lock to true. In addition, ensure that you’ve set up memory locking correctly by consulting the OpenSearch documentation.
If OpenSearch is unable to lock memory, you will encounter this error message: “memory locking requested for OpenSearch process but memory is not locked.” This can happen when the user running OpenSearch doesn’t have the right permissions. These permissions can be granted by running ulimit -l unlimited as root before starting OpenSearch, or by setting memlock to unlimited in /etc/security/limits.conf. Depending on how OpenSearch is started, you may also need to set MAX_LOCKED_MEMORY to unlimited (for init scripts) or LimitMEMLOCK to infinity (for systemd). Locking memory will prevent OpenSearch from becoming non-responsive and help avoid large GC pauses.
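After changing these limits, it is worth confirming that the lock actually took effect on every node. The sketch below assumes a nodes info response filtered down to the `mlockall` flag (for example, from a request such as `GET _nodes?filter_path=**.mlockall`). The function name and the node IDs are hypothetical.

```python
from typing import List


def nodes_without_memory_lock(nodes_response: dict) -> List[str]:
    """Return the IDs of nodes whose process did not lock its memory."""
    unlocked = []
    for node_id, info in nodes_response.get("nodes", {}).items():
        # A missing flag is treated as "not locked" to stay on the safe side.
        if not info.get("process", {}).get("mlockall", False):
            unlocked.append(node_id)
    return sorted(unlocked)


# Hand-written sample response with one locked and one unlocked node.
sample_nodes = {
    "nodes": {
        "node-a": {"process": {"mlockall": True}},
        "node-b": {"process": {"mlockall": False}},
    }
}
print(nodes_without_memory_lock(sample_nodes))
```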
6. OpenSearch Bootstrap Checks Failed
Bootstrap checks inspect various settings and configurations before OpenSearch starts to make sure it will operate safely. If bootstrap checks fail, they can prevent OpenSearch from starting (if you are in production mode) or issue warning logs in development mode. It’s recommended to familiarize yourself with the settings enforced by bootstrap checks, noting that they are different in development and production modes. Setting the system property es.enforce.bootstrap.checks to true does not skip the checks. Instead, it forces them to be enforced even when the node would otherwise run in development mode.
7. Transport Errors
In OpenSearch, the transport module handles communication between nodes in a cluster and is used for every call that goes from one node to another. Transport errors are generic, and failures can be due to anything from missing shards, conflicting settings, and poorly structured content to network failures and missing headers.
There are different types of transport errors. One error message, “TransportError(403, u’cluster_block_exception’, u’blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];’)”, can occur when indexes become read-only. This can happen when there isn’t enough available disk space for OpenSearch to allocate and relocate shards to and from nodes. To solve this particular issue, you can increase your disk space, delete old data to free up space, or remove the read-only block from the affected indexes by resetting the index.blocks.read_only_allow_delete setting.
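Removing the block amounts to an index settings update. As a minimal sketch, the helper below only builds the request (method, path, and JSON body) so that it can be sent with whichever client you use. The function name and the index name `logs-2022` are hypothetical. Setting the value to null resets the setting to its default.

```python
import json
from typing import Tuple


def clear_read_only_block_request(index: str) -> Tuple[str, str, str]:
    """Build the settings update that removes the read-only/allow-delete block."""
    path = f"/{index}/_settings"
    body = {"index.blocks.read_only_allow_delete": None}  # None serializes to null
    return "PUT", path, json.dumps(body)


method, path, payload = clear_read_only_block_request("logs-2022")
print(method, path)
print(payload)
```

If disk space is still short when the block is cleared, the index is likely to become read-only again, so freeing space first is the safer order.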
Another type of transport error can appear when you try to use an index that was just created, before all the shards were allocated. In this case, you will get a TransportError(503, u’’). Transport errors can also be linked to problems with mapping. For example, TransportError(400, u’mapper_parsing_exception’) can occur when you attempt to index a field with a data type that is different from its mapping.
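The three cases described in this section can be folded into a small triage helper. This is only a sketch: the function name is made up, it takes a plain status code and error type rather than any particular client’s exception object, and the hints simply restate the advice given in this post.

```python
def triage_transport_error(status: int, error_type: str = "") -> str:
    """Map a transport error's status code and type to a first step."""
    if status == 403 and error_type == "cluster_block_exception":
        return "Index is read-only: free up disk space, then remove the block."
    if status == 503:
        return "Shards may not be allocated yet: wait and retry."
    if status == 400 and error_type == "mapper_parsing_exception":
        return "Field value does not match its mapping: fix the data or the mapping."
    return "Unrecognized transport error: check the cluster logs."


print(triage_transport_error(403, "cluster_block_exception"))
print(triage_transport_error(503))
print(triage_transport_error(400, "mapper_parsing_exception"))
```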
8. Initialization/Startup Failures
Sometimes, seemingly trivial issues can prevent OpenSearch from starting. For instance, when using conflicting versions of OpenSearch, you may get error messages such as “OpenSearch java client initialization fails” or “\Common was unexpected at this time.”
How to Minimize Errors and Exceptions: Dealing with the Deeper Issues at Play
If you look beyond tackling one error message at a time, you’ll begin to notice that many errors and exceptions are linked to one of three deeper causes: issues with setup and configuration, indexing new information, or cluster slowness. Let’s take a look at some basic guidelines for tackling these problems.
- Setup and configuration: It’s easy to set up OpenSearch quickly, but making sure it’s production grade requires mindfully configuring your settings. This can help avoid a broad range of errors and exceptions, such as bootstrap checks failure.
- Indexing new information: In OpenSearch, you must use templates properly, know the schema structure, and name your fields accordingly. Paying careful attention to these parameters can help you avoid issues like mapping exceptions and bulk index errors.
- Cluster slowness: As operations begin to scale, OpenSearch can sometimes slow down unexpectedly, with timeout errors popping up left and right. For this reason, it is crucial that you constantly monitor the activity of your cluster, observing the error rate, error logs, and rejection metrics to make sure everything is operating as expected.
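One way to watch rejection metrics is to compare two snapshots of the per-node thread pool statistics and flag any pool whose rejected counter has grown. The sketch below assumes the snapshots follow the shape of a node stats response (nodes, then `thread_pool`, then a `rejected` counter per pool). The function name, node IDs, and numbers are illustrative.

```python
from typing import Dict, Tuple


def rejection_deltas(previous: dict, current: dict) -> Dict[Tuple[str, str], int]:
    """Return (node, pool) pairs whose rejected count grew between snapshots."""
    deltas = {}
    for node_id, node in current.get("nodes", {}).items():
        before = previous.get("nodes", {}).get(node_id, {}).get("thread_pool", {})
        for pool, stats in node.get("thread_pool", {}).items():
            # Pools missing from the earlier snapshot are counted from zero.
            old = before.get(pool, {}).get("rejected", 0)
            delta = stats.get("rejected", 0) - old
            if delta > 0:
                deltas[(node_id, pool)] = delta
    return deltas


# Two hand-written snapshots taken some time apart.
earlier = {"nodes": {"node-a": {"thread_pool": {
    "search": {"rejected": 5}, "write": {"rejected": 0}}}}}
later = {"nodes": {"node-a": {"thread_pool": {
    "search": {"rejected": 12}, "write": {"rejected": 0}}}}}
print(rejection_deltas(earlier, later))
```

Comparing deltas rather than absolute values matters because the rejected counters are cumulative, so an old spike would otherwise keep triggering alerts.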
Errors and exceptions are bound to arise while operating OpenSearch. Although you can’t avoid them completely, there are some best practices you can employ to help reduce them and to solve problems more efficiently when they do arise. These include paying close attention to your initial setup and configuration and being particularly mindful when indexing new information. In addition, you should have strong monitoring and observability in your system, which is the first basic component of quickly and efficiently getting to the root of complex problems like cluster slowness. In short, instead of dreading their appearance, you can treat errors and exceptions as an opportunity to optimize your OpenSearch infrastructure.
To easily solve OpenSearch errors, we recommend you try AutoOps for OpenSearch. AutoOps diagnoses issues in OpenSearch based on hundreds of metrics pulled by a lightweight agent. Once diagnosed, the system not only provides root cause analysis, but also resolves the issues. Try it for free.