5 Reasons Why Your Elasticsearch Is Not Production Ready

November 2019 | Reading Time: 3:45 Min | Ori Shafir

Elasticsearch, an open-source, full-text search engine, allows for massive volumes of data to be stored, searched, and analyzed rapidly in near real-time. As one of the most popular search engines, it is used by companies of all sectors and sizes, including giants such as Facebook, Netflix, and LinkedIn. Elasticsearch is employed behind the scenes, integrating with backend infrastructure where it provides the underlying technology that powers applications.

Elasticsearch teams have made a tremendous effort in designing Elasticsearch so that it can be set up fairly quickly and reliably, without having to invest much thought in its initial configuration. When a new cluster is first created, the scale is usually small, and everything runs smoothly out-of-the-box.

However, unforeseen complications begin to arise once the Elasticsearch cluster begins to scale. As the cluster is loaded with more and more data, and indexing and searches are run more frequently, companies begin to experience severe problems such as outages, degraded performance, data loss, and security breaches. Too often, by the time a company realizes that Elasticsearch requires additional resources, time, and/or expertise, it has already become a central component of their operations.

At Opster, we’ve seen many potentially disastrous mistakes made when working with Elasticsearch. In this blog post, we present five major concerns that should be addressed before your Elasticsearch, whether already in production or not, can be considered truly production-ready.   

Neglecting to Look Inside 

It’s enticing to deploy Elasticsearch and just forget about its inner workings. But Elasticsearch can suddenly slow down, nodes can get disconnected, and systems can even crash unexpectedly. Without proper monitoring and observability, you won’t know why this happened, how it can be fixed, or how to avoid the problem in the future.

Monitoring and observability are critical, not just for when things break down, but also for the relentless optimization required of enterprises that wish to maintain their competitive edge. While monitoring reveals whether or not a system is operating as expected, it can’t improve current performance, and it doesn’t explain why something isn’t working the way it should. This is where observability comes in.

Observability gives an end-to-end view of processes, detecting undesirable behavior (such as downtime, errors, and slow response time) and identifying the root causes of problems.

Observability is achieved using logs, metrics, and traces—three powerful tools that are often referred to as the three pillars of observability.

When complex distributed systems start to malfunction, good visibility is crucial for pinpointing the root of the problem and significantly reducing time to resolution. The Elasticsearch community provides free open-source monitoring tools that can help enhance visibility, such as Cerebro and the Prometheus Elasticsearch exporter.
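As a rough illustration of the kind of signal such tools surface, the sketch below turns a cluster health response into a list of warnings. It assumes you have already fetched the JSON body of `GET _cluster/health`; the function name and the sample values are hypothetical, and a real setup would likely track many more metrics than these three fields.

```python
from typing import Any, Dict, List


def summarize_cluster_health(health: Dict[str, Any]) -> List[str]:
    """Return human-readable warnings for a _cluster/health response body."""
    warnings: List[str] = []

    # "green" means all primary and replica shards are assigned.
    status = health.get("status", "unknown")
    if status != "green":
        warnings.append(f"cluster status is {status}")

    # Unassigned shards usually explain a yellow or red status.
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        warnings.append(f"{unassigned} unassigned shard(s)")

    # A growing pending-task queue can point to an overloaded master node.
    pending = health.get("number_of_pending_tasks", 0)
    if pending > 0:
        warnings.append(f"{pending} pending cluster task(s)")

    return warnings


# Hypothetical sample, shaped like a GET _cluster/health response.
sample = {
    "cluster_name": "demo",
    "status": "yellow",
    "number_of_nodes": 3,
    "unassigned_shards": 2,
    "number_of_pending_tasks": 0,
}

for line in summarize_cluster_health(sample):
    print("WARNING:", line)
```

A check like this only tells you that something is wrong. Finding out why still depends on the logs, metrics, and traces described above.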

Misconfigured Circuit Breakers  

In Elasticsearch, circuit breakers are used to limit memory usage so that operations do not cause an OutOfMemoryError. Sometimes, a modest adjustment to your circuit breakers can make the difference between high-performing clusters and detrimental downtime. Elasticsearch queries, whether initiated directly by users or by applications, can become extremely resource-intensive. While the default circuit breaker settings may be adequate in some cases, often adjusting breaker limits is absolutely necessary to ensure that queries do not impede performance or cause outages due to running out of memory (OOM).

Plaid, a financial technology company, learned this lesson the hard way. In early 2019, Plaid experienced recurring outages that lasted for over two weeks. Plaid’s Elasticsearch cluster would crash several times a week as a result of dying data nodes, and these crashes impaired the work of multiple teams. After rigorous investigation, Plaid found that the events were caused by memory-intensive queries. After configuring the circuit breakers, i.e., setting a memory usage cap on individual queries, the queries that used to lead to OOM simply gave an error message rather than crashing the cluster.
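To make this concrete, here is a minimal sketch of a helper that builds a request body for `PUT _cluster/settings` with tighter breaker limits. The setting names `indices.breaker.request.limit` and `indices.breaker.total.limit` are standard Elasticsearch settings, but the helper itself and the percentages are illustrative assumptions, not Plaid's actual values or a recommendation for your cluster.

```python
import json
from typing import Dict


def breaker_settings(request_limit_pct: int, total_limit_pct: int) -> Dict[str, Dict[str, str]]:
    """Build a cluster-settings body that caps circuit breaker memory usage.

    Both limits are percentages of the JVM heap. The request breaker
    should not be allowed to exceed the parent (total) breaker.
    """
    if not 0 < request_limit_pct <= total_limit_pct <= 100:
        raise ValueError("expected 0 < request limit <= total limit <= 100")
    return {
        "persistent": {
            # Memory allowed for per-request structures such as aggregations.
            "indices.breaker.request.limit": f"{request_limit_pct}%",
            # Parent breaker: upper bound across all child breakers.
            "indices.breaker.total.limit": f"{total_limit_pct}%",
        }
    }


# Illustrative values only; tune against your own heap size and workload.
body = breaker_settings(request_limit_pct=40, total_limit_pct=70)
print(json.dumps(body, indent=2))
```

With limits like these in place, a query that would exceed the cap is rejected with an error instead of exhausting the heap, which is the behavior described in the Plaid example.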

Poorly Configured Security Settings  

It’s dangerously easy to misconfigure Elasticsearch security settings. If you are not proactive about your security settings, your Elasticsearch database can be exposed or leaked. Common security oversights include exposing the Elasticsearch REST API to the public internet, not changing default passwords, and neglecting to encrypt data in transit or at rest. These oversights can leave Elasticsearch servers vulnerable to malware or ransomware and subject data to theft or corruption.
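Some of these oversights can be caught with a simple configuration audit. The sketch below assumes the node's `elasticsearch.yml` has already been parsed into a flat dictionary of string values, and it assumes a distribution that includes the X-Pack security settings. The function name is hypothetical, and treating a missing key as "disabled" is an assumption that matches older releases but not necessarily yours.

```python
from typing import Dict, List


def audit_security_settings(settings: Dict[str, str]) -> List[str]:
    """Flag common security gaps in a flattened elasticsearch.yml."""
    findings: List[str] = []

    # Assumption: a missing key means the feature is off.
    if settings.get("xpack.security.enabled", "false") != "true":
        findings.append("authentication is not enabled (xpack.security.enabled)")

    if settings.get("xpack.security.http.ssl.enabled", "false") != "true":
        findings.append("REST traffic is not encrypted (xpack.security.http.ssl.enabled)")

    if settings.get("xpack.security.transport.ssl.enabled", "false") != "true":
        findings.append("node-to-node traffic is not encrypted (xpack.security.transport.ssl.enabled)")

    # Binding to every interface is risky unless a firewall sits in front.
    if settings.get("network.host") == "0.0.0.0":
        findings.append("network.host binds to all interfaces")

    return findings


# Hypothetical configuration with authentication on but no TLS.
sample_settings = {
    "network.host": "0.0.0.0",
    "xpack.security.enabled": "true",
}

for finding in audit_security_settings(sample_settings):
    print("FINDING:", finding)
```

This covers only node configuration. It does not detect unchanged default passwords or unencrypted data at rest, which need separate checks.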

These threats are painfully real. In 2019 alone, there were numerous data breaches that spanned multiple sectors and countries. It has recently been uncovered that the personal information of 2.5 million customers of the multinational cosmetics brand Yves Rocher was leaked from an unsecured Elasticsearch database. The exposed information included personal addresses, phone numbers, and purchase histories that could easily be exploited by hackers. In August 2019, Chile’s Electoral Service confirmed that the voter information of over 14.3 million citizens of Chile, accounting for 80% of the country’s population, was exposed online from an Elasticsearch database. The data included sensitive information such as citizens’ names, home addresses, ages, genders, and tax ID numbers. In July, Honda accidentally exposed 40GB of critical company data, including information about security systems, networks, and technical data that could pave the way for a major cyberattack on the company. The data leak was a product of misconfigured permissions and poor employee training. Earlier this year, over 24 million financial records belonging to some of the largest US banks were temporarily exposed on an open Elasticsearch server. The records contained highly sensitive information such as social security numbers, addresses, and phone numbers as well as mortgage, loan, and credit card reports. These examples are just the tip of the iceberg.

Even if your Elasticsearch is configured properly with optimal security settings, unprotected Kibana instances can still compromise your data. Part of the ELK stack, Kibana is an open-source project that performs data analytics and visualization of Elasticsearch data. The platform performs advanced analytics on data that it pulls from Elasticsearch databases, which it presents graphically through charts, tables, and maps. The problem is that Kibana isn’t equipped with comprehensive built-in security, especially when being used with the free open-source version of Elasticsearch.

Thousands of Kibana instances are currently exposed on the internet. Hackers can use these instances to gain access to company databases through the Kibana dashboard. It is, therefore, imperative for companies to secure their exposed instances with passwords, make sure all software is up-to-date, and monitor existing servers to make sure private data is not leaking.

Disks and Data Loss

Developer forums are filled with confusion about lost data nodes and unassigned shards in Elasticsearch. This calls to attention the necessity of handling disks mindfully to avoid losing data. If you’re not careful when selecting disks for your data nodes, you might find that shards are unassigned and that data is lost after restart. Ensure that data and master-eligible nodes are using persistent storage. 

In the case of ephemeral disks, however, this is not enough. It is common to select ephemeral disks for their high performance and cost-efficiency; but, without taking the proper precautions, this choice can lead to data loss. When using ephemeral disks, you must have more than one copy of each shard and have a reliable procedure in place to restore data in case all copies are gone.
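One way to verify the "more than one copy of each shard" rule is to scan index settings for indices with no replicas. The sketch below assumes the input is the parsed JSON body of `GET _all/_settings`, where `number_of_replicas` is reported as a string per index. The function name and sample index names are hypothetical.

```python
from typing import Any, Dict, List


def indices_without_replicas(all_settings: Dict[str, Any]) -> List[str]:
    """Return the names of indices whose shards exist as a single copy."""
    risky: List[str] = []
    for name, body in all_settings.items():
        index_settings = body.get("settings", {}).get("index", {})
        # The API returns the replica count as a string, e.g. "1".
        replicas = int(index_settings.get("number_of_replicas", 1))
        if replicas < 1:
            risky.append(name)
    return sorted(risky)


# Hypothetical sample, shaped like a GET _all/_settings response.
sample_settings = {
    "orders-2019.11": {"settings": {"index": {"number_of_replicas": "1"}}},
    "logs-2019.11": {"settings": {"index": {"number_of_replicas": "0"}}},
}

print(indices_without_replicas(sample_settings))
```

An index listed here would lose data if the single node holding one of its shards lost its ephemeral disk, unless it can be restored from a snapshot or rebuilt from another source.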

Neglecting Backup and Restore

Although everyone agrees that backup and restoration are important, many companies do not have sufficient backup and restore strategies in place for their Elasticsearch clusters.

There’s a lot to take into account when protecting data in Elasticsearch. For starters, you should make sure that all your important information is backed up. This may seem obvious, but, because indices are added constantly, you may not have snapshots of all your vital indices, backup may not run as often as it should, and backup processes may fail silently—oversights that you may only discover after it’s too late. Keep in mind that running backup procedures is resource-intensive, so it should be done when the cluster is less loaded.
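The coverage problem described above can be checked mechanically. The following sketch assumes you already have the list of index names you consider vital and the `snapshots` array from a `GET _snapshot/<repository>/_all` response. The function name, the 24-hour threshold, and the sample data are illustrative assumptions.

```python
from typing import Any, Dict, Iterable, List


def unprotected_indices(
    index_names: Iterable[str],
    snapshots: List[Dict[str, Any]],
    now_ms: int,
    max_age_ms: int,
) -> List[str]:
    """Return indices that no recent, successful snapshot contains."""
    covered = set()
    for snap in snapshots:
        # Ignore failed or partial snapshots: they may not be restorable.
        if snap.get("state") != "SUCCESS":
            continue
        # Ignore snapshots older than the allowed backup age.
        if now_ms - snap.get("end_time_in_millis", 0) > max_age_ms:
            continue
        covered.update(snap.get("indices", []))
    return sorted(set(index_names) - covered)


HOUR_MS = 60 * 60 * 1000
now = 100 * HOUR_MS  # hypothetical "current time" in epoch milliseconds

# Hypothetical snapshot metadata using the field names the snapshot API returns.
sample_snapshots = [
    {"snapshot": "nightly-1", "state": "SUCCESS",
     "end_time_in_millis": now - 2 * HOUR_MS, "indices": ["orders"]},
    {"snapshot": "nightly-2", "state": "FAILED",
     "end_time_in_millis": now - 1 * HOUR_MS, "indices": ["orders", "users"]},
    {"snapshot": "old", "state": "SUCCESS",
     "end_time_in_millis": now - 72 * HOUR_MS, "indices": ["payments"]},
]

missing = unprotected_indices(
    ["orders", "users", "payments"], sample_snapshots, now, 24 * HOUR_MS
)
print(missing)
```

Running a check like this on a schedule would surface both newly added indices that were never snapshotted and backups that fail silently.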

Even if your backup appears to be running perfectly, you should periodically execute restore procedures to make sure that the data is truly restorable. This can be very time-consuming, so it is advisable to predetermine the order of restoration, ensuring that the most vital data is taken care of first.   

Sometimes it’s wiser not to use backup and restore at all. When Elasticsearch mirrors another data source, i.e., it is not the single point of truth, it might be advisable to reconstruct the indices from scratch by reindexing data from that source. This might take longer, depending on the nature of the data, but it can take the load off your Elasticsearch backup processes, mitigating costs and reducing storage space.

One Last Tip: Make Sure You Have the Right Elasticsearch License

Elasticsearch offers three major subscription types: the OSS distribution (Apache License 2.0), which is free and not limited; a basic X-Pack license, which is free and limited; and a paid X-Pack Elastic license, which provides access to additional features and capabilities such as SQL functionality and machine learning.

It’s important to consciously choose and manage your Elasticsearch license type to ensure that it is best suited for your needs. Serious problems can arise when you have the wrong license type. Your company may be legally bound to have an Elasticsearch license, and, without it, you may be subject to a lawsuit. Even if you have a paid license, the license is still limited, which means you still run the danger of using Elasticsearch illegally, exposing your company to legal complications. Finally, you may have purchased a subscription that is too expensive for your budget. By the time you realize this, your operations may already depend on paid Elasticsearch features.

Summary

Elasticsearch is a powerful and widely-used search engine that is at the core of many of today’s technological platforms. It may be easy to manage at first, but as your business scales, you will encounter serious problems if you have not taken some necessary precautions. To ensure that your Elasticsearch is fully prepared for production, it’s imperative that you avoid the major pitfalls detailed above.
