As with many workloads in the cloud, when operating Elasticsearch in AWS you can choose to take the managed or the unmanaged route. With the unmanaged approach, the customer takes full responsibility for everything, from infrastructure to applications and services. This includes creating and deploying an ELK Stack and using AWS core services such as S3, RDS, and SQS.
Alternatively, you can employ the Amazon Elasticsearch Service, which manages most tasks for you, thus facilitating the deployment, operation, and scaling of Elasticsearch clusters in the AWS cloud.
In this blog post, we’ll explore both options, examining available features, capabilities, costs, and limitations.
Setup and Configuration
Responsibilities of the Unmanaged Route
Those who choose to manage Elasticsearch on their own are entirely responsible for setting things up from A to Z, including network infrastructure, OS, disks, JVM, orchestration tools, backup and restore procedures, and security. These tasks can be quite complex, requiring a great deal of time and expertise.
To demonstrate the complexity at hand, let’s look at what’s involved when deploying the ELK Stack on Amazon ECS.
First, you must customize, build, and deploy an Elasticsearch container to a private ECR registry. Second, you need to customize and deploy an ECS cluster that can run the ELK container that you set up. This requires defining multiple tasks and services.
Setting up all the necessary tasks and services involves tinkering with many settings, such as memory limits and port mapping. And if the settings aren’t tuned correctly, this could lead to performance issues. For instance, if you don’t set up cleanup procedures for your logs and regularly monitor your memory usage, the disk space may run out and cause an outage. Other settings, such as dynamically assigning a host port to multiple containers, can harm mission-critical parameters like high availability.
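To make those settings concrete, here is a minimal sketch of what an ECS container definition for an Elasticsearch node might look like. All names, image URIs, and values are hypothetical examples, not recommendations; the comments call out the two settings discussed above (memory limits and dynamic host ports).

```python
# Illustrative ECS container definition for an Elasticsearch node.
# The image URI, memory size, and ports are hypothetical examples.
elasticsearch_container = {
    "name": "elasticsearch",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/elasticsearch:7.10",  # example private ECR image
    # Hard memory limit (MiB): set it too low and the JVM gets OOM-killed,
    # too high and the container starves other tasks on the instance.
    "memory": 4096,
    "portMappings": [
        # hostPort 0 asks ECS to assign a dynamic host port; convenient,
        # but it complicates discovery and load balancing between nodes.
        {"containerPort": 9200, "hostPort": 0, "protocol": "tcp"},
        {"containerPort": 9300, "hostPort": 0, "protocol": "tcp"},
    ],
    "environment": [
        # JVM heap is typically set to roughly half the container memory limit.
        {"name": "ES_JAVA_OPTS", "value": "-Xms2g -Xmx2g"},
    ],
}
```

A definition like this would be embedded in an ECS task definition and referenced by a service; getting each of these values right for your workload is exactly the tuning work described above.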
Using AWS Managed Service
One of the greatest advantages of the AWS managed service is that it renders setup and configuration easy and immediate. With the AWS Command Line Interface (CLI) or the AWS SDK, you can easily deploy Elasticsearch, selecting the desired number of instances, instance types, and storage options. Once selected, the service does the rest—setting up the domain, provisioning infrastructure capacity, and installing Elasticsearch software.
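As a sketch of how compact that deployment step is, the parameters below describe a domain as you might pass them to the AWS SDK. The domain name, instance types, counts, and volume size are illustrative choices only; the actual call (shown in the comment) requires configured AWS credentials.

```python
# Parameters for creating a managed Elasticsearch domain via the AWS SDK.
# The domain name, sizes, and version are illustrative, not recommendations.
domain_params = {
    "DomainName": "my-search-domain",  # hypothetical name
    "ElasticsearchVersion": "7.10",
    "ElasticsearchClusterConfig": {
        "InstanceType": "r5.large.elasticsearch",
        "InstanceCount": 3,
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m5.large.elasticsearch",
        "DedicatedMasterCount": 3,
    },
    # Attach EBS volumes for data storage.
    "EBSOptions": {"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 100},
}

# With credentials configured, the actual call would look like:
#   import boto3
#   boto3.client("es").create_elasticsearch_domain(**domain_params)
```

Everything else, from provisioning the instances to installing the Elasticsearch software, happens behind that single call.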
Once the cluster is up and running, Amazon Elasticsearch Service fully manages resources and performance through automated administrative tasks, including hardware provisioning, automatic daily backups, cluster recovery after failure, and version upgrades. Managing resources is simple—with straightforward drop-down menus for adjusting instance size and other parameters.
AWS Elasticsearch Service monitors, visualizes, and analyzes certain key metrics in real time. However, alerts and events must be built from scratch with CloudWatch or set up through Open Distro alerting. In some cases, Open Distro alerts do not cover essential system statistics unless those metrics are explicitly collected and shipped into the Elasticsearch deployment itself.
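For example, building an alert from scratch with CloudWatch might look like the sketch below: an alarm on the service's ClusterStatus.red metric. The alarm name, domain name, account ID, and SNS topic are hypothetical placeholders; the actual call (in the comment) requires credentials.

```python
# CloudWatch alarm definition for a red cluster status.
# Names, account ID, and the SNS topic ARN are hypothetical examples.
alarm_params = {
    "AlarmName": "es-cluster-status-red",
    "Namespace": "AWS/ES",              # namespace for the managed service's metrics
    "MetricName": "ClusterStatus.red",  # 1 when at least one primary shard is unassigned
    "Dimensions": [
        {"Name": "DomainName", "Value": "my-search-domain"},
        {"Name": "ClientId", "Value": "123456789012"},  # AWS account ID (example)
    ],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:es-alerts"],  # example topic
}

# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Each metric you care about needs its own alarm like this, which is the "built from scratch" effort mentioned above.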
Setup and Configuration Drawbacks
Although this service integrates seamlessly with other AWS services, it only supports a very limited set of plugins. And some of the missing plugins are vital for expanding Elasticsearch capabilities and overcoming AWS limitations.
For example, the AWS service does not provide granular security controls and doesn't allow you to configure per-user access permissions. As a result, all Elasticsearch users have full data privileges and can accidentally delete other users' data or create problems by changing settings. The way to get around this is to use Open Distro security, which provides some of the X-Pack capabilities.
Comparing Costs
It's tricky to compare the overall costs of unmanaged Elasticsearch and the managed Amazon Elasticsearch Service. Cost is a function of many different factors: the amount of data, backup needs, transfer costs, and the ability to plan ahead, as well as the time and effort involved.
Although the Amazon Elasticsearch Service does not require upfront fees or minimum usage requirements, its costs can become quite high. As operations begin to scale, the managed service can become particularly costly. At this stage, you may find yourself in a kind of “vendor lock-in,” making it difficult to transition back to an unmanaged environment.
Moreover, while automated snapshots are stored in Amazon S3 for free, any additional manual snapshots are stored in your own bucket and subject to standard Amazon S3 pricing. Another thing to take into account: if you can plan ahead, choosing Reserved Instances in advance, as opposed to the on-demand option, can lower costs significantly.
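Those manual snapshots are set up through Elasticsearch's _snapshot API by registering an S3-backed repository. The sketch below shows the request body involved; the bucket, region, and IAM role ARN are hypothetical, and on the managed service the request must additionally be signed by credentials allowed to pass that role.

```python
# Request body for registering a manual snapshot repository backed by S3.
# The bucket name, region, and role ARN are hypothetical examples.
snapshot_repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "my-es-snapshots",  # example bucket; you pay S3 pricing for its contents
        "region": "us-east-1",
        "role_arn": "arn:aws:iam::123456789012:role/EsSnapshotRole",  # example role
    },
}

# The repository is registered with a signed PUT request, e.g.:
#   PUT https://<domain-endpoint>/_snapshot/manual-backups   (JSON body above)
# and a snapshot is then taken with:
#   PUT https://<domain-endpoint>/_snapshot/manual-backups/snapshot-1
```

Unlike the automated daily snapshots, anything written into this bucket shows up on your S3 bill.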
At face value, the unmanaged option is far cheaper, costing roughly 30-40% (sometimes even up to 50%) less than the managed option. Without the AWS service, however, you'll need to manage the infrastructure on your own: handling provisioning and monitoring and finding adequate tools for observability and troubleshooting. If your team doesn't have the resources to set things up correctly, optimize operations, and resolve issues quickly, costs can pile up here as well.
What About the Data?
Another factor influencing costs is the amount of data. If you have less than 5 GB of data, the service may be quite expensive, especially if you run multiple small clusters for isolation purposes. For each cluster, AWS requires three dedicated master nodes to ensure stability. As a result, storing less than 5 GB of documents may not be worth the price. You get more bang for your buck if you have over 1 TB of data.
Monitoring, System Optimization, and Maintenance
Monitoring is a challenging and important issue in both the managed and unmanaged scenarios. Let’s take a look.
On the managed side, first off, AWS Elasticsearch Service has limited access to administrative APIs, logs, and metrics. Although it provides aggregate metrics at the cluster level, it lacks some important node-level metrics and query logs. This is incredibly limiting when it comes to troubleshooting and pinpointing the root cause of any issues that arise.
When monitoring Elasticsearch on your own, you may have more freedom and greater access to system metrics and logs, but you’ll still be tasked with finding a good monitoring tool and knowing how to make the best use of it. Either way, system visibility is critical for the next step—troubleshooting and performance optimization.
An important element of managing Elasticsearch properly is the ability to adjust configurations in order to optimize performance. This points to a disadvantage in the AWS service, which only supports a limited set of operations and configuration changes. Missing capabilities include tuning important performance settings (e.g., thread pool and query cache sizes) and basics like reindexing from a remote cluster (which requires setting reindex.remote.whitelist).
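To illustrate what's blocked, here is the _reindex request body for pulling an index from a remote cluster. The remote host and index names are hypothetical; on self-managed Elasticsearch this works once the host is added to reindex.remote.whitelist in elasticsearch.yml, but the managed AWS service does not expose that setting.

```python
# _reindex request body pulling data from a remote cluster.
# The remote host and index names are hypothetical examples.
reindex_body = {
    "source": {
        # On a self-managed cluster, this host must first be listed in
        # reindex.remote.whitelist in elasticsearch.yml.
        "remote": {"host": "http://old-cluster.example.com:9200"},
        "index": "logs-2020",
    },
    "dest": {"index": "logs-2020"},
}

# Against a self-managed cluster, this would be submitted as:
#   POST http://localhost:9200/_reindex   (JSON body above)
```

Without this capability, migrating data into or out of a managed domain typically means going through snapshots or an external ETL process instead.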
Finally, the AWS service takes care of updates for you, allowing you to conveniently track progress without having to get involved. However, a significant drawback is AWS’s use of blue-green deployments to do so. In this method, all the data of the cluster is copied to the new nodes, after which the old cluster is destroyed. This process may take days, exceed the cluster’s capacity, or even cause it to crash mid-operation. As a result, upgrades, rollouts, maintenance, and even the tiniest update can become time-consuming and expensive in large deployments.
When managing Elasticsearch on your own, these operations are achieved through a rolling restart. This is more complex, but on the whole, takes much less time and is more efficient.
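The rolling-restart sequence can be sketched through the API payloads involved. This is a simplified outline of the widely used procedure, with endpoints shown in comments; each cycle targets one node at a time.

```python
# Rolling restart, one node at a time: the cluster-settings payloads involved.

# 1. Stop shard reallocation so the cluster doesn't start rebalancing
#    shards while a node is temporarily down.
disable_allocation = {
    "persistent": {"cluster.routing.allocation.enable": "primaries"}
}
# PUT /_cluster/settings   (JSON body above)

# 2. Optionally flush to speed up recovery, then stop, upgrade, and
#    restart the node.
# POST /_flush

# 3. Once the node rejoins, re-enable allocation and wait for green
#    status before moving on to the next node.
enable_allocation = {
    "persistent": {"cluster.routing.allocation.enable": None}  # null resets the setting
}
# PUT /_cluster/settings   (JSON body above)
# GET /_cluster/health?wait_for_status=green
```

Because only one node is down at a time and no data is copied wholesale, the cluster keeps serving traffic throughout, which is why this approach is typically faster than a blue-green swap.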
Mission-Critical Operations: The Case of Downtime
A built-in disadvantage of managed services is that they increase your dependence on external support teams. When the production environment goes down, you're completely dependent on AWS support, and it may take days until the issue is resolved.
Some users have also reported that the answers they receive from the AWS support team are unsatisfactory. Responses can be too general (such as "too many shards") or can propose sub-optimal, computationally expensive solutions, like increasing the instance size of the master nodes, which doubles the size of the entire cluster.
Another problem with the AWS service that increases the risk of downtime is that it does not support shard rebalancing. Thus, if a single node runs out of space, the whole cluster will stop ingesting data. Businesses with mission-critical operations should take these limitations into account, and if they decide to go with the managed option, they may be better off purchasing premium AWS support.
It’s important to note that dealing with downtime without a managed service is no less challenging. And there are plenty of examples that demonstrate the difficulty of dealing with downtime in Elasticsearch on your own.
So What’s It Going to Be?
Deciding whether to take the unmanaged route or go with the managed Amazon Elasticsearch Service is not a clear-cut choice. While the managed service unburdens users from many complex and time-consuming tasks, it is generally more expensive, has limitations, and is not a foolproof solution for operational issues or even failure.
The managed AWS service is not suited for everyone, but it can be advantageous for small- to medium-sized operations that do not run mission-critical applications or that lack the in-house Elasticsearch expertise to handle things on their own. However, if you have more than 1 TB of data, you may find that the AWS service lacks the flexibility needed to properly manage the cluster. Opting to manage everything yourself gives you greater control, but it also requires handling the inner workings of Elasticsearch, including implementing observability and troubleshooting tooling and knowing how to watch out for pitfalls.