How to Roll Up Data in Elasticsearch

Opster Team

March 2021


Why you may want to roll up your data

The cost of running an Elasticsearch cluster is largely proportional to the volume of data stored on it. With time-based data, old data is typically queried less often than new data, and is often used only to look at the “bigger picture” or to compare historical trends.

Rollup jobs provide a way to drastically reduce storage costs for old data by storing documents that summarize the data for a given time period. You keep the ability to search on key parameters of that data, albeit at reduced granularity. For example, you may be storing metrics about CPU and disk usage which are recorded every minute. You could set up a rollup job to summarize this data in hourly buckets.
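To make the idea of hourly buckets concrete, here is a minimal Python sketch of what summarizing per-minute metrics amounts to. It only illustrates the concept and is not how Elasticsearch stores rollup documents internally; the sample documents and the roll_up_hourly function are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def roll_up_hourly(docs):
    """Summarize per-minute documents into one document per (hour, node)."""
    buckets = defaultdict(list)
    for doc in docs:
        ts = datetime.fromisoformat(doc["@timestamp"])
        # Truncate the timestamp to the start of its hour (the bucket key).
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour.isoformat(), doc["node"])].append(doc["cpu"])

    rolled = []
    for (hour, node), values in sorted(buckets.items()):
        rolled.append({
            "@timestamp": hour,
            "node": node,
            "cpu": {
                "min": min(values),
                "max": max(values),
                "sum": sum(values),
                "value_count": len(values),
            },
        })
    return rolled


# Hypothetical per-minute samples from two nodes.
raw_docs = [
    {"@timestamp": "2021-03-01T10:00:00", "node": "node-1", "cpu": 20},
    {"@timestamp": "2021-03-01T10:01:00", "node": "node-1", "cpu": 40},
    {"@timestamp": "2021-03-01T10:02:00", "node": "node-2", "cpu": 10},
    {"@timestamp": "2021-03-01T11:00:00", "node": "node-1", "cpu": 30},
]

summary = roll_up_hourly(raw_docs)
for doc in summary:
    print(doc)
```

Four raw documents collapse into three summaries here; with a real per-minute feed, up to sixty documents per node would collapse into one per hour, at the price of no longer being able to see individual minutes.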

Define a rollup job

You can define a rollup job using the following:

PUT _rollup/job/metrics
{
  "index_pattern": "metrics-*",
  "rollup_index": "metrics_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "60m"
    },
    "terms": {
      "fields": [ "node", "environment"]
    }
  },
  "metrics": [
    {
      "field": "cpu",
      "metrics": [ "min", "max", "sum", "avg" ]
    },
    {
      "field": "disk",
      "metrics": [ "avg", "max" ]
    }
  ]
}

Cron

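Cron defines how often the rollup job runs. Elasticsearch cron expressions start with a seconds field, so the expression in the example above ("*/30 * * * * ?") triggers the job every 30 seconds. You may prefer to spread the load across frequent short runs, or to run a single long job at 2am, depending on the load profile of your Elasticsearch cluster.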
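Because the cron syntax here has a leading seconds field, it is easy to misread an expression. The short Python sketch below labels each field of an expression so the schedule can be checked by eye. The describe_cron helper and the "0 0 2 * * ?" expression (a daily run at 2am) are illustrative assumptions, not part of the rollup API, and the helper does not validate the values themselves.

```python
# Field order used by Elasticsearch cron expressions; the year is optional.
CRON_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]


def describe_cron(expression):
    """Return a mapping of field name to value for a 6- or 7-field expression."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


every_30_seconds = describe_cron("*/30 * * * * ?")  # frequent short runs
nightly_at_2am = describe_cron("0 0 2 * * ?")       # one run per day at 02:00

print(every_30_seconds)
print(nightly_at_2am)
```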

Groups

Groups define the “buckets” into which your data should be summarized. Within groups:

1. Date Histogram

A date histogram is mandatory.

You must define either a “calendar_interval” (e.g. 1M for one month) or a “fixed_interval” (e.g. 2h, 1d).

Bear in mind that calendar intervals such as months do not give you evenly sized buckets, so it is generally preferable to use fixed intervals.

2. Terms

It is also important to include any other “buckets” which you may want to use to classify your data. For example, you may want to aggregate your data into sub-buckets by node or environment.

Think carefully about which fields to include here: once the data is rolled up, you can only subdivide it by the fields listed in this section. However, you should avoid including fields with high cardinality, since each additional distinct value multiplies the number of summary documents and so increases the size of the rolled up index on disk.
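The effect of cardinality can be estimated with simple arithmetic. The sketch below computes a rough upper bound on the number of summary documents, assuming one document per combination of time bucket and term values. The max_rollup_docs function, the cardinalities, and the user_id field are all hypothetical; in practice only combinations that actually occur in the data produce documents, so real counts are usually lower.

```python
from math import prod


def max_rollup_docs(time_buckets, term_cardinalities):
    """Upper bound: time buckets multiplied by every term field's cardinality."""
    return time_buckets * prod(term_cardinalities.values())


# One day of hourly buckets, grouped by node and environment.
low = max_rollup_docs(24, {"node": 10, "environment": 3})

# The same day, with a high-cardinality field added to the terms group.
high = max_rollup_docs(24, {"node": 10, "environment": 3, "user_id": 50_000})

print(low)
print(high)
```

With these assumed numbers the bound grows from 720 to 36,000,000 documents per day, which would defeat much of the purpose of rolling up the data.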

Metrics

Define the metrics you will want to analyze and the aggregations you require.

Possible values are min, max, sum, avg, and value_count. Again, it will not be possible to query on metrics which are not included here.
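If you generate job definitions from a script, it can help to check the metric names before sending the request. The following Python sketch builds the "metrics" section shown earlier and rejects anything outside the list above. The build_metrics helper is a hypothetical convenience, not part of any Elasticsearch client.

```python
SUPPORTED_METRICS = {"min", "max", "sum", "avg", "value_count"}


def build_metrics(fields):
    """Build the "metrics" section from a mapping of field name to metric names."""
    config = []
    for field, metrics in fields.items():
        unsupported = set(metrics) - SUPPORTED_METRICS
        if unsupported:
            raise ValueError(
                f"unsupported metrics for {field!r}: {sorted(unsupported)}"
            )
        config.append({"field": field, "metrics": list(metrics)})
    return config


# Same fields and metrics as the job definition earlier in this guide.
metrics_config = build_metrics({
    "cpu": ["min", "max", "sum", "avg"],
    "disk": ["avg", "max"],
})
print(metrics_config)
```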

Starting and stopping the rollup job

Replace <job_id> with the name of your job (metrics in the example above):

POST _rollup/job/<job_id>/_start
POST _rollup/job/<job_id>/_stop
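Outside of the Kibana console, the same calls can be made from a script. This is a minimal Python sketch using only the standard library. The cluster address (http://localhost:9200) and the rollup_job_request helper are assumptions, and authentication and TLS are left out. The request is built but not sent.

```python
from urllib.parse import quote
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed cluster address; adjust as needed


def rollup_job_request(job_id, action):
    """Build (without sending) a POST request that starts or stops a rollup job."""
    if action not in ("_start", "_stop"):
        raise ValueError("action must be '_start' or '_stop'")
    url = f"{ES_URL}/_rollup/job/{quote(job_id, safe='')}/{action}"
    return Request(url, method="POST")


start_request = rollup_job_request("metrics", "_start")
stop_request = rollup_job_request("metrics", "_stop")
print(start_request.get_method(), start_request.full_url)

# To send a request against a running cluster:
#   from urllib.request import urlopen
#   urlopen(start_request)
```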

Searching rolled up data

You can search rolled up data using the same syntax as you would when searching standard data, with the restriction that "size" must be 0 or omitted, because rolled up indices contain only summarized data and cannot return individual hits. You can also search rolled up data combined with regular data indices; Elasticsearch will work out which combination of rolled up data and indices to use to optimize the results. The only thing you need to do is use the dedicated rollup endpoint, _rollup_search, instead of _search.

GET /metrics_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_cpu": {
      "max": {
        "field": "cpu"
      }
    }
  }
}


