Elasticsearch Rollup: How to Roll Up Data in Elasticsearch

By Opster Team

Updated: Mar 22, 2023

Why you may want to roll up your data

The cost of running an Elasticsearch cluster is largely proportional to the volume of data stored on it. If you are storing time-based data, it’s common to find that old data is queried less often than new data, and that old data is often only used to look at the “bigger picture” or to compare historical trends.

Rollup jobs provide a way to drastically reduce the storage cost of old data by storing documents that summarize the data for a given time period. You retain the ability to search that data on key parameters, albeit at reduced granularity. For example, you may be storing metrics about CPU and disk usage which are recorded every minute; you could set up a rollup job to summarize this data into hourly buckets.

Define a rollup job

You can define a rollup job using a request like the following:

PUT _rollup/job/metrics
{
  "index_pattern": "metrics-*",
  "rollup_index": "metrics_rollup",
  "cron": "*/30 * * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "60m"
    },
    "terms": {
      "fields": [ "node", "environment"]
    }
  },
  "metrics": [
    {
      "field": "cpu",
      "metrics": [ "min", "max", "sum", "avg" ]
    },
    {
      "field": "disk",
      "metrics": [ "avg", "max" ]
    }
  ]
}
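
Once the job has been created, you can retrieve its configuration and check its status and stats at any time (using the job id from the example above):

GET _rollup/job/metrics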

Cron

Cron defines how often the rollup job runs. Depending on the load profile of your Elasticsearch cluster, you may prefer to spread the load across regular short runs with a high cron frequency, or run a single long job at 2am.
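
Note that Elasticsearch uses Quartz-style cron expressions, which include a leading seconds field. The job above runs every 30 seconds; a single nightly run at 2am would look like this:

"cron": "0 0 2 * * ?"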

Groups

Groups define the “buckets” into which your data should be summarized. Within groups:

1. Date Histogram

It is obligatory to define a date histogram.

You must define either a “calendar_interval” (e.g. 1M for one month) or a “fixed_interval” (e.g. 2h, 1d), but not both.

Bear in mind that monthly intervals do not produce evenly sized buckets, so it is generally preferable to use fixed intervals.
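
For example, a calendar-based variant of the date histogram group from the job above, using monthly buckets, would look like this:

"groups": {
  "date_histogram": {
    "field": "@timestamp",
    "calendar_interval": "1M"
  }
}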

2. Terms

It is also important to include any other “buckets” which you may want to use to classify your data. For example, you may want to aggregate your data into sub-buckets by node or environment.

Think carefully about which fields to include here: fields left out cannot be used to subdivide your data later. At the same time, avoid fields with high cardinality, since they will increase the size of the rolled-up index on disk.

Metrics

Define the metrics you want to analyze and the aggregations you require.

Possible values are min, max, sum, avg, and value_count. As with groups, it will not be possible to query metrics that are not included here.

Starting and stopping the rollup job

POST _rollup/job/<id>/_start   
POST _rollup/job/<id>/_stop
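
For the job defined above, the stop call can optionally block until the job has fully stopped:

POST _rollup/job/metrics/_start
POST _rollup/job/metrics/_stop?wait_for_completion=true&timeout=10s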

Searching rolled up data

You can search rolled up data using exactly the same syntax as you would for standard data. You can even search rolled up data combined with regular indices in a single request, and Elasticsearch will work out which combination of rolled up data and live indices to use to optimize the results. The only thing you need to do is use the dedicated rollup search endpoint, _rollup_search, instead of _search.

GET /metrics_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_cpu": {
      "max": {
        "field": "cpu"
      }
    }
  }
}
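
To combine live and rolled up data in a single request, list the raw indices alongside the rollup index in the request path; Elasticsearch will then answer the aggregation from whichever source covers each time range:

GET /metrics-*,metrics_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "max_cpu": {
      "max": {
        "field": "cpu"
      }
    }
  }
}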
