Elasticsearch Aggregation

Opster Team

March 2021


Aggregations in Elasticsearch

What is an Elasticsearch aggregation?

In Elasticsearch, an aggregation summarizes a set of documents rather than returning the documents themselves. The aggregation framework collects data from the documents that match a search request and uses it to build summaries such as counts, groupings and statistics. The main types of aggregation are described below.

Types of aggregations

  • Bucket aggregations: Bucket aggregations create buckets, or sets of documents, based on the values of fields in the documents. When the aggregation runs, each document is placed in the bucket (or buckets) whose criteria it matches. In this way a set of invoices can be divided into one bucket per customer, system logs into “error”, “warning” and “info”, or CPU performance data into hourly buckets. The output is a list of buckets, each with a key and a count of documents. Examples of bucket aggregations include the Histogram, Range, Terms, Filter(s), Geo Distance and IP Range aggregations.
  • Metric aggregations: Metric aggregations perform mathematical calculations across a set of documents, usually on the values of a numeric field, such as count, sum, min, max or average. Metrics can be calculated at the top level, but are often more useful as a sub-aggregation that calculates values for each bucket of a bucket aggregation.
  • Pipeline aggregations: Pipeline aggregations work on the output of other aggregations rather than on document sets. Typical uses are finding the average number of documents per bucket, or sorting buckets by a value produced by a metric aggregation.
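
To make the three types concrete, here is a minimal pure-Python sketch that mimics each one on a handful of in-memory documents. The documents and field names (log_type, response_ms) are invented for illustration, and the code only imitates the results of a terms, avg and avg_bucket aggregation; it is not how Elasticsearch computes them.

```python
from collections import Counter
from statistics import mean

# Hypothetical log documents; the field names are illustrative only.
docs = [
    {"log_type": "error", "response_ms": 120},
    {"log_type": "info", "response_ms": 30},
    {"log_type": "error", "response_ms": 200},
    {"log_type": "warning", "response_ms": 75},
    {"log_type": "info", "response_ms": 45},
    {"log_type": "info", "response_ms": 10},
]

# Bucket aggregation (like "terms"): one bucket per distinct value,
# each with a key and a document count.
buckets = Counter(doc["log_type"] for doc in docs)

# Metric aggregation (like "avg"): one number calculated from a numeric field.
avg_response_ms = mean(doc["response_ms"] for doc in docs)

# Pipeline aggregation (like "avg_bucket"): calculated from the buckets
# produced above, not from the documents themselves.
avg_docs_per_bucket = mean(buckets.values())

print(dict(buckets))
print(avg_response_ms)
print(avg_docs_per_bucket)
```

Note that the pipeline step never touches docs: it only reads the bucket counts, which is what distinguishes it from the other two types.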

Aggregation syntax

GET /logs-000001/_search
{
  "aggs": {
    "errors": {
      "terms": {
        "field": "log_type"
      }
    }
  }
}

A search request with aggregations does two things. The “query” part, which is omitted above and therefore defaults to matching every document in the index, selects the set of documents to aggregate over (the set counted in the “hits” total, not just the documents actually returned). The “aggs” part, which sits at the same level of the JSON, splits those documents into buckets determined by the contents of the log_type field.
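
As a rough illustration of how the two parts sit side by side, the Python sketch below builds a request body that contains both a query and the same terms aggregation, then reads bucket counts out of a response. The message field, the search term, the sample_response data and the bucket_counts helper are all hypothetical, and the response is trimmed to the keys the helper needs.

```python
import json

# Request body: "query" and "aggs" are siblings at the top level.
request_body = {
    "size": 0,  # return no individual hits, only the aggregation results
    "query": {"match": {"message": "timeout"}},  # hypothetical field and term
    "aggs": {
        "errors": {"terms": {"field": "log_type"}},
    },
}

# Hypothetical, trimmed response in the shape a terms aggregation returns.
sample_response = {
    "hits": {"total": {"value": 6, "relation": "eq"}, "hits": []},
    "aggregations": {
        "errors": {
            "buckets": [
                {"key": "info", "doc_count": 3},
                {"key": "error", "doc_count": 2},
                {"key": "warning", "doc_count": 1},
            ]
        }
    },
}


def bucket_counts(response, agg_name):
    """Map each bucket key to its document count for one named aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


print(json.dumps(request_body, indent=2))
print(bucket_counts(sample_response, "errors"))
```

The aggregation name ("errors") chosen in the request is the key under which the buckets come back in the response.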

Nesting aggregations

Aggregations can be nested inside one another (this has nothing to do with nested fields) in order to divide buckets into sub-buckets or to calculate metrics for each bucket. The aggregation below separates exam results by the gender of the pupil and then calculates the average grade for each gender. The important thing to understand is that the inner aggregation is calculated on the documents in each bucket individually, not on the document set as a whole.

POST exam_results*/_search
{
  "size": 0,
  "aggs": {
    "genders": {
      "terms": {
        "field": "gender"
      },
      "aggs": {
        "avg_grade": {
          "avg": {
            "field": "grades"
          }
        }
      }
    }
  }
}
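
The per-bucket behaviour of the inner aggregation can be sketched in plain Python. The exam documents below are invented for illustration and reuse the gender and grades field names from the request above; the code only imitates the result of a terms aggregation with an avg sub-aggregation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exam documents using the field names from the request above.
docs = [
    {"gender": "female", "grades": 82},
    {"gender": "male", "grades": 74},
    {"gender": "female", "grades": 90},
    {"gender": "male", "grades": 66},
    {"gender": "female", "grades": 71},
]

# Outer "terms" aggregation: group the grades by the value of "gender".
by_gender = defaultdict(list)
for doc in docs:
    by_gender[doc["gender"]].append(doc["grades"])

# Inner "avg" aggregation: calculated separately inside each bucket.
genders = {
    key: {"doc_count": len(grades), "avg_grade": mean(grades)}
    for key, grades in by_gender.items()
}

# For contrast, the same metric at the top level covers every document.
overall_avg = mean(doc["grades"] for doc in docs)

print(genders)
print(overall_avg)
```

Each bucket gets its own avg_grade, and neither value has to match the overall average.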

Aggregation performance

Aggregations are typically carried out in memory and use a different data structure from a search query: a query is resolved from the inverted index, whereas an aggregation has to look up field values for each matching document. It is therefore important to consider the performance implications when constructing your aggregations. The most important considerations are:

Number of buckets

The number of buckets is controlled by, for example, the “size” parameter in a terms aggregation or the “calendar_interval” in a date histogram. Bear in mind that where bucket aggregations are nested at more than one level, the bucket counts multiply at each level of nesting.
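
The multiplication is easy to underestimate, so here is a small worked example in Python. The nesting and the sizes are hypothetical, and the worst_case_buckets helper is only an upper bound on the buckets at the deepest level, since real data may not fill every bucket.

```python
from math import prod


def worst_case_buckets(buckets_per_level):
    """Upper bound on buckets at the deepest level of nested bucket aggregations."""
    return prod(buckets_per_level)


# Hypothetical nesting: terms (size 10) > terms (size 50) >
# hourly date_histogram over one day (24 buckets).
levels = [10, 50, 24]

print(worst_case_buckets(levels))
```

Reducing the size of an outer level shrinks every level beneath it, so the outermost aggregation is usually the first place to look when trimming bucket counts.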

Number of documents

When running an aggregation, it is preferable where possible to narrow the query so that the aggregation runs only on the documents you are interested in, rather than using a match_all query. This reduces the memory required to run the aggregation.
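
One common way to restrict the document set is a filter on a time range. The sketch below builds such a request body as a Python dict; the @timestamp field name and the one-hour window are assumptions, so substitute whatever field and range suit your index.

```python
import json

# Aggregate only over documents from the last hour instead of the whole index.
request_body = {
    "size": 0,  # aggregation results only, no individual hits
    "query": {
        "bool": {
            "filter": [
                # Assumed timestamp field; "now-1h" is Elasticsearch date math.
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "errors": {"terms": {"field": "log_type"}},
    },
}

print(json.dumps(request_body, indent=2))
```

Placing the condition in the filter clause of a bool query means it narrows the document set without contributing to relevance scoring, which is not needed when only aggregation results are returned.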

Fielddata

When aggregating on string values, use keyword fields as a rule, not analyzed text. It is possible to aggregate on analyzed text by setting “fielddata”: true in the field mapping, but this is highly memory intensive and should be avoided if possible.
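
A common alternative to enabling fielddata is to map the text field with a keyword sub-field and aggregate on that. The sketch below shows one possible mapping and a matching aggregation as Python dicts; the message field, the raw sub-field name and the top_messages aggregation name are hypothetical.

```python
import json

# Mapping sketch: "message" stays analyzed text for full-text search and
# gains a keyword sub-field ("message.raw") for aggregations.
mapping_body = {
    "mappings": {
        "properties": {
            "message": {
                "type": "text",
                "fields": {
                    "raw": {"type": "keyword", "ignore_above": 256},
                },
            }
        }
    }
}

# Aggregate on the keyword sub-field rather than on the text field itself.
request_body = {
    "size": 0,
    "aggs": {
        "top_messages": {"terms": {"field": "message.raw"}},
    },
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(request_body, indent=2))
```

With this layout, searches can still use the analyzed message field while the terms aggregation uses the keyword sub-field, so fielddata never needs to be enabled.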

