Elasticsearch categorize_text Aggregation

By Opster Expert Team - April 2023

Updated: Jan 28, 2024

Overview

In version 7.16, Elastic introduced a new type of aggregation called categorize_text. It is part of Elasticsearch’s machine learning feature, and its goal is to help analyze and extract categories from semi-structured, usually machine-generated, text.

Typical use cases

You would typically use this aggregation on text fields that are machine-generated. So, the obvious examples here are log files generated either by services (databases, Kubernetes, Elasticsearch, Kafka, etc.) or by APM systems integrated with your application. A typical example might be as follows:  

User user123 logged on
User john.phillips logged on
User tim@dark.net logged on

All of these messages are different, but the categorize_text aggregation will identify that they are, in fact, the same basic message template and will categorize them as a single log type.

Because the text is machine-generated, it is semi-structured: its form is stable enough to be captured by a regular expression. So, although there are variable tokens (the user names in the example above, but they could equally be dates, keywords, IDs, etc.), they are ignored because they merely define one instance of that log category; what we want is to bucket all the instances into a single log type category (in this case, the log-on message).
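
To make this concrete, if you ran the aggregation over just the three log-on messages above, you would expect them to collapse into a single bucket shaped roughly like the sketch below (illustrative only, not real output; the exact key and regex depend on the analyzer used):

{
  "doc_count": 3,
  "key": "User logged on",
  "regex": ".*?User.+?logged.+?on.*?"
}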

Practical examples for using this aggregation include:

  • Any service log files
  • SQL statements logged by your application, its ORM solution, or maybe an integrated APM system

Examples of usage

Now that we know what the categorize_text aggregation is used for, let’s try to work with some examples so we can demonstrate how to use it. 

First, we need some data. If you want to follow along, you can get the two datasets we’re going to use in this demo here and here. Both were created purely for this demonstration: they don’t hold real data and were not produced by a real service or system. If you are curious how they were generated, we wrote a Python script that uses the Faker library to produce fake email addresses, IP addresses, user agents, and so on.

The first dataset simulates the log output for a hypothetical web server. You can see an excerpt of this log below:

2022-11-16 01:57:03 [WARNING] request: "OPTIONS /list HTTP/1.1", client: 77.149.21.145, server: 173.200.74.11, user: "vegarichard@yahoo.com"
2022-01-16 18:13:41 [WARNING] request: "PUT /app HTTP/1.1", client: 65.93.131.177, server: 30.140.127.246, user: "eddiepowers@lee-bell.net"
2022-02-17 13:15:57 [DEBUG] request: "PUT /tags/blog/list HTTP/1.1", client: 31.18.186.220, server: 19.68.96.117, user: "clayzachary@guzman-mccullough.info"
2022-12-11 17:39:28 [INFO] request: "CONNECT /categories/blog/wp-content HTTP/1.1", client: 188.141.121.107, server: 29.100.89.93, user: "melanie10@hotmail.com"
2022-07-18 18:01:44 [ERROR] request: "POST /explore/blog/search HTTP/1.1", client: 220.227.59.36, server: 213.121.121.201, user: "cervantescameron@williams.com"
2022-12-31 18:11:45 [TRACE] request: "DELETE /category/category HTTP/1.1", client: 219.159.28.17, server: 123.105.220.250, user: "msmith@yahoo.com"
2022-10-21 18:19:00 [TRACE] request: "POST /tags/main/wp-content HTTP/1.1", client: 192.207.200.25, server: 214.159.55.23, user: "kelly00@young.biz"
2022-05-26 21:42:16 [WARNING] request: "PATCH /posts/explore/posts HTTP/1.1", client: 142.11.65.77, server: 141.58.138.195, user: "jason48@monroe-sanders.biz"
2022-08-27 19:54:32 [WARNING] request: "CONNECT /categories/category HTTP/1.1", client: 202.60.57.6, server: 64.215.252.229, user: "jeremyross@ray-reed.com"
2022-11-01 06:16:33 [INFO] request: "POST /categories/tag HTTP/1.1", client: 124.161.65.165, server: 146.180.200.161, user: "tylerthomas@gmail.com"
2022-11-09 15:20:16 [TRACE] request: "DELETE /tag/search/explore HTTP/1.1", client: 80.239.200.243, server: 208.86.72.248, user: "johnsonstephanie@snyder.biz"
2022-11-13 03:31:54 [DEBUG] request: "POST /app/tags HTTP/1.1", client: 55.133.191.214, server: 88.6.59.40, user: "marysilva@gmail.com"

As you can see, the log is very simple. It merely records the date, the log level, the HTTP request, the client and server IP addresses, and an email address identifying the logged-in user.

It is a good example of semi-structured data on top of which you might be interested in running a categorize_text aggregation so you can easily create buckets for each log category.

The second dataset we’re going to use simulates the log output of the backend of a legacy eCommerce platform. Let’s imagine the company has some kind of in-house APM system that logs each SQL statement the backend component issues to the database, along with how long each one took. Take a look below at an excerpt that demonstrates the kind of logs this system generates:

{"log": "DELETE FROM carts WHERE user_id = 'd755d542-19cf-4e66-8ebd-1fa4854dcea8' AND pending = 'Y'", "duration": 221}
{"log": "DELETE FROM carts WHERE user_id = '1d559bb4-0825-4849-ad64-2c83840c5628' AND pending = 'Y'", "duration": 380}
{"log": "SELECT SUM(price) AS total FROM carts WHERE id = 'f8dc6fb8-88bc-44cc-9578-b6a202eaf4c7'", "duration": 347}
{"log": "UPDATE carts SET quantity = 10 WHERE id = 'aef6d8c3-89d0-4ee9-a637-88b4ec4437fa' AND product_id = 'product_id'", "duration": 336}
{"log": "SELECT id, description, price FROM products WHERE description LIKE '%electric guitar%'", "duration": 305}
{"log": "UPDATE products SET price = 49.48 WHERE product_id = '2a354c51-b395-4390-b2d5-13e4eff967f7'", "duration": 221}
{"log": "DELETE FROM carts WHERE user_id = '50f4243c-7c19-400f-acd8-c979175d4070' AND pending = 'Y'", "duration": 275}
{"log": "SELECT id, description, price FROM products WHERE description LIKE '%tennis racket%'", "duration": 328}
{"log": "UPDATE products SET price = 69.29 WHERE product_id = '7758fa48-aefd-44f4-9be1-5124e6dc0f67'", "duration": 235}
{"log": "UPDATE products SET price = 8.16 WHERE product_id = '31ab6131-7cf9-427e-b8e2-24a7eaa7e264'", "duration": 225}
{"log": "SELECT id, description, price FROM products WHERE description LIKE '%vegan cooking book%'", "duration": 281}
{"log": "UPDATE products SET price = 778.26 WHERE product_id = 'db6e2c6c-d1a6-4b78-83d2-4a45684d6e0d'", "duration": 282}
{"log": "SELECT id, description, price WHERE category_id = '9513eb64-8ddf-4266-b9af-cc5a140da751' ORDER BY price DESC", "duration": 305}
{"log": "SELECT AVG(price), MAX(price), MIN(price) FROM products WHERE description LIKE '%baseball cap%' OR category_id = '5b7a8459-8c34-40d7-a513-9b4c7b41ac92' OR total_sold > 1000 OR id IN (SELECT product_id FROM whishlists WHERE user_id = 'bcc4580a-eea0-4e48-bc8b-9427dde046d2')", "duration": 2114}
{"log": "UPDATE products SET price = 73.28 WHERE product_id = 'd6cfe7d7-265b-4e2f-b287-fa2c18675f9e'", "duration": 307}
{"log": "SELECT id, description, price WHERE category_id = '9358df7f-4576-4838-8d53-b26a4f3455db' ORDER BY price DESC", "duration": 237}
{"log": "SELECT AVG(price), MAX(price), MIN(price) FROM products WHERE description LIKE '%suitcase%' OR category_id = 'fa588c04-b91e-48a3-a5c0-cbd16c6f4eb7' OR total_sold > 1000 OR id IN (SELECT product_id FROM whishlists WHERE user_id = '702cd372-c885-426c-be8e-6388d3b9b7df')", "duration": 2363}
{"log": "SELECT id, description, price WHERE category_id = '04b04a7c-f47b-4801-82b6-9743fec3549b' ORDER BY price DESC", "duration": 388}
{"log": "UPDATE carts SET quantity = 10 WHERE id = '9d9bc00f-b263-4f8e-aa25-ceadd609a3d8' AND product_id = 'product_id'", "duration": 262}
{"log": "SELECT AVG(price), MAX(price), MIN(price) FROM products WHERE description LIKE '%tennis racket%' OR category_id = '791c1c09-fe5a-465c-9d3a-3a2dd9cce546' OR total_sold > 1000 OR id IN (SELECT product_id FROM whishlists WHERE user_id = '4fbf673c-99d6-4c9b-bc27-d09b2641aaac')", "duration": 1179}
{"log": "DELETE FROM carts WHERE user_id = '4b2e0c0c-a632-4041-a9bf-117b6fa61879' AND pending = 'Y'", "duration": 361}
{"log": "SELECT id, description, price FROM products WHERE description LIKE '%tennis racket%'", "duration": 292}
{"log": "SELECT AVG(price), MAX(price), MIN(price) FROM products WHERE description LIKE '%suitcase%' OR category_id = 'b3b60176-ae26-4012-95ee-dc642776c381' OR total_sold > 1000 OR id IN (SELECT product_id FROM whishlists WHERE user_id = '3ab45e25-7c22-439e-b628-3e548debd6a9')", "duration": 2567}
{"log": "SELECT id, description, price FROM products WHERE description LIKE '%baseball cap%'", "duration": 222}
{"log": "UPDATE carts SET quantity = 7 WHERE id = '2c3d53a3-9a07-439c-a3d0-23c12f5cc251' AND product_id = 'product_id'", "duration": 257}

The first thing we should do is index the data. The provided datasets were prepared so that you can download them and send them straight to an Elasticsearch cluster through the _bulk API. The first dataset will index its data into a webserver_logs index and the second one into an ecommerce_sql_logs index.
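
If you would rather index a few documents by hand instead of using the prepared files, a minimal _bulk request would look something like the sketch below. We are assuming here that each document simply carries the raw line in a log field, since that is the field the aggregation targets:

POST webserver_logs/_bulk
{ "index": {} }
{ "log": "2022-11-16 01:57:03 [WARNING] request: \"OPTIONS /list HTTP/1.1\", client: 77.149.21.145, server: 173.200.74.11, user: \"vegarichard@yahoo.com\"" }
{ "index": {} }
{ "log": "2022-01-16 18:13:41 [WARNING] request: \"PUT /app HTTP/1.1\", client: 65.93.131.177, server: 30.140.127.246, user: \"eddiepowers@lee-bell.net\"" }
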
With the query below you will ask Elasticsearch to run the categorize_text aggregation against the log field. Since we are not specifying any custom analyzer, the field will be analyzed with the default categorization analyzer. This analyzer includes a tokenizer called ml_standard, which is built specifically for general machine-generated text.

POST webserver_logs/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "log"
      }
    }
  }
}

And this is the result you’ll get:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1000,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "buckets": [
        {
          "doc_count": 121,
          "key": "request POST HTTP/1.1 client server user",
          "regex": """.*?request.+?POST.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 183
        },
        {
          "doc_count": 118,
          "key": "request OPTIONS HTTP/1.1 client server user",
          "regex": """.*?request.+?OPTIONS.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 190
        },
        {
          "doc_count": 116,
          "key": "request GET HTTP/1.1 client server user",
          "regex": """.*?request.+?GET.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 178
        },
        {
          "doc_count": 115,
          "key": "request HEAD HTTP/1.1 client server user",
          "regex": """.*?request.+?HEAD.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 185
        },
        {
          "doc_count": 113,
          "key": "request DELETE HTTP/1.1 client server user",
          "regex": """.*?request.+?DELETE.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 184
        },
        {
          "doc_count": 110,
          "key": "TRACE request TRACE HTTP/1.1 client server user",
          "regex": """.*?TRACE.+?request.+?TRACE.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 180
        },
        {
          "doc_count": 104,
          "key": "request CONNECT HTTP/1.1 client server user",
          "regex": """.*?request.+?CONNECT.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 190
        },
        {
          "doc_count": 104,
          "key": "request PUT HTTP/1.1 client server user",
          "regex": """.*?request.+?PUT.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 182
        },
        {
          "doc_count": 99,
          "key": "request PATCH HTTP/1.1 client server user",
          "regex": """.*?request.+?PATCH.+?HTTP/1\.1.+?client.+?server.+?user.*?""",
          "max_matching_length": 190
        }
      ]
    }
  }
}

You’ll notice that the aggregation resulted in a set of buckets, each one representing a category of log lines. All the variable tokens, such as IP addresses, emails, and so on, were ignored so the log lines could be reduced to a set of common tokens. 

Elasticsearch also tells you the regular expression that matches each category. In this first example, you can see that each category essentially represents one type of HTTP request. For the purpose of bucketing the log lines into categories, it doesn’t matter which IP the request came from or which user generated it: those variable parts are simply absorbed by the wildcard segments of the regular expression.

Now, let’s run the same query against our second example:

POST ecommerce_sql_logs/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "log"
      }
    }
  }
}

This is the result:

{
  "took": 31,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1000,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "buckets": [
        {
          "doc_count": 172,
          "key": "SELECT id description price FROM products WHERE description LIKE",
          "regex": ".*?SELECT.+?id.+?description.+?price.+?FROM.+?products.+?WHERE.+?description.+?LIKE.*?",
          "max_matching_length": 97
        },
        {
          "doc_count": 157,
          "key": "UPDATE products SET price WHERE product_id",
          "regex": ".*?UPDATE.+?products.+?SET.+?price.+?WHERE.+?product_id.*?",
          "max_matching_length": 104
        },
        {
          "doc_count": 155,
          "key": "SELECT id description price WHERE category_id ORDER BY price DESC",
          "regex": ".*?SELECT.+?id.+?description.+?price.+?WHERE.+?category_id.+?ORDER.+?BY.+?price.+?DESC.*?",
          "max_matching_length": 118
        },
        {
          "doc_count": 136,
          "key": "SELECT SUM price AS total FROM carts WHERE id",
          "regex": ".*?SELECT.+?SUM.+?price.+?AS.+?total.+?FROM.+?carts.+?WHERE.+?id.*?",
          "max_matching_length": 95
        },
        {
          "doc_count": 130,
          "key": "SELECT AVG price MAX price MIN price FROM products WHERE description LIKE OR category_id OR total_sold OR id IN SELECT product_id FROM whishlists WHERE user_id",
          "regex": ".*?SELECT.+?AVG.+?price.+?MAX.+?price.+?MIN.+?price.+?FROM.+?products.+?WHERE.+?description.+?LIKE.+?OR.+?category_id.+?OR.+?total_sold.+?OR.+?id.+?IN.+?SELECT.+?product_id.+?FROM.+?whishlists.+?WHERE.+?user_id.*?",
          "max_matching_length": 305
        },
        {
          "doc_count": 130,
          "key": "UPDATE carts SET quantity WHERE id AND product_id product_id",
          "regex": ".*?UPDATE.+?carts.+?SET.+?quantity.+?WHERE.+?id.+?AND.+?product_id.+?product_id.*?",
          "max_matching_length": 121
        },
        {
          "doc_count": 120,
          "key": "DELETE FROM carts WHERE user_id AND pending Y",
          "regex": ".*?DELETE.+?FROM.+?carts.+?WHERE.+?user_id.+?AND.+?pending.+?Y.*?",
          "max_matching_length": 99
        }
      ]
    }
  }
}

Once again, you can see how Elasticsearch has identified the basic message template for each category, along with the regular expression that accounts for the variable parts of the messages.

Now that we have seen what the categorize_text aggregation is capable of, let’s go a step further and add a stats sub-aggregation to the previous query:

POST ecommerce_sql_logs/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "log"
      },
      "aggs": {
        "stats": {
          "stats": {
            "field": "duration"
          }
        }
      }
    }
  }
}

Here’s the result:

{
  "took": 96,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1000,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "categories": {
      "buckets": [
        {
          "doc_count": 172,
          "key": "SELECT id description price FROM products WHERE description LIKE",
          "regex": ".*?SELECT.+?id.+?description.+?price.+?FROM.+?products.+?WHERE.+?description.+?LIKE.*?",
          "max_matching_length": 97,
          "stats": {
            "count": 172,
            "min": 201,
            "max": 398,
            "avg": 302.94767441860466,
            "sum": 52107
          }
        },
        {
          "doc_count": 157,
          "key": "UPDATE products SET price WHERE product_id",
          "regex": ".*?UPDATE.+?products.+?SET.+?price.+?WHERE.+?product_id.*?",
          "max_matching_length": 104,
          "stats": {
            "count": 157,
            "min": 200,
            "max": 399,
            "avg": 293.96178343949043,
            "sum": 46152
          }
        },
        {
          "doc_count": 155,
          "key": "SELECT id description price WHERE category_id ORDER BY price DESC",
          "regex": ".*?SELECT.+?id.+?description.+?price.+?WHERE.+?category_id.+?ORDER.+?BY.+?price.+?DESC.*?",
          "max_matching_length": 118,
          "stats": {
            "count": 155,
            "min": 200,
            "max": 400,
            "avg": 301.9225806451613,
            "sum": 46798
          }
        },
        {
          "doc_count": 136,
          "key": "SELECT SUM price AS total FROM carts WHERE id",
          "regex": ".*?SELECT.+?SUM.+?price.+?AS.+?total.+?FROM.+?carts.+?WHERE.+?id.*?",
          "max_matching_length": 95,
          "stats": {
            "count": 136,
            "min": 201,
            "max": 398,
            "avg": 302.2352941176471,
            "sum": 41104
          }
        },
        {
          "doc_count": 130,
          "key": "SELECT AVG price MAX price MIN price FROM products WHERE description LIKE OR category_id OR total_sold OR id IN SELECT product_id FROM whishlists WHERE user_id",
          "regex": ".*?SELECT.+?AVG.+?price.+?MAX.+?price.+?MIN.+?price.+?FROM.+?products.+?WHERE.+?description.+?LIKE.+?OR.+?category_id.+?OR.+?total_sold.+?OR.+?id.+?IN.+?SELECT.+?product_id.+?FROM.+?whishlists.+?WHERE.+?user_id.*?",
          "max_matching_length": 305,
          "stats": {
            "count": 130,
            "min": 1003,
            "max": 2972,
            "avg": 1922.2923076923078,
            "sum": 249898
          }
        },
        {
          "doc_count": 130,
          "key": "UPDATE carts SET quantity WHERE id AND product_id product_id",
          "regex": ".*?UPDATE.+?carts.+?SET.+?quantity.+?WHERE.+?id.+?AND.+?product_id.+?product_id.*?",
          "max_matching_length": 121,
          "stats": {
            "count": 130,
            "min": 202,
            "max": 400,
            "avg": 304.82307692307694,
            "sum": 39627
          }
        },
        {
          "doc_count": 120,
          "key": "DELETE FROM carts WHERE user_id AND pending Y",
          "regex": ".*?DELETE.+?FROM.+?carts.+?WHERE.+?user_id.+?AND.+?pending.+?Y.*?",
          "max_matching_length": 99,
          "stats": {
            "count": 120,
            "min": 201,
            "max": 395,
            "avg": 305.075,
            "sum": 36609
          }
        }
      ]
    }
  }
}

As you can see, the average duration for almost all categories is around 300 ms, but for the complex query in the middle of the output, it jumps to 1,922 ms. Now that we have found our performance bottleneck, we can take steps to improve that query.
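
As a next step, you could drill into that category and inspect the offending statements themselves. The sketch below shows one way to do it, assuming log is mapped as a regular text field and duration as a numeric field: reuse the category’s common tokens in a match query (with operator set to and) and keep only the slowest documents:

POST ecommerce_sql_logs/_search
{
  "size": 5,
  "query": {
    "bool": {
      "must": {
        "match": {
          "log": {
            "query": "SELECT AVG price MAX price MIN price FROM products",
            "operator": "and"
          }
        }
      },
      "filter": {
        "range": { "duration": { "gte": 1000 } }
      }
    }
  },
  "sort": [ { "duration": "desc" } ]
}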

Things to be aware of

This aggregation is part of Elasticsearch’s machine learning feature, so you cannot test it if you only have a basic license. You’ll need to start a trial period or use an appropriate subscription, for example on Elastic Cloud.

If you are running a self-managed cluster, you should be aware that the algorithm for this aggregation changed in version 8.3.

This aggregation re-analyzes all of your data at query time, so it is no surprise that it is very resource intensive. That’s why Elastic suggests running it as an async search. Also, to avoid passing all of your data through the analysis process, consider running the categorize_text aggregation on just a sample of your data by defining it as a child of the sampler or diversified sampler aggregations.
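
To give an idea of what that could look like, here is a sketch that combines the _async_search endpoint with a sampler aggregation (the shard_size of 500 is just an illustrative value; tune it to your data volume):

POST ecommerce_sql_logs/_async_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 500
      },
      "aggs": {
        "categories": {
          "categorize_text": {
            "field": "log"
          }
        }
      }
    }
  }
}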

Conclusion

Text categorization has a wide range of interesting use cases. You could produce histograms using the categorize_text aggregation with Kibana Vega or, even better, integrate it inside an anomaly detection machine learning job.  

In this article, we have only scratched the surface of what you can do with the categorize_text aggregation. There are many more advanced configurations you could set up, especially around the analysis process, but that’s something we’ll address in a future article covering more advanced aspects of the categorize_text aggregation.
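
As a small teaser, the sketch below shows the kind of tuning the aggregation accepts: a similarity_threshold to control how aggressively messages are merged into the same category, and categorization_filters to strip parts of the message before categorization (the bracketed-log-level pattern here is just an illustrative example):

POST webserver_logs/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "log",
        "similarity_threshold": 90,
        "categorization_filters": ["\\[.*?\\]"]
      }
    }
  }
}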
