Elasticsearch match_only_text Field Type (For Storage Optimization)

Opster Expert Team - Gustavo

Sep-2021

Opster Team

March 2021


Overview

Elasticsearch 7.14 introduced the match_only_text field type, which can save up to 10% of disk space on logging datasets.

When defining mappings, a basic decision is whether to map a field as “keyword” or “text”, depending on how we plan to query it.

Keyword field

We use keyword fields when we want to look for exact matches and when we want to filter documents, such as showing the user a select box with options (e.g. status = “done”). This also works for operations like aggregations or sorting, where we already know the exact values beforehand. 

The advantage of these fields is that they are quick to search and use minimal storage space. 

Text field

On the other hand, text fields allow us to run full-text queries. These cover non-exact matches: finding individual words in a field that contains a sentence, case-insensitive search, and fuzzy search. Hits on text fields can have different scores, depending on how relevant each document is to the query. Keyword fields, by contrast, are mainly used for structured searches where the match is exact (a yes/no match), so all results are equally relevant.

The disadvantage of this type of field is that the searches run more slowly and more disk space is used than with keyword fields. 
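To make the difference concrete, here is a minimal Python sketch of the two matching styles. It is only a model: the analyze function is a rough stand-in for an Elasticsearch analyzer (lowercase, split on non-alphanumerics), and keyword_match and text_match are hypothetical helper names, not Elasticsearch APIs.

```python
import re

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_match(stored, query):
    # A keyword field indexes the whole value as one term: exact match only.
    return stored == query

def text_match(stored, query):
    # A text field indexes each analyzed token. A match query with the
    # default OR operator hits if any query token is present.
    return bool(set(analyze(stored)) & set(analyze(query)))

message = "nice product, great customer service"
print(keyword_match(message, "Customer Service"))  # False
print(text_match(message, "Customer Service"))     # True
print(keyword_match("done", "done"))               # True
```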

You can learn more about the limitations of each field here.

match_only_text 

Until now, there was no middle ground between keyword fields and text fields. For example, there was no way to run full-text queries on a field without also storing the data used to generate relevance scores. If we planned to sort the results by a criterion other than score, we still paid the extra time and disk space that relevance scoring requires.

The most common example for this is log messages. If we want to search through all the error log messages with the words “null pointer” (null or pointer), but then sort by log date, we don’t have any need for scoring (which logs are “a better match”). 

This is exactly when we would use the new match_only_text field type. 

The match_only_text field type does not store term frequencies or positions on disk, and it disables norms. Frequencies and norms are what Elasticsearch uses to compute relevance, and positions are what phrase queries rely on. Without them, every hit receives the same constant score of 1.0.
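As a rough illustration of what is being left out, the following Python sketch builds two toy inverted indexes over the same text: one that keeps positions (and therefore frequencies) per document, and one that keeps only document IDs. This is a simplified model, not Lucene's actual on-disk format, and all function names are hypothetical.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_full_postings(docs):
    """term -> {doc_id: [positions]}; the frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def build_docs_only_postings(docs):
    """term -> {doc_ids}; no frequencies and no positions are kept."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def constant_score_search(index, term):
    # With only doc IDs available there is nothing to rank by,
    # so every hit gets the same flat score.
    return [(doc_id, 1.0) for doc_id in sorted(index.get(term, set()))]

docs = {
    1: "customer service is bad, I will not order again",
    2: "nice product, great customer service",
}
full = build_full_postings(docs)
slim = build_docs_only_postings(docs)
print(full["service"])                         # {1: [1], 2: [4]}
print(slim["service"])                         # {1, 2}
print(constant_score_search(slim, "customer")) # [(1, 1.0), (2, 1.0)]
```

The docs-only structure carries strictly less data per term, which is where the space saving comes from in this model.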

Let’s see another example: 

We have an e-commerce system and have indexed thousands of product evaluations in this format: 

{
 "product_id": 1,
 "stars": 1,
 "message": "customer service is bad, I will not order again",
 "date": "2021-08-10"
}


{
 "product_id": 1,
 "stars": 5,
 "message": "nice product, great customer service",
 "date": "2021-08-13"
}

We want to know the evolution of satisfaction based on customer service, and then generate a nice Kibana line chart representing this trend. 

The key word is “evolution”: we will sort our results by date, not by relevance. Reviews have no “category” field, so the only way to retrieve the ones related to customer service is a full-text search, but as we saw, we don’t need relevance scoring. That’s why we’ll map the message field as match_only_text.

First we set the mappings:

PUT match-only-text-test
{
 "mappings": {
   "properties": {
     "@timestamp": {
       "type": "date"
     },
     "product_id": {
       "type": "long"
     },
     "stars": {
       "type": "byte"
     },
     "message": {
       "type": "match_only_text"
     },
     "date": {
       "type": "keyword"
     }
   }
 }
}
  • Note that the stars field is set as byte, the smallest integer field type.
  • Note that the date field is set as keyword because we only use it for display. For sorting we use @timestamp, which is generated automatically if you import the data using Kibana CSV Import.
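For comparison, the Elasticsearch reference describes match_only_text as storing data the same way as a text field with index_options set to docs and norms disabled. The sketch below writes the three variants of the message mapping as Python dicts so they can be compared side by side; the variable names are only illustrative.

```python
import json

# The mapping used in this article.
match_only_text_mapping = {"message": {"type": "match_only_text"}}

# A text field configured to store roughly the same index data:
# document IDs only, no norms.
similar_text_mapping = {
    "message": {"type": "text", "index_options": "docs", "norms": False}
}

# A plain text field. Its defaults are index_options "positions"
# and norms enabled, written out here for clarity.
default_text_mapping = {
    "message": {"type": "text", "index_options": "positions", "norms": True}
}

for name, mapping in [
    ("match_only_text", match_only_text_mapping),
    ("text, docs only", similar_text_mapping),
    ("text, defaults", default_text_mapping),
]:
    print(name, json.dumps(mapping))
```

The two are not interchangeable at query time: a text field without positions cannot serve phrase queries from its index, while match_only_text still supports them at a cost in speed.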

Now our index is optimized for this use case: we can run full-text queries on the message field while storing less index data than a regular text field would need.

Let’s query our data:

GET match-only-text-test/_search
{
 "query": {
   "match": {
     "message": "customer service"
   }
 }
}
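Since the goal in this example is to order results by date rather than by relevance, the same search can carry an explicit sort. The Python sketch below builds such a request body as a dict. build_search_body is a hypothetical helper; sort and the match query's operator parameter are standard search options. By default a match query uses OR between terms, so the optional flag switches it to AND.

```python
import json

def build_search_body(field, text, sort_field, require_all_terms=False):
    """Build a match query body sorted by sort_field in ascending order."""
    match = {"query": text}
    if require_all_terms:
        # Default is OR ("customer" or "service"); AND requires both terms.
        match["operator"] = "and"
    return {
        "query": {"match": {field: match}},
        "sort": [{sort_field: {"order": "asc"}}],
    }

body = build_search_body(
    "message", "customer service", "@timestamp", require_all_terms=True
)
# Send this as the body of GET match-only-text-test/_search
print(json.dumps(body, indent=1))
```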

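Queries that need positions, such as match_phrase, are supported by match_only_text, but the positions are not read from the index. Instead, Elasticsearch checks the _source document at query time to verify that the phrase matches, similar to how runtime fields are evaluated. These queries are therefore slower than on a regular text field: we trade query speed for disk space.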

Let’s look at a query example:

GET match-only-text-test/_search
{
 "query": {
   "match_phrase": {
     "message": "customer service"
   }
 }
}

Position data is needed to return only documents in which “customer” is immediately followed by “service”, with no word in between.
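The idea of checking a phrase without stored positions can be sketched in two steps: first find candidate documents that contain every term, then re-read each candidate's original text to confirm the terms are adjacent and in order. The Python below is a simplified model of that idea, not Elasticsearch's implementation. The function names are hypothetical, and document 3 is an invented example added to show a rejected candidate.

```python
import re

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_in_source(source_text, phrase):
    """Re-tokenize the original text and look for the phrase tokens in order."""
    tokens = analyze(source_text)
    wanted = analyze(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

def match_phrase(docs, phrase):
    wanted = set(analyze(phrase))
    # Step 1: cheap filter. A docs-only index can answer "contains all terms".
    candidates = [d for d, text in docs.items() if wanted <= set(analyze(text))]
    # Step 2: the slow part. Verify adjacency by re-reading each source text.
    return [d for d in candidates if phrase_in_source(docs[d], phrase)]

docs = {
    1: "customer service is bad, I will not order again",
    2: "nice product, great customer service",
    3: "service was slow but the customer next to me was happy",
}
print(match_phrase(docs, "customer service"))  # [1, 2]
```

Document 3 passes the first step because it contains both words, but fails verification because they are not adjacent. The second step is the extra work that makes positional queries slower on this field type.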

Summary

With match_only_text you can save up to 10% of disk space on logging datasets simply by choosing a different field type in your mappings. Just make sure you don’t care about the document score, or in other words, the order of relevance between the documents returned.

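Consider the following limitations when using the match_only_text field type:

  • All hits receive the same constant score, so results cannot be ranked by relevance.
  • Queries that need positions, such as match_phrase, run slower than on a regular text field.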

If you only care about matching text in an unstructured field and will later sort the data according to different parameters, then using match_only_text is likely your best option.

BONUS 

The Kibana chart from our example showing the evolution of customer service:

Customer service has improved: no 1-star evaluations since last year!
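The exact configuration behind the chart is not shown here, but a trend like this could be computed with a date_histogram aggregation and an average of the stars field per bucket. The Python sketch below builds one possible request body. The aggregation names per_month and avg_stars and the monthly interval are assumptions made for illustration.

```python
import json

def build_trend_body(text, interval="month"):
    """Average star rating per calendar interval for reviews matching text."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"match": {"message": text}},
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": interval,
                },
                "aggs": {"avg_stars": {"avg": {"field": "stars"}}},
            }
        },
    }

body = build_trend_body("customer service")
# Send this as the body of GET match-only-text-test/_search
print(json.dumps(body, indent=1))
```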


