Elasticsearch: An Overview of Source Filtering, Stored Fields, Fields and Docvalue Fields

Average Read Time

5 Mins

Opster Expert Team - Saskia

Nov-2021

Opster Team

October 2021

How to retrieve selected fields in your search results

Background

When you run a search request, the response by default contains an array of up to 10 hits, each of which includes the _source field. The _source field contains the JSON document that was passed at index time, i.e. the exact data that was ingested.

There are various options for retrieving fields that can help you boost performance or enable additional formatting options during the fetch phase. 

Below we will review how you can control the content of your search hits, including some important background information on how Elasticsearch stores documents internally. 

The methods we will review include:

  • Source filtering
  • Stored fields
  • Fields
  • Docvalue fields

Beyond these, if you’d like to learn more about how to use runtime fields, you can read this article on the subject.

Source filtering

In many cases the _source contains more fields than your application needs to consume. A common practice is to return only part of the JSON document by using source filtering.

The _source parameter accepts several values:

  • true (default): the entire document is returned in each hit
  • false: only the metadata (_index, _id, _score) is returned in each hit
  • A list of fields, to return a partial JSON document: ["field1", "field2", …] *
  • includes: a list of fields to include *
  • excludes: a list of fields to exclude (useful for (nested) objects) *

* Field names support wildcard patterns, which are especially useful for filtering (nested) objects.
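
To make the includes/excludes semantics more tangible, here is a minimal client-side sketch in Python. It is a simplified model and not how Elasticsearch implements source filtering: it assumes a pattern matches either a full dotted path or one of its parent objects, it uses fnmatch (which understands more than just *), and it treats arrays as plain values. The user object in the sample document is hypothetical.

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path, value


def matches(path, patterns):
    """True if the path, or one of its parent object paths, matches a pattern."""
    parts = path.split(".")
    candidates = [".".join(parts[: i + 1]) for i in range(len(parts))]
    return any(fnmatchcase(c, p) for c in candidates for p in patterns)


def filter_source(doc, includes=None, excludes=None):
    """Simplified model of includes/excludes filtering on a document."""
    result = {}
    for path, value in flatten(doc):
        if includes and not matches(path, includes):
            continue
        if excludes and matches(path, excludes):
            continue
        # Rebuild the nested structure for the surviving leaf.
        node = result
        *parents, leaf = path.split(".")
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = value
    return result


doc = {
    "text": "Demo",
    "number": 1,
    "date": "2021-10-20",
    # Hypothetical nested object, added only to show wildcard matching.
    "user": {"name": "example", "address": {"city": "Berlin"}},
}

print(filter_source(doc, includes=["text"]))
print(filter_source(doc, excludes=["user.*"]))
print(filter_source(doc, includes=["user.*"], excludes=["user.address"]))
```

In this model excludes win over includes, so the last call keeps user.name but drops the whole user.address object.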

Examples

Create an index with default mapping and store a document:

PUT source-demo/_doc/1
{
    "text": "Demo",
    "number": 1,
    "date": "2021-10-20"
}

Get _source by default:

GET source-demo/_search

Response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "text": "Demo",
                "number": 1,
                "date": "2021-10-20"
            }
        }]
    }
}

Disable _source:

GET source-demo/_search
{
  "_source": false
}

Response:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0
        }]
    }
}

Filter _source with a list of fields:

GET source-demo/_search
{
  "_source": ["text"]
}

Response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "text": "Demo"
            }
        }]
    }
}

Filter _source with excludes:

GET source-demo/_search
{
  "_source": {
    "excludes": ["text"]
  }
}

Response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "date": "2021-10-20",
                "number": 1
            }
        }]
    }
}

However, using the _source parameter isn’t always the best choice in terms of performance or features. There are a few other options that you may want to consider, as outlined below.

Performance optimization

When you want to filter the fields to return, the JSON document needs to be parsed, filtered and sent back to your client as part of the response. 

Something needs to parse the JSON, whether it’s your application or Elasticsearch. 

If you let Elasticsearch parse and filter the JSON document, it’s faster and easier to scale, and you save network traffic. If your documents are very large or deeply nested, however, this JSON parsing itself becomes a cost worth optimizing.

For this case there is a better solution: stored fields.

Stored fields

If you want to save the time that Elasticsearch spends on JSON parsing and filtering, you can use stored fields. Note that the name and the first paragraph of the official documentation might be a bit misleading: the original values are already kept as part of _source, and the store mapping parameter only controls whether a field is additionally stored on its own.

Storing fields allows you to load and display only the fields that you need, and avoids loading the entire _source for every hit. If you really need to optimize performance at retrieval time (the fetch phase of the search request), stored fields are an option. Each stored field is an additional Lucene data structure, so it increases disk usage and partially duplicates the JSON data that is already stored in the _source field.

As always, it depends on your application requirements – you need to determine whether quick retrieval matters more than fast and “slim” storage, or vice versa. 

To enable stored fields you need to use an additional mapping parameter for each field you want to store. 

This feature is especially useful for nested documents and helps to boost performance at search time. 

Please note that the values are returned as an array. 

Examples

Create an index:

PUT stored-fields-demo
{
  "mappings": {
    "properties": {
      "nested": {
        "type": "nested",
        "properties": {
          "stored-nested-field": {
            "type": "keyword",
            "store": true
          }
        }
      },
      "object" : {
        "type" : "object",
        "properties": {
          "stored-object-field" : {
            "type" : "keyword",
            "store" : true
          }
        }
      }
    }
  }
}

Add a document:

PUT stored-fields-demo/_doc/1
{
  "nested" : {
    "stored-nested-field" : "Test"
  },
  "object" : {
    "stored-object-field" : "Example"
  }
}

Disable _source, return all stored fields:

GET stored-fields-demo/_search
{
  "_source": false, 
  "stored_fields": ["*"]
}

Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stored-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "object.stored-object-field" : [
            "Example"
          ]
        }
      }
    ]
  }
}

Nested stored fields can only be returned within inner_hits:

GET stored-fields-demo/_search
{
  "_source": false, 
  "stored_fields": ["object.stored-object-field"],
  "query": {
    "nested": {
      "path": "nested",
      "query": {
        "match_all": {}
      },
      "inner_hits": {
        "stored_fields" : ["nested.*"]
      }
    }
  }
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stored-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "object.stored-object-field" : [
            "Example"
          ]
        },
        "inner_hits" : {
          "nested" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 1.0,
              "hits" : [
                {
                  "_index" : "stored-fields-demo",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "nested",
                    "offset" : 0
                  },
                  "_score" : 1.0,
                  "fields" : {
                    "nested.stored-nested-field" : [
                      "Test"
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
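
Because stored fields always come back as arrays, and nested ones only inside inner_hits, client code usually has to unwrap them. The following Python sketch shows one possible way to do that for a hit shaped like the response above. The helper names unwrap_fields and collect_fields are hypothetical, and the hit is a trimmed copy of that response.

```python
def unwrap_fields(hit):
    """Return a hit's `fields`, reducing single-element arrays to scalars."""
    flat = {}
    for name, values in hit.get("fields", {}).items():
        flat[name] = values[0] if len(values) == 1 else values
    return flat


def collect_fields(hit):
    """Combine top-level stored fields with the fields of each inner hit."""
    row = unwrap_fields(hit)
    for name, inner in hit.get("inner_hits", {}).items():
        # One entry per matching nested object, in the order returned.
        row[name] = [unwrap_fields(h) for h in inner["hits"]["hits"]]
    return row


# Trimmed copy of the hit from the response above.
hit = {
    "_index": "stored-fields-demo",
    "_id": "1",
    "fields": {"object.stored-object-field": ["Example"]},
    "inner_hits": {
        "nested": {
            "hits": {
                "hits": [
                    {
                        "_nested": {"field": "nested", "offset": 0},
                        "fields": {"nested.stored-nested-field": ["Test"]},
                    }
                ]
            }
        }
    },
}

print(collect_fields(hit))
```

Keeping the inner hits as a list avoids overwriting values when a document contains several nested objects.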

Fields

If performance is not an issue but you want to make sure that the returned values have a uniform format according to the mapping, using fields instead of _source is a good option. 

The fields option was introduced in Elasticsearch version 7.10. 

It checks the mapping of each field and uses that information for reformatting. Elasticsearch is fairly lenient and will accept string values such as “1”, “12” or “123” for a numeric field when the mapping is defined accordingly.

This feature is called coercion (the coerce mapping parameter) and is enabled by default. Some of the data you ingested might have such a “wrong” datatype in _source; fields parses those values according to the mapping type and returns them in a uniform way.

The best feature is that you can change the format of dates and geo types on the fly at retrieval time.

This means that no matter how dates were stored in the source, you can just change the format as needed, and the same goes for geodata. 

Please note that the values are returned as an array. 

Examples

Create the mapping:

PUT fields-api-demo
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "german"
      },
      "location": {
        "type": "geo_point"
      },
      "date": {
        "type": "date"
      },
      "number": {
        "type": "long"
      }
    }
  }
}

Store a document:

POST fields-api-demo/_doc
{
  "content": "Alles ist möglich",
  "date": "2021-10-13",
  "location": "ezs42",
  "number": "123"
}

Retrieve fields:

GET fields-api-demo/_search
{
  "fields": [
    "*"
  ]
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "fields-api-demo",
        "_type" : "_doc",
        "_id" : "IacYeXwBJ36HxOJriZLi",
        "_score" : 1.0,
        "_source" : {
          "content" : "Alles ist möglich",
          "date" : "2021-10-13",
          "location" : "ezs42",
          "number" : "123"
        },
        "fields" : {
          "date" : [
            "2021-10-13T00:00:00.000Z"
          ],
          "number" : [
            123
          ],
          "location" : [
            {
              "coordinates" : [
                -5.625,
                42.5830078125
              ],
              "type" : "Point"
            }
          ],
          "content" : [
            "Alles ist möglich"
          ]
        }
      }
    ]
  }
}
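
To see exactly which values fields has normalized, you can diff the two sections of a hit. The Python sketch below does this for the hit shown above. The function name diff_source_and_fields is hypothetical, and the comparison only covers top-level fields.

```python
def diff_source_and_fields(hit):
    """Return {field: (source_value, fields_values)} where the two differ."""
    source = hit.get("_source", {})
    changed = {}
    for name, values in hit.get("fields", {}).items():
        original = source.get(name)
        # `fields` always returns arrays, so compare against a one-item list.
        if values != [original]:
            changed[name] = (original, values)
    return changed


# Copy of the relevant parts of the hit from the response above.
hit = {
    "_source": {
        "content": "Alles ist möglich",
        "date": "2021-10-13",
        "location": "ezs42",
        "number": "123",
    },
    "fields": {
        "date": ["2021-10-13T00:00:00.000Z"],
        "number": [123],
        "location": [
            {"coordinates": [-5.625, 42.5830078125], "type": "Point"}
        ],
        "content": ["Alles ist möglich"],
    },
}

for name, (before, after) in diff_source_and_fields(hit).items():
    print(f"{name}: {before!r} -> {after!r}")
```

The content field does not show up in the diff because a text value is returned unchanged; date, number and location are reported since their representation follows the mapping.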

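Compare _source and fields in the response above: the date, number and location values in fields are returned in the format defined by the mapping.

Reformat the date: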

GET fields-api-demo/_search?filter_path=hits.hits
{
  "_source": false, 
  "fields": [
    {
      "field": "date",
      "format": "dd.MM.yyyy"
    }
  ]
}

Response:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "fields-api-demo",
        "_type" : "_doc",
        "_id" : "IacYeXwBJ36HxOJriZLi",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "13.10.2021"
          ]
        }
      }
    ]
  }
}
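
For comparison, this is roughly the work your application would otherwise do itself. The Python sketch below reformats the default ISO date string on the client side. The strftime pattern %d.%m.%Y is assumed here to be the Python counterpart of the day.month.year format requested above, and the function name is hypothetical.

```python
from datetime import datetime


def reformat_iso_date(value, pattern="%d.%m.%Y"):
    """Parse a date in the default ISO output format and reformat it."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime(pattern)


print(reformat_iso_date("2021-10-13T00:00:00.000Z"))  # prints 13.10.2021
```

Letting Elasticsearch apply the format during the fetch phase keeps this conversion out of every client that reads the index.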

docvalue_fields

Another way to make your search requests more efficient and to completely avoid loading the _source is to use docvalue_fields.

doc_values are stored on disk and hold the same values as _source. However, they are served through the file system cache, so they live outside the Java heap. This data structure is optimized for sorting and aggregations.

You can leverage doc_values to retrieve field values, which completely circumvents _source. Unfortunately, this cannot be used for the types text and annotated_text.

Please note that the values are returned as an array. 

Examples

Create an index with default mappings, and add a document:

PUT docvalue-fields-demo/_doc/1
{
  "number" : 1,
  "text" : "Lorem ipsum...",
  "date" : "2021-10-21"
}

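Retrieve docvalue_fields (the text field is requested through its keyword sub-field, which dynamic mapping creates by default):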

GET docvalue-fields-demo/_search
{
  "_source": false,
  "docvalue_fields": ["date", "number", "text.keyword"]
}

Response:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "docvalue-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "2021-10-21T00:00:00.000Z"
          ],
          "number" : [
            1
          ],
          "text.keyword" : [
            "Lorem ipsum..."
          ]
        }
      }
    ]
  }
}
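
Since text and annotated_text fields have no doc_values, it can help to derive the docvalue_fields list from the mapping instead of hard-coding it. The Python sketch below is one way to do that for a flat mapping. The function name is hypothetical, and the properties dict is an assumption about what dynamic mapping typically generates for the demo document.

```python
UNSUPPORTED = {"text", "annotated_text"}
CONTAINERS = {"object", "nested"}


def docvalue_field_names(properties):
    """Pick field names from a flat mapping that docvalue_fields can return."""
    names = []
    for name, spec in properties.items():
        field_type = spec.get("type")
        if field_type is None or field_type in CONTAINERS:
            continue  # object and nested fields are out of scope here
        if field_type not in UNSUPPORTED:
            if spec.get("doc_values", True):
                names.append(name)
            continue
        # text-like field: fall back to sub-fields that do have doc_values
        for sub_name, sub_spec in spec.get("fields", {}).items():
            if sub_spec.get("type") not in UNSUPPORTED:
                names.append(f"{name}.{sub_name}")
    return names


# Assumed result of dynamic mapping for the demo document.
properties = {
    "number": {"type": "long"},
    "text": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "date": {"type": "date"},
}

body = {"_source": False, "docvalue_fields": docvalue_field_names(properties)}
print(body)
```

A text field without a suitable sub-field is simply skipped, so the resulting request never asks for a field that has no doc_values.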

Summary – all options in a nutshell

_source: 

  • Default, easy to use and flexible.
  • Returns all content exactly as you indexed it.
  • Requires JSON parsing, which can be costly for large or deeply nested documents.

stored_fields: 

  • Requires additional disk space.
  • Requires additional mapping parameters.
  • Boosts read-performance for deeply nested fields.
  • The request needed to return nested fields is rather verbose (inner_hits).
  • Values are returned as an array.

fields:

  • Returns uniform values consistent with the mapping.
  • Better formatting options for date and geo-data.
  • Values are returned as an array.

docvalue_fields:

  • doc_values are enabled by default for most data types.
  • Cannot be used to return text or annotated_text.
  • Loads data from the file system cache.
  • Values are returned as an array.
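
The four options differ only slightly in the request body. As a compact recap, the following Python sketch builds the body for each of them. The function name search_body is hypothetical, and the sketch assumes you want _source switched off whenever one of the other three options is used.

```python
import json

FIELD_OPTIONS = ("stored_fields", "fields", "docvalue_fields")


def search_body(option, field_names):
    """Build a search request body for one of the four retrieval options."""
    if option == "_source":
        # Source filtering: return a partial JSON document.
        return {"_source": list(field_names)}
    if option in FIELD_OPTIONS:
        # The other options return values under "fields" in each hit.
        return {"_source": False, option: list(field_names)}
    raise ValueError(f"unknown option: {option}")


for option in ("_source",) + FIELD_OPTIONS:
    print(option, json.dumps(search_body(option, ["date", "number"])))
```

Each body can be sent as-is to the _search endpoint of an index. Keep in mind that stored_fields only returns fields mapped with store set to true.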
