OpenSearch Source Filtering, Stored Fields, Fields and Docvalue Fields

By Opster Expert Team - Saskia

Updated: Jun 28, 2023


How to retrieve selected fields in your search results

Background

By default, the response to a search request contains an array of up to 10 hits, each of which includes the _source field. The _source field contains the JSON document that was passed at index time, i.e. the exact data that was ingested. 

There are various options for retrieving fields that can help you boost performance or enable additional formatting options during the fetch phase. 

Below we will review how you can control the content of your search hits, including some important background information on how OpenSearch stores documents internally. 

The methods we will review include:

  • Source filtering
  • Stored fields
  • Fields
  • Docvalue fields

Source filtering

In many cases the _source contains more fields than your application needs. A very common practice is to return only a partial JSON document by using source filtering.

_source accepts several values:

  • true (default): the entire document is returned in each hit
  • false: only the metadata (_index, _id, _score) is returned for each hit
  • A list of fields to return a partial JSON document: [<field_1>, <field_4>, … ] *
  • includes: a list of fields to include *
  • excludes: a list of fields to exclude (useful for (nested) objects) *

* Field names support wildcard patterns, which is especially useful for filtering (nested) objects.
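
The include and exclude semantics above can be approximated in a few lines of Python. This is a simplified client-side model for illustration only, not OpenSearch's implementation: the helper names flatten, matches and filter_source are hypothetical, patterns are matched with Python's fnmatch (which also interprets ? and [...]), and field names that themselves contain dots are not handled.

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path, value


def matches(path, patterns):
    """True if the path, or any parent object on the way to it, matches a pattern."""
    parts = path.split(".")
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(fnmatchcase(p, pat) for p in prefixes for pat in patterns)


def filter_source(doc, includes=None, excludes=None):
    """Return a partial copy of doc; in this model excludes win over includes."""
    includes = includes or ["*"]  # no includes given: keep everything
    excludes = excludes or []
    result = {}
    for path, value in flatten(doc):
        if not matches(path, includes) or matches(path, excludes):
            continue
        # Rebuild the nested structure for the surviving leaf.
        node = result
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


doc = {"text": "Demo", "number": 1, "date": "2021-10-20"}
print(filter_source(doc, includes=["text"]))
print(filter_source(doc, excludes=["text"]))
```

The two calls mirror the list form and the excludes form of _source that the examples in this section send to the cluster.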

Examples

Create an index with default mapping and store a document:

PUT source-demo/_doc/1
{
    "text": "Demo",
    "number": 1,
    "date": "2021-10-20"
}

Get _source by default:

GET source-demo/_search
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "text": "Demo",
                "number": 1,
                "date": "2021-10-20"
            }
        }]
    }
}

Disable _source:

GET source-demo/_search
{
  "_source": false
}
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0
        }]
    }
}

Filter _source:

GET source-demo/_search
{
  "_source": ["text"]
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "text": "Demo"
            }
        }]
    }
}

Filter _source by excluding fields:

GET source-demo/_search
{
  "_source": {
	"excludes": ["text"]
  }
}
{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [{
            "_index": "source-demo",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
                "date": "2021-10-20",
                "number": 1
            }
        }]
    }
}

However, using the _source parameter isn’t always the best choice in terms of performance or features. There are a few alternatives that you may want to consider, as outlined below.

Performance optimization

When you want to filter the fields to return, the JSON document needs to be parsed, filtered and sent back to your client as part of the response. 

Something needs to parse the JSON, whether it’s your application or OpenSearch. 

Letting OpenSearch parse and filter the JSON document is faster and easier to scale, and it saves network traffic. If your documents are very large or deeply nested, however, this parsing step itself becomes a cost that can be optimized. 

For this case there is a better solution: stored fields.

Stored fields

If you want to save the time that OpenSearch uses for JSON parsing and filtering you can use stored fields. Note that the name and the first paragraph of the official documentation might be a bit misleading. 

Storing fields allows you to load and display only the fields that you need without loading the entire _source for every hit, which helps when you really need to optimize performance at retrieval time (the fetch phase of the search request). Stored fields are an additional Lucene data structure: they use extra disk space and partially duplicate the JSON data that is already stored in the _source field. 

As always, it depends on your application requirements – you need to determine whether quick retrieval matters more than fast and “slim” storage, or vice versa. 

To enable stored fields you need to use an additional mapping parameter for each field you want to store. 

This feature is especially useful for nested documents and helps to boost performance at search time. 

Please note that the values are returned as an array. 
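
Because every value arrives wrapped in an array, client code often unwraps single-valued fields before using them. A minimal sketch, assuming a hit dictionary that carries the fields section of a search response; unwrap_fields is a hypothetical helper name, not part of any client library.

```python
def unwrap_fields(hit):
    """Return the hit's `fields` section with single-element lists unwrapped.

    Multi-valued fields are left as lists so that no data is lost.
    """
    fields = hit.get("fields", {})
    return {
        name: values[0] if len(values) == 1 else values
        for name, values in fields.items()
    }


hit = {
    "_index": "stored-fields-demo",
    "_id": "1",
    "fields": {"object.stored-object-field": ["Example"]},
}
print(unwrap_fields(hit))
```

Whether unwrapping is appropriate depends on your data: a field that is single-valued in one document may hold several values in another.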

Examples

Create an index:

PUT stored-fields-demo
{
  "mappings": {
    "properties": {
      "nested": {
        "type": "nested",
        "properties": {
          "stored-nested-field": {
            "type": "keyword",
            "store": true
          }
        }
      },
      "object" : {
        "type" : "object",
        "properties": {
          "stored-object-field" : {
            "type" : "keyword",
            "store" : true
          }
        }
      }
    }
  }
}

Add a document:

PUT stored-fields-demo/_doc/1
{
  "nested" : {
    "stored-nested-field" : "Test"
  },
  "object" : {
    "stored-object-field" : "Example"
  }
}

Disable _source and request all stored fields (the nested stored field is missing from the response, as explained in the next example):

GET stored-fields-demo/_search
{
  "_source": false, 
  "stored_fields": ["*"]
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stored-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "object.stored-object-field" : [
            "Example"
          ]
        }
      }
    ]
  }
}

Nested stored fields can only be returned within inner_hits:

GET stored-fields-demo/_search
{
  "_source": false, 
  "stored_fields": ["object.stored-object-field"],
  "query": {
    "nested": {
      "path": "nested",
      "query": {
        "match_all": {}
      },
      "inner_hits": {
        "stored_fields" : ["nested.*"]
      }
    }
  }
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "stored-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "object.stored-object-field" : [
            "Example"
          ]
        },
        "inner_hits" : {
          "nested" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 1.0,
              "hits" : [
                {
                  "_index" : "stored-fields-demo",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "nested",
                    "offset" : 0
                  },
                  "_score" : 1.0,
                  "fields" : {
                    "nested.stored-nested-field" : [
                      "Test"
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
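
On the client side, the stored fields of a hit like this one end up in two places: the top-level fields section and the fields section of each inner hit. The sketch below merges them into one dictionary. It assumes the response shape shown above; collect_stored_fields is a hypothetical helper name.

```python
def collect_stored_fields(hit):
    """Merge top-level stored fields with those returned inside inner_hits."""
    # Copy the lists so the original response is not modified.
    merged = {name: list(values) for name, values in hit.get("fields", {}).items()}
    for inner in hit.get("inner_hits", {}).values():
        for inner_hit in inner["hits"]["hits"]:
            for name, values in inner_hit.get("fields", {}).items():
                merged.setdefault(name, []).extend(values)
    return merged


# Trimmed copy of the hit from the response above.
hit = {
    "_id": "1",
    "fields": {"object.stored-object-field": ["Example"]},
    "inner_hits": {
        "nested": {
            "hits": {
                "hits": [
                    {
                        "_nested": {"field": "nested", "offset": 0},
                        "fields": {"nested.stored-nested-field": ["Test"]},
                    }
                ]
            }
        }
    },
}
print(collect_stored_fields(hit))
```

Values from several nested objects are appended to the same list, so the position of each nested object is lost; keep the _nested offset if you need it.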

Fields

If performance is not an issue but you want to make sure that the returned values have a uniform format according to the mapping, using fields instead of _source is a good option. 

It checks the mapping of the field and uses that information for reformatting. OpenSearch is pretty lenient and will accept string values like “1”, “12” or “123” for a field that is mapped as a numeric type, indexing them as numbers. 

This feature is called coerce and is enabled by default. Some of the data you ingested might therefore have the “wrong” datatype in _source; fields parses those values according to the mapping type and returns them in a uniform way. 

The best feature is that you can change the format of dates and geo types on the fly at retrieval time. 

This means that no matter how dates were stored in the source, you can just change the format as needed, and the same goes for geodata. 

Please note that the values are returned as an array. 

Examples

Create the mapping:

PUT fields-api-demo
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "german"
      },
      "location": {
        "type": "geo_point"
      },
      "date": {
        "type": "date"
      },
      "number": {
        "type": "long"
      }
    }
  }
}

Store a document:

POST fields-api-demo/_doc
{
  "content": "Alles ist möglich",
  "date": "2021-10-13",
  "location": "ezs42",
  "number": "123"
}

Retrieve fields:

GET fields-api-demo/_search
{
  "fields": [
    "*"
  ]
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "fields-api-demo",
        "_type" : "_doc",
        "_id" : "IacYeXwBJ36HxOJriZLi",
        "_score" : 1.0,
        "_source" : {
          "content" : "Alles ist möglich",
          "date" : "2021-10-13",
          "location" : "ezs42",
          "number" : "123"
        },
        "fields" : {
          "date" : [
            "2021-10-13T00:00:00.000Z"
          ],
          "number" : [
            123
          ],
          "location" : [
            {
              "coordinates" : [
                -5.625,
                42.5830078125
              ],
              "type" : "Point"
            }
          ],
          "content" : [
            "Alles ist möglich"
          ]
        }
      }
    ]
  }
}
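
The difference between _source and fields in a response like this one can be pictured with a small client-side model. This is only an illustration of the idea, not how OpenSearch implements it: the MAPPING dictionary and the helper names are hypothetical, only integer strings and yyyy-MM-dd dates are handled, and the geo_point field is left out.

```python
from datetime import datetime, timezone

# Hypothetical, simplified mapping: field name -> mapped type.
MAPPING = {"content": "text", "date": "date", "number": "long"}


def normalize(field, raw):
    """Convert a raw _source value to the form implied by the mapping."""
    field_type = MAPPING[field]
    if field_type == "long":
        return int(raw)  # "123" -> 123
    if field_type == "date":
        parsed = datetime.strptime(raw, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        millis = parsed.microsecond // 1000
        return parsed.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
    return raw


def fields_view(source):
    """Build a `fields`-like section: normalized values, each wrapped in a list."""
    return {
        name: [normalize(name, value)]
        for name, value in source.items()
        if name in MAPPING
    }


source = {"content": "Alles ist möglich", "date": "2021-10-13", "number": "123"}
print(fields_view(source))
```

The string "123" comes back as a number and the date gains a time component, which matches what the fields section of the response shows for those two values.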

Compare _source and fields in the response above. To reformat the date:

GET fields-api-demo/_search?filter_path=hits.hits
{
  "_source": false, 
  "fields": [
    {
      "field": "date",
      "format": "dd.MM.yyyy"
    }
  ]
}
{
  "hits" : {
    "hits" : [
      {
        "_index" : "fields-api-demo",
        "_type" : "_doc",
        "_id" : "IacYeXwBJ36HxOJriZLi",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "13.10.2021"
          ]
        }
      }
    ]
  }
}
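
If you build such requests in application code, the body can be assembled from a list of field names and optional formats. A minimal sketch, assuming you serialize the body yourself and send it with whatever HTTP or OpenSearch client you already use; build_fields_request is a hypothetical helper name, and the pattern uses lowercase yyyy for the calendar year.

```python
import json


def build_fields_request(*specs, include_source=False):
    """Build a search body that retrieves values via the `fields` option.

    Each spec is either a field name or a (field name, format) tuple.
    """
    fields = []
    for spec in specs:
        if isinstance(spec, tuple):
            name, fmt = spec
            fields.append({"field": name, "format": fmt})
        else:
            fields.append(spec)
    return {"_source": include_source, "fields": fields}


body = build_fields_request(("date", "dd.MM.yyyy"), "number")
print(json.dumps(body, indent=2))
```

Plain field names and field/format objects can be mixed in the same fields array, so only the fields that need reformatting carry a format.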

Docvalue fields

Another way to make your search requests more efficient and to completely avoid loading the _source is to use docvalue_fields. 

doc_values are stored on disk in a column-oriented structure that is built at index time and holds the same values as _source. They are served through the file system cache, so they live outside the Java heap. This data structure is optimized for sorting and aggregations. 

You can leverage doc_values to retrieve field values, which completely circumvents _source. Unfortunately, this cannot be used for the types text and annotated_text. 

Please note that the values are returned as an array. 

Examples

Create an index with default mappings, and add a document:

PUT docvalue-fields-demo/_doc/1
{
  "number" : 1,
  "text" : "Lorem ipsum...",
  "date" : "2021-10-21"
}

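Retrieve docvalue_fields. Because text fields have no doc_values, the request uses the text.keyword sub-field that the default dynamic mapping creates for strings: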

GET docvalue-fields-demo/_search
{
  "_source": false,
  "docvalue_fields": ["date", "number", "text.keyword"]
}
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "docvalue-fields-demo",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "date" : [
            "2021-10-21T00:00:00.000Z"
          ],
          "number" : [
            1
          ],
          "text.keyword" : [
            "Lorem ipsum..."
          ]
        }
      }
    ]
  }
}
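
Choosing which names to put into docvalue_fields can be automated from the index mapping. The sketch below walks the properties section of a mapping and keeps only fields that can be served from doc_values, falling back to sub-fields such as text.keyword. It is an assumption-laden illustration: docvalue_candidates is a hypothetical helper, the excluded types are just the two named in this article, and object and nested fields are skipped for brevity.

```python
# Types named in this article as not retrievable via docvalue_fields.
NO_DOC_VALUES = {"text", "annotated_text"}


def docvalue_candidates(properties, prefix=""):
    """List field paths from a mapping's `properties` that can use doc_values."""
    names = []
    for name, spec in properties.items():
        path = prefix + name
        field_type = spec.get("type")
        if field_type is None or field_type in ("object", "nested"):
            continue  # not handled in this sketch
        if field_type not in NO_DOC_VALUES and spec.get("doc_values", True):
            names.append(path)
        # Multi-fields (e.g. text.keyword) live under the "fields" key.
        names += docvalue_candidates(spec.get("fields", {}), path + ".")
    return names


# Shape of the default dynamic mapping for the demo document.
properties = {
    "date": {"type": "date"},
    "number": {"type": "long"},
    "text": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
}
print(docvalue_candidates(properties))
```

Note that a keyword sub-field with ignore_above does not index longer strings, so for such values nothing is available from doc_values.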

Summary – all options in a nutshell

_source: 

  • Default, easy to use and flexible.
  • Returns all content exactly as you indexed it.
  • Filtering requires JSON parsing.

stored_fields: 

  • Requires additional disk space.
  • Requires additional mapping parameters.
  • Boosts read-performance for deeply nested fields.
  • The request needed to return nested fields is rather verbose.
  • Values are returned as an array.

fields:

  • Returns uniform values consistent with the mapping.
  • Better formatting options for date and geo-data.
  • Values are returned as an array.

docvalue_fields:

  • doc_values are enabled by default for most data types.
  • Cannot be used to return text or annotated_text.
  • Loads data from the file system cache.
  • Values are returned as an array.
