How to Create Data Streams in Elasticsearch

Opster Team

March 2021


What is a data stream?

An Elasticsearch data stream is an abstraction layer between the name that applications use for ingestion and search requests and the underlying indices in which Elasticsearch stores the data. Data streams let you store append-only time series data across multiple indices while giving you a single named resource for requests. Data sent to a data stream is stored in backing indices whose names follow this format:

.ds-<data-stream>-<yyyy.MM.dd>-<generation>
  • The date is the date the backing index was created (not to be confused with daily indices).
  • The generation is a six-digit, zero-padded serial number which increases by one each time the data stream rolls over.

For example:

.ds-mylogs-2021.03.01-000002
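
As a rough illustration of the naming format, the Python sketch below composes a backing index name from its three parts. The function name backing_index_name is hypothetical and exists only for this example. Elasticsearch generates these names itself, so you never need to build them in practice. The generation is zero-padded to six digits, which is how Elasticsearch formats it.

```python
from datetime import date


def backing_index_name(data_stream: str, created: date, generation: int) -> str:
    """Compose a name of the form .ds-<data-stream>-<yyyy.MM.dd>-<generation>.

    Illustrative helper only: Elasticsearch assigns backing index names itself.
    """
    # The date uses dots as separators; the generation is zero-padded to six digits.
    return f".ds-{data_stream}-{created:%Y.%m.%d}-{generation:06d}"


print(backing_index_name("mylogs", date(2021, 3, 1), 2))
# .ds-mylogs-2021.03.01-000002
```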

Beyond that, a data stream behaves for the most part like a regular index and accepts most of the standard Elasticsearch commands, subject to certain limitations which are explained further in this article.

Data stream use cases

Data streams are used for data such as logs, events and metrics. The key characteristics of data that suits a data stream are:

  • Time-based data
  • Very rarely updated (ideally never)

Differences between a data stream and a regular index

A data stream mostly works in the same way as a regular index and accepts most of the standard Elasticsearch commands. However, the following differences and limitations apply to data streams:

  • A data stream is an abstraction layer: the data is stored in underlying .ds indices.
  • Every document must contain an @timestamp field, mapped as date or date_nanos.
  • Requests sent to the data stream name can only update or delete existing documents through the “_update_by_query” and “_delete_by_query” APIs.
  • You cannot create documents directly in an underlying .ds index, although you may update or delete documents directly in a .ds index.
  • You cannot carry out any of the following operations on the data stream’s current write index: clone, close, delete, freeze, shrink and split.

How to create a data stream

Create an ILM Policy

The ILM policy will manage the underlying indices. In this case, we will define that the index should roll over when it reaches 50GB (max_size is the total size of the index’s primary shards) and that the index will be deleted 60 days after it rolls over.

PUT /_ilm/policy/my-policy-delete60d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB"
          }
        }
      },
      "delete": {
        "min_age": "60d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
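
To make the timing of this policy concrete, here is a minimal Python sketch of when a backing index becomes eligible for deletion. When a policy contains a rollover action, ILM measures min_age from the time the index rolled over rather than from its creation. The creation and rollover dates and the function name earliest_delete_time are invented for illustration. ILM also evaluates policies periodically, so the actual deletion happens some time after the computed moment.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE = timedelta(days=60)  # mirrors "min_age": "60d" in the delete phase


def earliest_delete_time(rolled_over_at: datetime, min_age: timedelta = MIN_AGE) -> datetime:
    """Return the earliest moment the delete phase can start for a rolled-over index."""
    return rolled_over_at + min_age


# Hypothetical backing index: created on 1 March, rolled over on 20 March.
created = datetime(2021, 3, 1, tzinfo=timezone.utc)
rolled_over = datetime(2021, 3, 20, tzinfo=timezone.utc)

delete_at = earliest_delete_time(rolled_over)
print(delete_at.date())            # 2021-05-19
print((delete_at - created).days)  # 79 (the oldest documents outlive the 60-day min_age)
```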

Create an index template for the data stream

The template below defines that we will use the lifecycle policy we just created.

The “data_stream”:{} object indicates that it is a data stream and not a regular index. You can also add index mappings and other settings here, just as you would for a regular index.

PUT /_index_template/my-logs-data-stream
{
  "index_patterns": [ "logs-data-stream*" ],
  "data_stream": { },
  "priority": 500,
  "template": {
    "settings": {
      "index.lifecycle.name": "my-policy-delete60d"
    },
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        }
      }
    }
  }
}
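
The index_patterns entry decides which names this template applies to. The Python sketch below approximates that check with the standard library’s fnmatchcase. This is an assumption made for illustration, not how Elasticsearch implements it: it ignores template priority, and it behaves the same only for patterns that use nothing but the * wildcard. The function name matches_template is hypothetical.

```python
from fnmatch import fnmatchcase

# The pattern list from the index template above.
INDEX_PATTERNS = ["logs-data-stream*"]


def matches_template(name: str, patterns=INDEX_PATTERNS) -> bool:
    """Approximate check of whether a target name matches any template pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for candidate in ("logs-data-stream-test", "logs-data-stream", "logs-other"):
    print(candidate, matches_template(candidate))
```

Only the first two names match, so only they would be created as data streams by this template.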

Create the data stream / indexing

By sending data to the data stream name, the data stream will get created automatically using the specifications in the template.

POST /logs-data-stream-test/_doc/
{
  "@timestamp": "2021-04-07T12:02:15.000Z",
  "message": "Hello world"
}

Note that the indexing command is just the same as we would use for a regular index. It’s the index template which tells Elasticsearch to create a data stream rather than a regular index.  

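The standard bulk API also works with data streams, but remember that a data stream is append-only: each bulk action sent to the data stream name must be a create action, and update or delete actions are not allowed. Note also that to set the document ID yourself, instead of PUT /<index>/_doc/<_id> you use the following syntax: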

PUT /logs-data-stream-test/_create/my-id
{
  "@timestamp": "2021-04-07T12:02:15.000Z",
  "message": "I just set the ID of this document"
}
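
For bulk indexing, the request body is newline-delimited JSON in which every document is preceded by an action line. The Python sketch below builds such a body using only create actions, which is what a data stream accepts. The function name bulk_create_body and the sample documents are hypothetical. The resulting string would be sent as the body of POST /logs-data-stream-test/_bulk, but the sketch does not send anything.

```python
import json


def bulk_create_body(docs):
    """Build an NDJSON bulk body that uses a create action for every document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"create": {}}))  # action line: create only
        lines.append(json.dumps(doc))             # source line: the document itself
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"@timestamp": "2021-04-07T12:02:15.000Z", "message": "Hello world"},
    {"@timestamp": "2021-04-07T12:02:16.000Z", "message": "Hello again"},
]
body = bulk_create_body(docs)
print(body, end="")
```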

Working with data streams

Searching the data stream

Searching a data stream works in the same way as searching a regular index.

A search against a data stream covers all of its backing indices, i.e. all of the data present in the data stream.

GET /logs*/_search
{
  "query": {
    "match": {
      "message": "hello"
    }
  }
}

Deleting data 

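The recommended way of deleting data from a data stream is through the ILM policy. This means that backing indices over a certain age automatically get deleted. However, beware that retention is applied to whole backing indices rather than to individual documents: because the policy includes a rollover action, min_age is counted from the time the index rolled over, so the oldest documents in an index may be kept considerably longer than 60 days.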

It is possible to delete individual documents or sets of documents using _delete_by_query. This query is the same for data streams and for regular indices.

POST /logs*/_delete_by_query
{
  "query": {
    "match": {
      "log_type": "alpha"
    }
  }
}

It is also possible to delete the entire data stream and all of the underlying indices:

DELETE /_data_stream/logs-data-stream-test

If you know the underlying index name and ID of the document you can delete documents from the underlying index like this:

DELETE /.ds-logs-data-stream-2020.04.18-000002/_doc/dfdpvnfBt7VVZ

Updating data

Data streams are not intended for use cases where you need to update data regularly. However, if you need to, you can update documents in the underlying index by performing the actions below.

First, get the sequence number, primary term number, ID and backing index of the document to be updated:

GET /logs-data-stream/_search
{
  "seq_no_primary_term": true,
  "query": {
    "match": {
      "id": "abc123"
    }
  }
}

Then use all four items from the output of the above request to run the following:

PUT /.ds-logs-data-stream-2020.04.03-000002/_doc/aasdfhghtRcc563?if_seq_no=0&if_primary_term=1
{
  "@timestamp": "2020-03-07T12:02:07.000Z",
  "id": "abc123",
  "message": "I updated this document"
}
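
The four items map directly onto fields of each search hit: _index, _id, _seq_no and _primary_term. The Python sketch below shows how they combine into the path of the conditional update request. The hit dictionary is a hand-written stand-in for a real search response and the function name conditional_update_path is hypothetical. No request is sent.

```python
from urllib.parse import quote, urlencode


def conditional_update_path(hit: dict) -> str:
    """Build the path for a PUT that succeeds only if the document is unchanged."""
    query = urlencode({
        "if_seq_no": hit["_seq_no"],
        "if_primary_term": hit["_primary_term"],
    })
    # The document ID is URL-encoded in case it contains reserved characters.
    return f"/{hit['_index']}/_doc/{quote(hit['_id'], safe='')}?{query}"


# Hand-written example of the relevant fields of one search hit.
hit = {
    "_index": ".ds-logs-data-stream-2020.04.03-000002",
    "_id": "aasdfhghtRcc563",
    "_seq_no": 0,
    "_primary_term": 1,
}
print("PUT " + conditional_update_path(hit))
# PUT /.ds-logs-data-stream-2020.04.03-000002/_doc/aasdfhghtRcc563?if_seq_no=0&if_primary_term=1
```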

Updating mappings

You can update dynamic settings or add new fields to a data stream in much the same way as you would for a regular index. A new field can be added to the index template, which applies it to backing indices created from then on.

However, as with regular indices, you cannot change the mapping type of an existing field once the data stream has been created. If you were to change the mapping type of an existing field in the template, you would end up with two different mapping types (a mapping conflict) for the same field within the data stream. This would make it impossible to search within the data stream on that field. If you need to modify the mapping type of an existing field, you will have to create a new data stream with the appropriate mappings and reindex the entire data stream into it.
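
A mapping conflict of this kind can be spotted by comparing the field types across backing indices. The Python sketch below does this for flat "properties" dictionaries. The input data is invented for illustration, the function name find_mapping_conflicts is hypothetical, and nested objects and multi-fields are ignored for brevity.

```python
def find_mapping_conflicts(mappings_by_index: dict) -> dict:
    """Return fields mapped with more than one type across the given indices.

    mappings_by_index maps an index name to its flat "properties" dictionary.
    """
    types_by_field = {}
    for properties in mappings_by_index.values():
        for field, spec in properties.items():
            types_by_field.setdefault(field, set()).add(spec.get("type"))
    # Keep only the fields that appear with two or more distinct types.
    return {
        field: sorted(types)
        for field, types in types_by_field.items()
        if len(types) > 1
    }


# Invented example: "status" changed from keyword to long between generations.
mappings = {
    ".ds-logs-data-stream-test-2021.04.07-000001": {
        "@timestamp": {"type": "date"},
        "status": {"type": "keyword"},
    },
    ".ds-logs-data-stream-test-2021.04.20-000002": {
        "@timestamp": {"type": "date"},
        "status": {"type": "long"},
    },
}
print(find_mapping_conflicts(mappings))
# {'status': ['keyword', 'long']}
```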

Other useful commands

To see the underlying indices of the data stream, you can use this command:

GET _data_stream/my-data-stream


