Elasticsearch Opensearch k-NN

Opster Expert Team - Gustavo

Nov-2021

Opster Team

October 2021

Average Read Time

5 Mins

In addition to reading this guide, we recommend you run the Elasticsearch Health Check-Up. It will detect issues and improve your Elasticsearch performance by analyzing your shard sizes, threadpools, memory, snapshots, disk watermarks and more.

The Elasticsearch Check-Up is free and requires no installation.

What is k-NN? 


k-NN stands for k-nearest neighbors and is used to find nearby documents based on vector dimensions. 

This strategy is widely used for recommendations. Based on many parameters (vector dimensions), users are classified and grouped; then, when a new user arrives, you can use k-NN to compare their vector with those of existing users and give relevant recommendations.

Another interesting use case is “word embedding”, or converting words to vectors. In this scenario, you want to search for “cat” and find things like “pet supplies”, “kitty sand”, and “whiskers”, as opposed to “car” or “catapult”. 

To achieve this, a model is trained to assign vectors to each word based on context. Then, using k-NN, we can know that “cat” and “pet supplies” are neighbors despite them being completely different words.
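A brute-force version of this neighbor lookup can be sketched in a few lines of Python. The four-dimensional word vectors below are made up for illustration; real embeddings come from a trained model and have many more dimensions:

```python
import math

# Hypothetical word vectors from an embedding model (made-up values).
embeddings = {
    "cat":          [2.0, 3.0, 5.0, 6.0],
    "pet supplies": [2.5, 3.0, 4.5, 5.5],
    "kitty sand":   [2.0, 4.0, 5.0, 5.0],
    "car":          [9.0, 1.0, 0.5, 0.2],
    "catapult":     [8.0, 9.0, 9.0, 9.0],
}

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(word, k):
    """Return the k nearest neighbors of `word` by L2 distance."""
    query = embeddings[word]
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: l2(query, embeddings[w]))[:k]

print(knn("cat", 2))  # → ['pet supplies', 'kitty sand']
```

The plugin does this at scale: instead of comparing the query against every document, the approximate method walks an HNSW graph to find close vectors quickly.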

k-NN Opensearch plugin

Opensearch comes to the rescue with a plugin that handles this distance search. The only thing you need to do is ingest your documents with their vectors as a k-NN field type.

Installation

The all-in-one installation of Opensearch includes the plugin, so you don’t have to install it. 

If you are not sure, you can verify your plugins and look for opensearch-knn:

sudo bin/opensearch-plugin list

In case you don’t see opensearch-knn in the list, you can easily install it: 

sudo bin/opensearch-plugin install opensearch-knn

Usage

The k-NN plugin supports 3 methods to obtain the neighbors. Below is a summary of these methods. If you’d like a more in-depth explanation, see the official docs.

| Name | Method | Features | Use case |
| --- | --- | --- | --- |
| Approximate k-NN | HNSW (Hierarchical Navigable Small World) algorithm | Low latency, scalable | Large indices (many vectors), no pre-filters |
| Script Score k-NN | Brute force | Supports binary-represented fields | Smaller bodies or pre-filtered queries |
| Painless extensions | Custom | Supports distance functions | More complex scenarios |

You should use Approximate k-NN for large indices with no pre-filter queries, Script Score for smaller bodies/pre-filtered queries, and Painless extensions in case you need to use a custom distance function to calculate score.

k-NN Mapping 

The plugin introduces a new field type called “knn_vector” that has many configuration options, which you can check out here.

A knn_vector field type looks like this:

{
 "my_vector_field": {
   "type": "knn_vector",
   "dimension": 4,
   "method": {
     "name": "hnsw",
     "space_type": "l2",
     "engine": "nmslib",
     "parameters": {
       "ef_construction": 128,
       "m": 24
     }
   }
 }
}

The most relevant properties are: 

  • dimension: The number of dimensions your vectors will have. 
  • ef_construction: Size of the dynamic list used during graph construction. The higher the value, the more precise the results, but the slower the indexing.
  • m: Number of bidirectional links the plugin creates for each element. It impacts memory usage; keep it between 2 and 100.
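As a rough sizing aid, the OpenSearch k-NN docs give an approximate formula for the native memory HNSW requires: about 1.1 * (4 * dimension + 8 * m) bytes per vector. A quick sketch (the one-million-vector count is a made-up example):

```python
def hnsw_memory_bytes(num_vectors, dimension, m):
    """Rough native-memory estimate for an HNSW graph, per the OpenSearch
    k-NN docs: ~1.1 * (4 * dimension + 8 * m) bytes per vector."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors

# One million 4-dimensional vectors with m=24 (values from the mapping above):
est = hnsw_memory_bytes(1_000_000, dimension=4, m=24)
print(f"{est / 1024 ** 2:.0f} MiB")  # roughly 218 MiB
```

This is only an estimate, but it shows why large m values (and high-dimensional vectors) drive memory usage up.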

Approximate k-NN 

Create a k-nn index 

Let’s ingest some documents to search across using the Approximate k-NN method, reusing the word embedding example. The vector values should be generated by a word embedding library; the ones in this example are made up. 

PUT test-knn
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
        "product_vector": {
          "type": "knn_vector",
          "dimension": 4,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "nmslib",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }
    }
  }
}

Now let’s ingest some example documents.

POST _bulk
{ "index": { "_index": "test-knn", "_id": "1" } }
{ "product_vector": [1, 5, 5, 4], "price": 10.00,"name": "Hygienic sand" }
{ "index": { "_index": "test-knn", "_id": "2" } }
{ "product_vector": [5, 4, 4, 4], "price": 25.00,"name": "Pet supplies pack" }
{ "index": { "_index": "test-knn", "_id": "3" } }
{ "product_vector": [7, 9, 9, 9], "price": 500,"name": "Catapult" }
{ "index": { "_index": "test-knn", "_id": "4" } }
{ "product_vector": [1, 1, 2, 1], "price": 5,"name": "Hot Wheels Car" }

Each element of the vector array is a dimension that represents a feature of the document. Machine learning models can compute thousands of dimensions; the dimension property in the mappings supports up to 10,000.

Querying

Let’s say the vector representation for “cat” after training our model is: [2,3,5,6].

GET test-knn/_search
{
  "size": 2,
  "query": {
    "knn": {
      "product_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}
  • size: Number of results to return.
  • k: Number of neighbors to retrieve from each graph. 
  • vector: Vector representation of the document we are searching for.

The response might look like this:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.1,
    "hits" : [
      {
        "_index" : "test-knn",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1,
        "_source" : {
          "product_vector" : [
            1,
            5,
            5,
            4
          ],
          "price" : 10.0,
          "name" : "Hygienic sand"
        }
      },
      {
        "_index" : "test-knn",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.0625,
        "_source" : {
          "product_vector" : [
            5,
            4,
            4,
            4
          ],
          "price" : 25.0,
          "name" : "Pet supplies pack"
        }
      }
    ]
  }
}
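These scores can be reproduced by hand: for the l2 space type, the plugin scores a hit as 1 / (1 + squared L2 distance), which a few lines of Python confirm:

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_score(query, doc):
    """Score the k-NN plugin assigns for the l2 space type: 1 / (1 + l2^2)."""
    return 1 / (1 + l2_squared(query, doc))

query = [2, 3, 5, 6]
print(knn_score(query, [1, 5, 5, 4]))  # Hygienic sand     → 0.1
print(knn_score(query, [5, 4, 4, 4]))  # Pet supplies pack → 0.0625
```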

Note how “Catapult” and “Hot Wheels Car” are not returned, while “Pet supplies pack” and “Hygienic sand” are, ranked by how close their vectors are to the query.

The interesting thing is that the criteria used to define this distance can differ per model. Word embedding models are trained on large amounts of text such as books or newspapers, but dimensions can also represent things like movie category preferences, ratings, age, city, or movie duration.

Usage with filters

As we said, Approximate k-NN does not support pre-filtering, so if, for example, we want to filter the products by price, the filter has to run after the k-NN search, using a post_filter clause.

GET test-knn/_search
{
  "size": 2,
  "query": {
    "knn": {
      "product_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  },
  "post_filter": {
    "range": {
      "price": {
        "lte": 20
      }
    }
  }
}

With this query, the “Pet supplies pack” will not be returned because of its price.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.1,
    "hits" : [
      {
        "_index" : "test-knn",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1,
        "_source" : {
          "product_vector" : [
            1,
            5,
            5,
            4
          ],
          "price" : 10.0,
          "name" : "Hygienic sand"
        }
      }
    ]
  }
}
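The post_filter behavior can be mimicked in plain Python to make the ordering explicit: the k-NN search runs first, and the price filter is applied to its results (product data taken from the bulk request above):

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

products = [
    {"name": "Hygienic sand",     "price": 10.00, "vector": [1, 5, 5, 4]},
    {"name": "Pet supplies pack", "price": 25.00, "vector": [5, 4, 4, 4]},
    {"name": "Catapult",          "price": 500,   "vector": [7, 9, 9, 9]},
    {"name": "Hot Wheels Car",    "price": 5,     "vector": [1, 1, 2, 1]},
]

query = [2, 3, 5, 6]

# Step 1: k-NN search (k=2 nearest by L2 distance).
hits = sorted(products, key=lambda p: l2_squared(query, p["vector"]))[:2]

# Step 2: post-filter by price, applied AFTER the k-NN search.
hits = [p for p in hits if p["price"] <= 20]

print([p["name"] for p in hits])  # → ['Hygienic sand']
```

Because the filter runs after the top-k selection, a cheap product outside the nearest two (like the Hot Wheels Car) can never re-enter the results.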

Scoring Script k-NN

With this approach we can apply a filter before executing the nearest neighbor search. Because it is a brute-force search, it does not scale very well, but it is useful for dynamic search cases where the index body may vary based on other conditions.

Mappings

PUT my-knn-index-1
{
  "mappings": {
    "properties": {
      "product_vector": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}

As you can see, this is a shorter version of the approximate-search mapping, because many of the configuration options are exclusive to that approach.

If you want to use the same index for both script score and approximate search, you have to set index.knn: true and set the index.knn.space_type setting.

If you only want to use the script score capabilities, it’s recommended to set index.knn: false. You will benefit from lower memory usage and faster indexing, but you will not be able to run standard k-NN queries against the index.

Query

GET my-knn-index-1/_search
{
 "size": 4,
 "query": {
   "script_score": {
     "query": {
       "match_all": {}
     },
     "script": {
       "source": "knn_score",
       "lang": "knn",
       "params": {
         "field": "product_vector",
         "query_value": [2.0, 3.0, 5.0, 6.0],
         "space_type": "cosinesimil"
       }
     }
   }
 }
}

All parameters are required.

The most relevant are:

  • lang: knn instead of painless.
  • source: the name of the script (knn_score).
  • query_value: The point you want to find neighbors for, represented as an array of floats whose length must match the dimension defined in the mappings.
  • space_type: The function used to calculate the distance between points.

Binary data in Score Script

Score script k-NN also supports passing query_value as binary data instead of an array. In this case you have to use the Hamming distance space (hammingbit) and encode the value as a base64 string. You can also represent this value as type long.
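Under the hammingbit space, the distance is simply the number of differing bits between the two binary values. A small sketch of that computation (the base64 strings here are made-up examples):

```python
import base64

def hamming_bits(b64_a, b64_b):
    """Number of differing bits between two base64-encoded binary values."""
    a = base64.b64decode(b64_a)
    b = base64.b64decode(b64_b)
    assert len(a) == len(b), "binary values must have the same length"
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# "Hello" vs "Hellw": only the last byte differs, by 2 bits.
print(hamming_bits("SGVsbG8=", "SGVsbHc="))  # → 2
```

The fewer bits differ, the closer the neighbor, so a lower Hamming distance yields a higher score.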

Create the index

PUT my-index
{
  "mappings": {
    "properties": {
      "my_binary": {
        "type": "binary",
        "doc_values": true
      },
      "color": {
        "type": "keyword"
      }
    }
  }
}

Index documents

POST _bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "my_binary": "SGVsbG8gV29ybGQh", "color" : "RED" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "my_binary": "ay1OTiBjdXN0b20gc2NvcmluZyE=", "color" : "RED" }

Run a query

GET my-index/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "color": "RED"
            }
          }
        }
      },
      "script": {
        "lang": "knn",
        "source": "knn_score",
        "params": {
          "field": "my_binary",
          "query_value": "U29tZXRoaW5nIEltIGxvb2tpbmcgZm9y",
          "space_type": "hammingbit"
        }
      }
    }
  }
}

Note that the query searches for neighbors among RED documents only: the term filter runs before the k-NN scoring, so any other color is ignored.

Painless Script

This method allows you to manipulate the distance function values and score. You can choose from a list of available functions and, for example, multiply the value of the score, or involve other fields of the document in the equation. All the functions receive the field name and vector value as parameters.

See the example below.

GET my-knn-index-1/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1.0 + cosineSimilarity(params.query_value, doc[params.field])",
        "params": {
          "field": "product_vector",
          "query_value": [
            7.9,
            2.9,
            4.9,
            4.9
          ]
        }
      }
    }
  }
}

This script executes a cosine similarity function to calculate the score, and then adds 1.0 to the result.
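The same computation can be mirrored in Python to see what the script produces (cosine similarity ranges from -1.0 to 1.0, so adding 1.0 keeps the score non-negative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (range -1.0 to 1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [7.9, 2.9, 4.9, 4.9]
doc = [5, 4, 4, 4]  # "Pet supplies pack" from the earlier bulk request

score = 1.0 + cosine_similarity(query, doc)
print(round(score, 4))
```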

Constraints

  • You can choose between the l2Squared, l1Norm, and cosineSimilarity distance functions (more information here).
  • The query_value dimension must match the one defined in the mapping, or the query will fail.
  • If a document does not have a vector value, the query will fail. You can avoid this by checking the value before running the distance function: "source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))".
  • Scores can only be positive, so with the guard above, documents missing the vector field score 0 and rank below documents that have it.

