Everybody’s talking about (or to) ChatGPT and worrying about AI taking over the world – but how will these new developments affect how we build site and enterprise search? Charlie Hull, Managing Consultant at search relevance specialists OpenSource Connections and a search expert for over 20 years, explains…
Search engines have historically struggled to capture meaning and context from both their source documents and user queries. The ranking of search results in these engines is derived from statistical methods that weigh how frequently a word appears within a document against how rare it is across the entire set of source documents (TF-IDF). Thus, “mackerel” may be more significant than “fishing”, and documents containing only the former word deemed more relevant than those containing only the latter for a query like “what is mackerel fishing”. However, the traditional search engine does not derive any understanding of the combined phrase, nor its semantic similarity to “crab catching”. It simply counts words, without knowing what they represent.
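The core TF-IDF calculation can be sketched in a few lines of Python; the tiny corpus and scores below are invented purely for illustration:

```python
import math

# Toy corpus: every document mentions "fishing" but only one mentions
# "mackerel", so "mackerel" carries more weight (higher IDF).
docs = [
    "mackerel fishing boats".split(),
    "fishing nets and fishing rods".split(),
    "fishing for oily fish".split(),
]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)              # term frequency in this document
    df = sum(1 for d in docs if term in d)       # documents containing the term
    idf = math.log(len(docs) / df) if df else 0  # rarity across the corpus
    return tf * idf

print(tf_idf("mackerel", docs[0]))  # positive: rare term, significant
print(tf_idf("fishing", docs[0]))   # 0.0: appears everywhere, carries no weight
```

Real engines refine this formula considerably (Lucene-based platforms default to BM25, a tuned relative of TF-IDF), but the principle of weighting by rarity is the same.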
Search is essentially a process of matching a user query to a set of results, which could be documents, emails, products etc. The user is very likely to use different terms to those in the source documents, especially if they are unfamiliar with the subject area, speak a different language or cannot express their need easily and accurately. This means that although traditional search engines can give high precision results for some queries, they struggle when trying to retrieve results for queries that are “like” those queries (e.g. “catch oily fish” in the example above), or to retrieve results that do not match the query exactly but are about similar subjects (e.g. a document about “trawling herring”), which would increase recall. They simply do not “know” the relationship between these words, terms and concepts.
Synonyms are often used (“mackerel” == “oily fish”) to make it possible to retrieve these further results. Query rewriting using systems like Querqy/SMUI can also help improve results using business rules (“if the query contains the word ‘mackerel’, boost results from the category ‘types of fish’ and add the synonym ‘oily fish’”). Spelling suggestions can offer words from the index that are only a few characters different from the query word. Using Learning to Rank, the order of results can be re-arranged by an algorithm trained on signals that indicate the ‘best’ order for a set of queries. These are all techniques used by relevance engineers to improve the performance of traditional search engines.
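At its simplest, query-time synonym expansion just generates query variants that the engine can OR together; the dictionary below is a made-up illustration of the idea, not Querqy syntax:

```python
# Hypothetical synonym dictionary, invented for this sketch.
SYNONYMS = {
    "mackerel": ["oily fish"],
    "oily fish": ["mackerel"],
}

def expand_query(query: str) -> list[str]:
    """Return the original query plus any synonym variants."""
    variants = [query]
    for term, alternatives in SYNONYMS.items():
        if term in query:
            for alt in alternatives:
                variants.append(query.replace(term, alt))
    return variants

print(expand_query("catch mackerel"))
# ['catch mackerel', 'catch oily fish']
```

Search platforms like Solr and Elasticsearch apply the same idea via synonym token filters at index or query time, so no application-level rewriting is needed.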
Natural Language Processing (NLP) techniques have also been used to enhance this basic behaviour, such as named entity extraction (the automatic recognition of phrases and words like “South Africa” and “IBM” as potentially significant). Some commercial search vendors have in the past claimed capabilities such as ‘concept search’ but evidence for this was usually lacking, and invariably their technology used the same statistical methods under the hood.
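As a rough illustration, the simplest form of entity extraction is a gazetteer lookup against a fixed list of known entities; production NLP pipelines use trained statistical models instead of a hand-built list like this one:

```python
import re

# Toy gazetteer of known entities, invented for this sketch.
ENTITIES = ["South Africa", "IBM", "Google"]
pattern = re.compile("|".join(re.escape(e) for e in ENTITIES))

def extract_entities(text: str) -> list[str]:
    """Return known entities found in the text, in order of appearance."""
    return pattern.findall(text)

print(extract_entities("IBM opened a new office in South Africa."))
# ['IBM', 'South Africa']
```

A trained model can additionally recognise entities it has never seen in a list, which is what makes statistical NER more powerful than this lookup approach.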
Continual research in NLP has recently led to advanced language models, which use neural networks (a specialised form of AI) to learn the structure of large amounts of text data. Neural networks are a technique used in deep learning, a subset of machine learning, and take their name and structure from an imitation of the human brain. They can also learn associations between different kinds of data, such as images and text labels. Models are evolving very rapidly and there are many to choose from depending on the use case. These models help machines move much closer to ‘understanding’ related concepts and the words and images used to describe them.
Large language models (LLMs) are neural networks trained on huge web-scale datasets, and at this scale have required significant investment by large companies such as Google and Microsoft. Generative Pretrained Transformer (GPT) models are one example: Transformers are neural networks that specialise in sequential data like text. Chat interfaces like ChatGPT (running on the GPT models) have hugely increased the visibility of these developments and generated considerable excitement amongst the general public. Some of the larger models have (debatably) shown signs of ‘emergent’ properties that point to potential machine understanding of particular areas.
Models may be generic and need to be fine-tuned on specific contexts (e.g. the fishing industry or medical terminology) and may also need to be periodically re-trained as the source data is updated. Models need to be managed, deployed and hosted and depend on a great deal of well-managed and clean training data. Some models are open source while others are only available via APIs on a commercial basis – and these may be retired at any time by the provider. Many public examples exist of bias and inaccuracies in these models – perhaps unsurprising if they have been trained on data from the public Web. Closed source, commercial models are impossible to inspect for bias, and the source and veracity of their training data cannot be verified.
One output of these models is dense vector embeddings – numerical representations of our source data or query as a series of numbers representing a position in a multidimensional space. (As an aside, our traditional text search engines also generate a kind of vector – although a much more sparsely filled one, where there are lots of zeroes where words do not appear at all in a particular document). These embeddings need to be stored somewhere to be useful for search and this may be done in a specialised vector database (and many new ones have appeared), or increasingly in new data structures added to traditional search engine platforms. Queries also need to be turned into vectors to allow for vector matching, which may be done using nearest-neighbour techniques (how “close” is our query vector to our document vector in the multidimensional space and thus how relevant is our document to the query).
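Nearest-neighbour matching can be illustrated with cosine similarity between a query vector and each document vector; the 3-dimensional vectors below are invented for the sketch (real embeddings have hundreds or thousands of dimensions produced by a model):

```python
import math

# Invented dense vectors standing in for model-generated embeddings.
docs = {
    "mackerel fishing": [0.9, 0.8, 0.1],
    "crab catching":    [0.8, 0.7, 0.2],
    "tax returns":      [0.1, 0.0, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.85, 0.75, 0.15]  # hypothetical embedding of "catch oily fish"
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # semantically similar documents rank first
```

At scale, an exhaustive comparison like this is too slow, which is why vector databases use approximate nearest-neighbour (ANN) indexes such as HNSW to trade a little accuracy for speed.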
Vector search is only one potential application for a vector database, which can also be used for image recognition, language translation, similar item recommendation and many more tasks. Some vector databases also have search-specific features such as filtering and facets, and can be considered vector search engines. Traditional search platforms that have added vector features after the fact may not have the raw performance of vector databases, although they do have a raft of stable, fast and highly evolved traditional search features. Combining traditional search ranking, which is highly performant, well understood and prioritises precision, with vector search ranking, which may return many more results and prioritises recall, is difficult: it involves blending the output of two very different matching systems into a single ‘hybrid search’.
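One common way to blend the two result lists is reciprocal rank fusion (RRF), which scores each document by its position in every list, sidestepping the problem that keyword and vector scores are on incomparable scales; the result lists below are made up for illustration:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: score each doc by summing 1/(k + rank)
    over every ranked list it appears in; k=60 is a common default."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc_a", "doc_b", "doc_c"]          # precision-oriented list
vector_results  = ["doc_b", "doc_d", "doc_a", "doc_e"] # recall-oriented list
print(rrf([keyword_results, vector_results]))
# doc_b ranks first: it appears high up in both lists
```

Several platforms now offer hybrid scoring along these lines built in, though the right blend is still something to evaluate against real user queries.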
A further challenge is that many of those familiar with creating the machine learning algorithms that power vector search (data scientists, machine learning engineers) are not so familiar with traditional search techniques, many of which may still be the most efficient method for solving certain search problems. Equally, many relevance engineers are still learning these new techniques of vector search, while the excitement around large language models and their applications such as ChatGPT brings increasing pressure to deliver cutting-edge solutions. Successfully scaling, deploying and operating these machine learning models in production (MLOps) is also a challenge. Team structure, collaboration methods and effective training will be important factors.
Explainability is a particular challenge for vector search: because neural networks are extremely complex, it is hard to explain in detail why a particular vector search match occurred, whereas in traditional search one can often plainly see a query word occurring in a source document as the reason for a match. This can lead to difficulties with tuning for accuracy, and to a bad user experience with an associated lack of trust. The issues around bias and training data mentioned above can also be a problem when using third-party, closed models.
Successfully navigating this new world of search, peppered with advanced mathematical techniques and buzzwords, promising much but potentially costing more, will depend on a deep understanding of when these new techniques can actually help and how they can be implemented in practice. At OpenSource Connections we specialise in both traditional and vector search approaches with a deep knowledge of the technical options available, and we will continue to guide our clients towards solutions that address their business needs via our training and consulting. Despite what some commentators have written, these new developments are highly unlikely to replace all traditional techniques – but change is certainly coming.
- https://opensourceconnections.com/blog/2019/12/18/bert-and-search-relevance-part2-dense-vs-sparse/ Blog from Doug Turnbull explaining the difference between traditional (sparse) and dense vectors
- https://haystackconf.com/eu2022/2022/09/27/keynote.html Dmitry Kan with an overview of where vector search is taking us and some of the companies and projects involved
- https://www.haystackconf.com The Haystack conference series run by OpenSource Connections features talks on the many ways to improve search quality – the next event is Haystack US the week of April 24th, featuring talks from Amazon, Elsevier Health, GetYourGuide and EBSCO.