Thursday, January 5, 2023

Introduction to Neural Search

 

Abstract

The information retrieval model behind search engines, like Apache Solr or Elasticsearch, has remained largely unchanged for decades. Following the rise of transformer-based language models like BERT and GPT in the late 2010s, an alternative model known as “neural search” has become a part of high-end IR solutions in the early 2020s, promising to overcome the limitations of the traditional model.

Introduction to Search Engines

This chapter provides an overview of the key terms and concepts used throughout the rest of the document. If you are already familiar with search engines, you may choose to move on to the next chapter.

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results, often referred to as hits, are usually presented in a list. The most popular form of a search engine is a Web search engine, like Google Search or Microsoft Bing, which searches for information on the World Wide Web. Large enterprises may also use internal search engines to search for information within their own information systems. Many popular enterprise search platforms, like Apache Solr and Elasticsearch, are based on Apache Lucene, an open-source search engine software library.

A successful search engine should be precise (returning results that are relevant to the user's needs), have high recall (returning as many of the relevant documents as possible), and be efficient (providing results in real time with minimal delay). As users, we often only need a specific piece of information and are satisfied with just one answer, rather than having to sift through numerous results. Therefore, the ideal search engine would quickly provide the most relevant answer as the first result.

A search engine typically has the following main responsibilities:

  • Indexing: Efficiently storing and organizing data to allow for fast retrieval.

  • Querying: Providing the ability to search through the data using natural language, keywords, or specific syntax.

  • Ranking: Presenting and ranking the results according to certain criteria to best meet the user's information needs.

During the indexing process, a search engine typically creates models (sometimes called representations) of the information being indexed (the indexed documents) and stores them in local storage. It's important to note that the "indexed documents" may not necessarily be one-to-one matches with real-world documents - the ingestion process may break them down into smaller components, like chapters, before indexing. At query time, the engine creates a model of the query, identifies a subset of indexed documents that are potential matches, and calculates the similarity between the model of the query and the models of the potential matches, generating a relevancy score for each one. The potential matches are then ranked in the search results list according to their relevancy scores, with the most relevant ones appearing first. A search query typically contains two parts:

  • A filter part of the query, which is a Boolean expression that evaluates to true or false for each document and is used to narrow down the documents that need to be scored. This filter is based on the metadata of the documents (such as language, intended audience, and time of publication) and/or the most important parts of the document content (such as keywords).

  • A scoring part of the query, which is used to determine the relevance of each of the models of the filtered documents to the model of the query (both parts are illustrated in the sketch below).
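To make this concrete, here is a hedged sketch of such a two-part query sent to Apache Solr over HTTP; the collection name ("docs") and the field names are hypothetical.

import requests

params = {
    # Filter part: a Boolean expression over document metadata, used to narrow
    # down the set of documents that need to be scored.
    "fq": "language:en AND published:[2020-01-01T00:00:00Z TO *]",
    # Scoring part: used to rank the filtered documents by relevance.
    "q": "content:(neural search)",
    "rows": 10,
}
response = requests.get("http://localhost:8983/solr/docs/select", params=params)
print(response.json()["response"]["docs"])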

There are several different models for representing documents and queries, as well as various metrics for measuring relevancy. Here is a brief overview of the traditional model and its associated metrics.

Overview of Traditional Information Retrieval Models

This chapter may contain technical details that may be difficult for non-technical readers to understand. As this information is not essential for understanding the limitations of traditional IR models and the principles behind neural search, you may choose to skip to the next chapter if desired.

Apache Lucene supports a number of pluggable scoring models. One of the most popular models is the Vector Space Model (VSM) of Information Retrieval. VSM operates by breaking down documents and queries into text fragments known as “terms”. A term may be a single word or a longer phrase, optionally normalized, e.g., by lemmatization. In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. The intuition behind this approach is that a match on a rare term (such as “Lucene”) is more important than a match on a common term (such as “we”). Lucene's classic VSM implementation uses the tf-idf numerical statistic to calculate the relevance score for a single term. The overall relevance of a document for a query is calculated as the sum of the scores for each query term.
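The following minimal Python sketch illustrates the tf-idf intuition on a toy corpus; it uses a simplified formula for illustration only, not Lucene's actual scoring function.

import math

corpus = [
    "we use lucene for search",
    "we like search engines",
    "we went for a walk",
]

def tf_idf(term, document, corpus):
    tf = document.split().count(term)                             # term frequency in this document
    docs_with_term = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / docs_with_term)                  # inverse document frequency
    return tf * idf

# The rare term "lucene" contributes a higher score than the common term "we".
print(tf_idf("lucene", corpus[0], corpus))  # ~1.10
print(tf_idf("we", corpus[0], corpus))      # 0.0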

Lucene also supports probabilistic models such as Okapi BM25 and DFR. While these models have a different theoretical foundation than VSM, they generally produce similar results and have the same limitations as VSM, which will be discussed in the next chapter.

The Shortcomings of Traditional IR Models

Here are some limitations of traditional IR Models:

  1. Lack of support for synonyms out of the box: for example, a query about "kids" would not match a document about "children" unless these terms are explicitly defined as synonyms in the search engine's resource file.

  2. Lack of support for semantically related words: for example, a query about "puppy" would not match a document about "dogs", and a query about a "workplace policy on dogs" would not match a "policy on pets." Traditional document models are not aware of this semantic similarity.

  3. High sensitivity to typographical and spelling errors.

  4. Lack of distinction between homographs: for example, the query "What are the things to consider when choosing a bank?" would match a document discussing a river bank, even though it is clear from the context that the query is about financial institutions.

  5. A bag-of-words approach that does not take word order into account: for example, the query "Does a US traveler to Serbia need a visa?" would incorrectly match documents answering whether a traveler from Serbia to the US needs a visa.

  6. Lack of support for paraphrases: for example, the query "Can you get medicine for someone else at the pharmacy?" would not match a document answering the same question expressed as "Can a patient have a friend or family member pick up a prescription?" because there is no match on any important terms between the two.

The term "neural search" is a less formal version of "neural information retrieval," which was first introduced at a research workshop at the SIGIR 2016 conference on using deep neural networks in the field of information retrieval.

Neural search is a type of search technology that uses artificial neural networks to improve the accuracy and relevance of search results. Unlike traditional IR models, which are based on keywords, neural search uses machine learning to understand the meaning and context of search queries, and to generate more relevant and personalized results. This can be particularly helpful in situations where the search query is complex or ambiguous and a traditional IR model may struggle to return accurate results.

Deep neural networks are good at providing a representation of textual data that captures word and document semantics, allowing a machine to say which words and documents are semantically similar, overcoming the shortcomings of traditional document models. These representations are known as “embeddings”.

Text Embeddings

Since computer systems are designed to work with numbers, one of the main challenges in natural language processing (NLP) is how to convert text into a numerical form that an NLP algorithm can process and use to solve real-world tasks. Embeddings are the state-of-the-art solution to this problem.

Embeddings are learned vector representations of discrete, categorical variables, such as words. In computer science, a "vector" refers to a one-dimensional array data structure, like [0.45, 0.32, -0.59…]. A typical word embedding is an array containing a few hundred or a few thousand floating-point numbers (floats). Useful embeddings map similar categorical variables to points that are close to each other in the vector space. As a result, complex real-world tasks related to similarity, such as recommendations or search based on a query, can be reduced to a nearest neighbor search (NNS), a well-known optimization problem of finding the point in a given vector space that is closest to a given point. Embeddings and NNS can be used with any type of categorical variables (such as image embeddings or song embeddings), but this document focuses on words and text in general.
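Here is a minimal sketch of the idea with made-up three-dimensional vectors; real embeddings have hundreds of dimensions and are produced by a trained model.

import numpy as np

embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "bank":  np.array([0.0, 0.9, 0.4]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_neighbor(query_vector, candidates):
    # Return the candidate whose vector is closest to the query vector.
    return max(candidates, key=lambda name: cosine(query_vector, candidates[name]))

candidates = {name: vec for name, vec in embeddings.items() if name != "puppy"}
print(nearest_neighbor(embeddings["puppy"], candidates))  # -> "dog"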

While it may be a useful exercise to manually generate small embeddings for a small number of words, good embeddings are usually created using machine learning models. Word2vec (developed by Google in 2013) and GloVe (developed by Stanford in 2014) were some of the first successful algorithms for generating word embeddings using machine learning. These were followed by algorithms like doc2vec, which can create embeddings of variable-length pieces of text such as sentences, paragraphs, or entire documents.
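As an illustration, here is a minimal sketch using the gensim library's Word2vec implementation; the tiny training corpus is made up, so the resulting vectors are only illustrative, and a real model would be trained on millions of sentences.

from gensim.models import Word2Vec

sentences = [
    ["the", "dog", "chased", "the", "cat"],
    ["the", "puppy", "played", "with", "the", "dog"],
    ["the", "bank", "approved", "the", "loan"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["dog"])                       # the learned 50-dimensional vector for "dog"
print(model.wv.most_similar("dog", topn=2))  # nearest neighbors in the embedding space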

One major limitation of traditional word embeddings from the 2010s is their lack of context encoding in the resulting vector. These algorithms produce a fixed, pre-trained vector for words with the same spelling, even if their meanings are slightly or completely different (homographs). For example, the already mentioned "bank" as a financial institution and "bank" as a slope bordering a river would have the same Word2vec embedding, even though their meanings are not similar at all.

Contextual embeddings are a type of text embedding that overcomes this limitation by taking into account the context in which a word or phrase appears and generating a unique vector for each appearance of the word based on the surrounding words. This allows the model to capture the meaning of words in a more nuanced and accurate way, and to generate more relevant and optionally personalized results (by including personal preferences in the encoded context). Deep learning language models based on the transformer architecture, like BERT and GPT, are able to generate contextual embeddings.
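As a sketch of what this looks like in practice, the following snippet uses the Hugging Face transformers library and the bert-base-uncased checkpoint to obtain two different contextual vectors for the word "bank"; the example sentences are made up.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I deposited the check at the bank.",
    "We had a picnic on the river bank.",
]

with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        outputs = model(**inputs)
        # last_hidden_state holds one contextual vector per token.
        position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
        bank_vector = outputs.last_hidden_state[0, position]
        print(text, bank_vector[:5])  # the two "bank" vectors differ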

Sentence embedding is a method for representing sentences, or even paragraphs, as numerical vectors. This is typically done by generating contextual embeddings for the tokens (words, or parts of words) in the sentence, and then combining these token vectors in a way that captures the meaning of the sentence. There are several different methods for creating sentence embeddings, including averaging the word vectors, using a recurrent neural network, and training a separate model to generate sentence embeddings.
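For example, the sentence-transformers library wraps such a separately trained model behind a simple API; this sketch assumes that library and uses all-MiniLM-L6-v2, one commonly used example checkpoint.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Can you get medicine for someone else at the pharmacy?"
document = "Can a patient have a friend or family member pick up a prescription?"

query_vec, doc_vec = model.encode([query, document])
# High cosine similarity despite the two sentences sharing almost no keywords.
print(util.cos_sim(query_vec, doc_vec))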

One significant constraint of the embeddings generated by transformer-based language models is the maximum length of the input text. By default, BERT models are able to handle no more than 512 subword tokens, approximately 350-400 words. There are special pretrained transformer models for longer documents that can overcome this limitation. A newer GPT-based embedding model is able to handle input text of up to 8192 tokens, making embeddings more convenient for working with long documents.

Neural Search in Practice

The process of comparing the embeddings of a query with the embeddings of the indexed items to find the most similar items is a key aspect of neural search. In the case of document search, the search engine typically compares the sentence embeddings of the query to the sentence embeddings of the indexed documents. This comparison is performed using a metric such as Euclidean distance, dot product, or cosine similarity. Depending on the applied sentence embedding method and how we want to treat the documents of different lengths, it may be necessary to normalize the vectors before comparison.
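The following sketch shows the three metrics on made-up vectors and illustrates why normalization matters: after L2 normalization, the dot product and the cosine similarity coincide.

import numpy as np

query_vec = np.array([0.2, 0.8, 0.1])
doc_vec   = np.array([0.3, 0.7, 0.0])

euclidean = np.linalg.norm(query_vec - doc_vec)
dot       = np.dot(query_vec, doc_vec)
cosine    = dot / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))

# After L2 normalization, the dot product equals the cosine similarity, so the
# lengths of the vectors (e.g., of longer documents) no longer influence the score.
q_norm = query_vec / np.linalg.norm(query_vec)
d_norm = doc_vec / np.linalg.norm(doc_vec)
print(cosine, np.dot(q_norm, d_norm))  # the same value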

Since generating embeddings, especially sentence embeddings, is computationally intensive and time-consuming, neural search engines commonly generate the embeddings for indexed content during ingestion and store the document's embedding in a dedicated field in the indexed document.

The ingestion process should break real-world documents into indexed documents no longer than the maximum input size of the underlying embedding generator.
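A naive chunking sketch is shown below, using the rough 350-word figure mentioned earlier; the file name is a placeholder, and real ingestion pipelines usually split on sentence or paragraph boundaries and count tokens with the embedding model's own tokenizer.

def chunk_text(text, max_words=350):
    # Split a long real-world document into chunks small enough for the embedding model.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

with open("real_world_document.txt") as f:
    indexed_documents = chunk_text(f.read())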

Inverted indexes, which are commonly used for fast full-text searches with traditional document retrieval models, are not applicable to embeddings, which are arrays of floats. Given a query embedding vector v that models the information need, the easiest approach to providing neural search would be to calculate the distance (Euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information. This approach is quite expensive, so many approximate strategies are currently under active research, and some of them have been applied in the new releases (after 2021) of enterprise search engines.
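As an example of such an approximate strategy, the sketch below uses hnswlib, one open-source implementation of the HNSW algorithm mentioned later in this chapter; the dimensions, parameters, and random vectors are only illustrative.

import hnswlib
import numpy as np

dim = 384
doc_vectors = np.random.rand(10_000, dim).astype(np.float32)  # stand-ins for document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(doc_vectors), ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(len(doc_vectors)))

query_vector = np.random.rand(dim).astype(np.float32)    # stand-in for the query embedding
labels, distances = index.knn_query(query_vector, k=10)  # ids and distances of the 10 closest documents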

A search engine may use embeddings in combination with traditional retrieval models to improve the accuracy of search results. For example, a search engine may use a traditional filter query to narrow down the documents that need to be scored, then VSM or BM25 for preliminary ranking, and finally embeddings for re-ranking the top n hits.
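A rough sketch of such a pipeline is given below; the helper functions bm25_search() and embed() are hypothetical placeholders for the traditional search call and the embedding model.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hybrid_search(query, top_n=100):
    hits = bm25_search(query, rows=top_n)   # preliminary BM25 ranking (hypothetical helper)
    query_vec = embed(query)                # query embedding (hypothetical helper)
    # Re-rank the top n hits by the similarity of their stored embeddings to the query embedding.
    return sorted(hits, key=lambda hit: cosine(query_vec, hit["embedding"]), reverse=True)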

Since v9.0 (December 2021), Apache Lucene supports embeddings on a low level, indexing high-dimensionality numeric vectors to perform nearest-neighbor search, using the Hierarchical Navigable Small World graph algorithm.

Since v9.0 (May 2022), Apache Solr supports queries with embeddings, through the DenseVectorField field type and the K-Nearest-Neighbor (KNN) query parser.
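For illustration, a KNN query against Solr 9 might look like the sketch below; the collection and field names are hypothetical, and the query vector would normally be produced by the same embedding model that was used at indexing time.

import requests

query_vector = [0.12, -0.53, 0.34, 0.08]  # truncated example; real vectors have hundreds of dimensions

params = {
    "q": "{!knn f=content_vector topK=10}" + str(query_vector),
    "fl": "id,title,score",
}
response = requests.get("http://localhost:8983/solr/docs/select", params=params)
print(response.json()["response"]["docs"])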
