Thursday, January 5, 2023

Introduction to Neural Search

 

Abstract

The information retrieval model behind search engines like Apache Solr or Elasticsearch has remained largely unchanged for decades. Following the rise of transformer-based language models like BERT and GPT in the late 2010s, an alternative model known as “neural search” has become part of high-end IR solutions in the early 2020s, promising to overcome the limitations of the traditional model.

Introduction to Search Engines

This chapter provides an overview of the key terms and concepts used throughout the rest of the document. If you are already familiar with search engines, you may choose to move on to the next chapter.

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results, often referred to as hits, are usually presented in a list. The most popular form of a search engine is a Web search engine, like Google Search or Microsoft Bing, which searches for information on the World Wide Web. Large enterprises may also use internal search engines to search for information within their own information systems. Many popular enterprise search platforms, like Apache Solr and Elasticsearch, are based on Apache Lucene, an open-source search engine software library.

A successful search engine should be precise (returning results that are relevant to the user's needs), have high recall (retrieving as many of the relevant documents as possible), and be efficient (providing results in real time with minimal delay). As users, we often need only a specific piece of information and are satisfied with a single answer, rather than having to sift through numerous results. Therefore, the ideal search engine would quickly provide the most relevant answer as the first result.

A search engine typically has the following main responsibilities:

  • Indexing: Efficiently storing and organizing data to allow for fast retrieval.

  • Querying: Providing the ability to search through the data using natural language, keywords, or specific syntax.

  • Ranking: Presenting and ranking the results according to certain criteria to best meet the user's information needs.

During the indexing process, a search engine typically creates models (sometimes called representations) of the information being indexed (the indexed documents) and stores them in local storage. It's important to note that the "indexed documents" may not necessarily be one-to-one matches with real-world documents - the ingestion process may break them down into smaller components, like chapters, before indexing. At query time, the engine creates a model of the query, identifies a subset of indexed documents that are potential matches, and calculates the similarity between the model of the query and the models of the potential matches, generating a relevancy score for each one. The potential matches are then ranked in the search results list according to their relevancy scores, with the most relevant ones appearing first. A search query typically contains two parts:

  • A filter part of the query, which is a Boolean expression that evaluates to true or false for each document and is used to narrow down the documents that need to be scored. This filter is based on the metadata of the documents (such as language, intended audience, and time of publication) and/or the most important parts of the document content (such as keywords).

  • A scoring part of the query, which is used to determine the relevance of each of the models of the filtered documents to the model of the query (an example query with both parts is sketched below).
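
To make the two parts concrete, here is a minimal sketch of such a query against a hypothetical Apache Solr collection, issued from Python with the requests library; the collection name ("handbook") and the field names are assumptions made for this example.

```python
import requests

# Hypothetical Solr collection; the q parameter carries the scoring (free-text)
# part of the query, while the repeated fq parameters carry the Boolean filter part.
SOLR_URL = "http://localhost:8983/solr/handbook/select"

params = {
    "q": "workplace policy on dogs",               # scoring part: used to rank documents
    "df": "content",                               # default field to search in
    "fq": ["language:en", "audience:employees"],   # filter part: narrows candidates, does not affect scores
    "rows": 10,
}

response = requests.get(SOLR_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```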

There are several different models for representing documents and queries, as well as various metrics for measuring relevancy. Here is a brief overview of the traditional model and its associated metrics.

Overview of Traditional Information Retrieval Models

This chapter contains technical details that may be difficult for non-technical readers to understand. As this information is not essential for understanding the limitations of traditional IR models and the principles behind neural search, you may choose to skip to the next chapter.

Apache Lucene supports a number of pluggable scoring models. One of the most popular is the Vector Space Model (VSM) of Information Retrieval. VSM operates by breaking down documents and queries into text fragments known as “terms”. A term may be a single word or a longer phrase, optionally normalized, e.g., by lemmatization. In general, the idea behind the VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. The intuition behind this approach is that a match on a rare term (such as “Lucene”) is more important than a match on a common term (such as “we”). By default, Lucene uses the tf-idf numerical statistic with VSM to calculate the relevance score for a single term. The overall relevance of a document for a query is calculated as the sum of the scores for each query term.
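
As an illustration of the idea (a simplified sketch, not Lucene's exact scoring formula, which also includes document-length normalization and other factors), here is a toy tf-idf calculation in Python:

```python
import math

# A toy corpus of three "documents".
documents = [
    "we use lucene for search",
    "we like search engines",
    "we went for a walk",
]

def tf_idf(term, document, corpus):
    tf = document.split().count(term)                     # term frequency in this document
    df = sum(1 for doc in corpus if term in doc.split())  # number of documents containing the term
    idf = math.log(len(corpus) / (1 + df)) + 1            # rare terms get a higher weight
    return tf * idf

query_terms = ["we", "lucene"]
for doc in documents:
    score = sum(tf_idf(term, doc, documents) for term in query_terms)
    print(f"{score:.2f}  {doc}")  # the document containing the rare term "lucene" scores highest
```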

Lucene also supports probabilistic models such as Okapi BM25 and Divergence From Randomness (DFR). While these models have a different theoretical foundation than VSM, they generally produce similar results and share the same limitations as VSM, which will be discussed in the next chapter.

The Shortcomings of Traditional IR Models

Here are some limitations of traditional IR Models:

  1. Lack of support for synonyms out of the box: for example, a query about "kids" would not match a document about "children" unless these terms are explicitly defined as synonyms in the search engine's resource file.

  2. Lack of support for semantically related words: for example, a query about "puppy" would not match a document about "dogs", and a query about a "workplace policy on dogs" would not match a "policy on pets". Traditional document models are not aware of semantic similarity.

  3. High sensitivity to typographical and spelling errors.

  4. Lack of distinction between homographs: for example, the query "What are the things to consider when choosing a bank?" would match a document discussing a river bank, even though it is clear from the context that the query is about financial institutions.

  5. A bag-of-words approach that does not take word order into account: for example, the query "Does a US traveler to Serbia need a visa?" would incorrectly match documents answering whether a traveler from Serbia to the US needs a visa.

  6. Lack of support for paraphrases: for example, the query "Can you get medicine for someone pharmacy?" would not match a document answering the same question expressed as "Can a patient have a friend or family member pick up a prescription?" because there is no match on any important terms between the two.

The term "neural search" is a less formal version of "neural information retrieval," which was first introduced at a research workshop at the SIGIR 2016 conference on using deep neural networks in the field of information retrieval.

Neural search is a type of search technology that uses artificial neural networks to improve the accuracy and relevance of search results. Unlike traditional IR models, which are based on keywords, neural search uses machine learning to understand the meaning and context of search queries, and to generate more relevant and personalized results. This can be particularly helpful in situations where the search query is complex or ambiguous and a traditional IR model may struggle to return accurate results.

Deep neural networks are good at providing a representation of textual data that captures word and document semantics, allowing a machine to say which words and documents are semantically similar, overcoming the shortcomings of traditional document models. These representations are known as “embeddings”.

Text Embeddings

Since the internals of computer systems are designed to work with numbers, one of the main challenges in natural language processing (NLP) is how to convert text into a numerical form that an NLP algorithm can process and use to solve real-world tasks. Embeddings are the state-of-the-art solution to this problem.

Embeddings are learned vector representations of discrete, categorical variables, such as words. In computer science, a "vector" refers to a one-dimensional array data structure, like [0.45, 0.32, -0.59…]. A typical word embedding is an array containing a few hundred or a few thousand floating-point numbers (floats). Useful embeddings map similar categorical variables to points that are close to each other in the vector space. As a result, complex real-world tasks related to similarity, such as recommendations or search based on a query, can be reduced to a nearest neighbor search (NNS), a well-known optimization problem of finding the point in a given vector space that is closest to a given point. Embeddings and NNS can be used with any type of categorical variable (such as image embeddings or song embeddings), but this document focuses on words and text in general.
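
The following toy sketch illustrates the idea with hand-written three-dimensional vectors standing in for real embeddings and an exhaustive nearest neighbor search:

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real word embeddings have hundreds of dimensions
# and are produced by a trained model, not written by hand.
embeddings = {
    "puppy": np.array([0.9, 0.8, 0.1]),
    "dog":   np.array([0.8, 0.9, 0.2]),
    "bank":  np.array([0.1, 0.2, 0.9]),
}

def nearest_neighbor(query_vector, vectors):
    # Exhaustive nearest-neighbor search using Euclidean distance.
    return min(vectors, key=lambda word: np.linalg.norm(vectors[word] - query_vector))

query = np.array([0.82, 0.88, 0.18])        # a vector representing something "dog-like"
print(nearest_neighbor(query, embeddings))  # prints "dog"
```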

While it may be a useful exercise to manually generate small embeddings for a small number of words, good embeddings are usually created using machine learning models. Word2vec (developed by Google in 2013) and GloVe (developed by Stanford in 2014) were some of the first successful algorithms for generating word embeddings using machine learning. These were followed by algorithms like doc2vec, which can create embeddings of variable-length pieces of text such as sentences, paragraphs, or entire documents.

One major limitation of traditional word embeddings from the 2010s is their lack of context encoding in the resulting vector. These algorithms produce a fixed, pre-trained vector for words with the same spelling, even if their meanings are slightly or completely different (homographs). For example, the already mentioned "bank" as a financial institution and "bank" as a slope bordering a river would have the same Word2vec embedding, even though their meanings are not similar at all.

Contextual embeddings are a type of text embedding that overcomes this limitation by taking into account the context in which a word or phrase appears and generating a unique vector for each occurrence of the word based on the surrounding words. This allows the model to capture the meaning of words in a more nuanced and accurate way, and to generate more relevant and optionally personalized results (by including personal preferences in the encoded context). Deep learning language models based on the transformer architecture, like BERT and GPT, are able to generate contextual embeddings.

Sentence embedding is a method for representing sentences, or even paragraphs, as numerical vectors. This is typically done by generating contextual embeddings for the tokens (words, or parts of words) in the sentence and then combining these token vectors in a way that captures the meaning of the sentence. There are several different methods for creating sentence embeddings, including averaging the word vectors, using a recurrent neural network, and training a separate model to generate sentence embeddings.
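
For illustration, the following sketch assumes the sentence-transformers library and its all-MiniLM-L6-v2 model; semantically similar sentences should receive vectors with a high cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Can a patient have a friend or family member pick up a prescription?",
    "Can you get medicine for someone else at the pharmacy?",
    "Does a US traveler to Serbia need a visa?",
]
embeddings = model.encode(sentences)

# Paraphrases end up close to each other in the vector space...
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
# ...while unrelated sentences do not.
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low
```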

One significant constraint of the embeddings generated by transformer-based language models is the maximum length of the input text. By default, BERT models can handle no more than 512 subword tokens, approximately 350-400 words. There are specialized pretrained transformer models for longer documents that can overcome this limitation. A newer GPT-based embedding model can handle input text of up to 8192 tokens, making embeddings more convenient for working with long documents.

Neural Search in Practice

The process of comparing the embeddings of a query with the embeddings of the indexed items to find the most similar items is a key aspect of neural search. In the case of document search, the search engine typically compares the sentence embeddings of the query to the sentence embeddings of the indexed documents. This comparison is performed using a metric such as Euclidean distance, dot product, or cosine similarity. Depending on the applied sentence embedding method and how we want to treat the documents of different lengths, it may be necessary to normalize the vectors before comparison.
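
A small numpy sketch of the metric choice: cosine similarity is simply the dot product of L2-normalized vectors, which is why normalizing the document vectors at indexing time allows the cheaper dot product to be used at query time.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.45, 0.32, -0.59])
doc = np.array([0.40, 0.28, -0.61])

# Normalize once (e.g., at indexing time for the document vector)...
q_norm = query / np.linalg.norm(query)
d_norm = doc / np.linalg.norm(doc)

# ...and the dot product of the normalized vectors equals the cosine similarity.
assert abs(cosine_similarity(query, doc) - float(np.dot(q_norm, d_norm))) < 1e-9
```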

Since generating embeddings, especially sentence embeddings, is computationally intensive and time-consuming, neural search engines commonly generate the embeddings for indexed content during ingestion and store the document's embedding in a dedicated field in the indexed document.

The ingestion process should break real-world documents into indexed documents no longer than the maximum input size of the underlying embedding generator.

Inverted indexes, which are commonly used for fast full-text searches with traditional document retrieval models, are not applicable to embeddings, which are arrays of floats. Given a query embedding vector v that models the information need, the simplest approach to neural search would be to calculate the distance (Euclidean, dot product, etc.) between v and each vector d that represents a document in the corpus of information. This approach is quite expensive, so many approximate strategies are under active research, and some of them have been applied in recent releases (after 2021) of enterprise search engines.
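
One of these approximate strategies is the Hierarchical Navigable Small World (HNSW) graph, which is also available in standalone libraries. The following sketch assumes the hnswlib library and uses random vectors as stand-ins for real document embeddings:

```python
import hnswlib
import numpy as np

dim = 384                                                   # e.g., the size of a sentence embedding
doc_vectors = np.float32(np.random.random((10_000, dim)))   # stand-ins for document embeddings

# Build an approximate nearest-neighbor index over the document vectors.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(doc_vectors), ef_construction=200, M=16)
index.add_items(doc_vectors, np.arange(len(doc_vectors)))
index.set_ef(50)                                            # trade-off between recall and query speed

query_vector = np.float32(np.random.random((1, dim)))       # stand-in for the query embedding v
labels, distances = index.knn_query(query_vector, k=5)
print(labels, distances)                                    # ids and distances of the 5 closest documents
```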

A search engine may use embeddings in combination with traditional retrieval models to improve the accuracy of search results. For example, a search engine may use a traditional filter query to narrow down the documents that need to be scored, then VSM or BM25 for preliminary ranking, and finally embeddings for re-ranking of the top n hits.
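
A minimal sketch of such a hybrid pipeline, assuming the rank_bm25 and sentence-transformers libraries and a toy three-document corpus:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Our workplace policy on pets allows small dogs in the office.",
    "Travelers from Serbia to the US may need a visa.",
    "Associates are paid bi-weekly every other Friday.",
]
query = "workplace policy on dogs"

# Step 1: preliminary ranking with BM25 over tokenized text.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
top_n = bm25.get_top_n(query.lower().split(), corpus, n=2)

# Step 2: re-rank the top-n hits by embedding similarity to the query.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode(query)
reranked = sorted(top_n, key=lambda doc: -float(util.cos_sim(query_embedding, model.encode(doc))))
print(reranked[0])
```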

Since v9.0 (December 2021), Apache Lucene supports embeddings at a low level, indexing high-dimensional numeric vectors and performing nearest-neighbor search using the Hierarchical Navigable Small World (HNSW) graph algorithm.

Since v9.0 (May 2022), Apache Solr supports queries with embeddings through the DenseVectorField field type and the K-Nearest-Neighbor (KNN) query parser.
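
The following sketch shows what such a KNN query might look like from Python, assuming a hypothetical Solr 9 collection whose schema defines a DenseVectorField named content_vector, and a toy four-dimensional query vector; in practice the query vector would come from the same embedding model used at indexing time.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/handbook/select"
query_vector = [0.12, -0.53, 0.28, 0.91]  # toy vector; real ones have hundreds of dimensions

params = {
    # The knn query parser finds the topK documents closest to the given vector.
    "q": "{!knn f=content_vector topK=10}" + str(query_vector),
    "fl": "id,title,score",
}

response = requests.get(SOLR_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["score"], doc.get("title"))
```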

Tuesday, November 8, 2022

Extractive Question Answering

 Abstract

Question answering (QA) is one of the most exciting applications of natural language processing (NLP). In an ideal scenario, a QA system should be able to answer questions directly from a collection of human-readable documents, such as Word, PDF, or HTML, without the need for time-consuming preprocessing steps like extracting question-answer pairs (FAQs). To provide an accurate answer to a question, the system must locate the most relevant chapter from the provided corpus (using "smart search"), and then extract (extractive QA) or generate (generative QA) an answer using the selected chapter as context.

The advances in deep neural network technology in the late 2010s, especially the "transformer" architecture, have made it possible to develop extractive QA systems. However, in order to make these systems ready for use in production, a machine learning practitioner must address issues of scalability and latency.

In the 2020s, major vendors like Google, Amazon, and Microsoft have made significant improvements to their cloud-based machine learning infrastructure offerings. Platforms like Amazon SageMaker, part of the AWS suite, offer high-performance ML at scale and at reasonable prices. It looks like the time has come, even for small and midsize AI/NLP vendors, to use extractive QA in their products and platforms.


What is Extractive Question Answering?

Machine Reading Comprehension is the field of NLP where we teach machines to understand (or "understand") unstructured text. One of the reading comprehension skills is the ability to locate a segment of text, or span, in the corresponding reading passage ("context") that represents the answer to a posed question. This skill is known as extractive question answering (extractive QA) because the answer is extracted from the text rather than generated. For example, the span representing the answer to the question "When do associates get paid?" is highlighted in the following passage:

Associates are paid bi-weekly every other Friday. Each paycheck will include earnings for all work performed through the end of the previous payroll period, less applicable deductions required by law or authorized by you.

Since 2018, machine learning-based extractive QA systems have started to outperform humans on various benchmarks.

Here's how extractive QA fits into a typical NLP pipeline:

  • The user enters an input, typically a question.
  • The system generates a search query based on the input and uses an underlying search engine (e.g., a web search engine like Google Search or an enterprise search engine like Solr) that employs an information retrieval strategy to find content related to the question.
  • Extractive QA provides an exact answer to the user's question, using the matched content as context, and outputs a span from the context that is the most likely answer to the posed question.
  • The system may return the span as a response to the user on its own, or it may highlight the span in the relevant passage or summary, as in the previous example.

The Technology Behind Extractive QA

Extractive QA may be considered a classification task, where the system has to select the correct span from all possible spans in the passage. Machine learning models can be trained on manually created datasets containing passages, questions, and spans that represent the answers. These models can then predict (or "infer") the spans that represent answers to previously unseen questions, using previously unseen passages as context.

In the inference phase, an extractive QA ML model takes as input two strings: a question and a context (a reading passage). The model's output, or "prediction," is a span from the context that is the most likely answer to the question. Typically, the state-of-the-art EQA models are based on a transformer deep learning language model, which is pre-trained using unsupervised learning on a large text corpus (e.g., BERTBASE pre-trained on English Wikipedia and BookCorpus). The model is then fine-tuned for EQA on a training dataset like SQuAD using supervised learning and a technique called transfer learning. Let's examine the concepts mentioned in the previous sentence in more detail.
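
For illustration, here is a minimal sketch using the Hugging Face transformers library and a publicly shared model fine-tuned for extractive QA on SQuAD (distilbert-base-cased-distilled-squad), applied to the payroll passage quoted earlier:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "Associates are paid bi-weekly every other Friday. Each paycheck will include "
    "earnings for all work performed through the end of the previous payroll period, "
    "less applicable deductions required by law or authorized by you."
)
result = qa(question="When do associates get paid?", context=context)

# The prediction is a span extracted from the context, with a confidence score.
print(result["answer"], result["score"])
```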

Transformers

NLP involves processing sequences of data. For example, a sentence is a sequence of words, and each word is a sequence of letters. These sequences can vary in length, and the order of the data matters. Until recently, ML practitioners used various flavors of Recurrent Neural Networks (RNNs) to train ML models that take sequential data as input. There are two main limitations of RNNs: the training is hard to parallelize, and the dependencies between distant elements of a sequence (e.g., between the words at the beginning and at the end of a long sentence) tend to vanish. In 2017, the Google Brain team introduced the “transformer”, a new deep learning architecture designed to process sequential input data. Unlike RNNs, transformers do not have a recurrent structure, so training can be parallelized. The parallelized processing allows transformers to unleash the power of multi-core processing systems and to use much larger training sets than RNNs. The transformer keeps track of dependencies using a technique called "attention", which eliminates the vanishing of dependencies between distant elements of a sequence.

Transfer Learning

Transfer learning is a machine learning technique that involves reusing elements of a pre-trained model in a new machine learning model. Typically, a pre-trained model learns general knowledge using unsupervised learning on a large, unlabeled training set, and transfers this knowledge to a model that is trained on a labeled dataset to perform a specific task. This approach reduces the resources and the amount of labeled data required to train new models. Transfer learning means that training does not need to be started from scratch for every new task. Training new machine learning models can be resource-intensive, so transfer learning saves both resources and time. The technique has been successfully used in the image processing and NLP domains.

When applied to deep neural networks like transformers, which have many layers, transfer learning typically involves unfreezing a few of the top layers of a frozen pre-trained base model and jointly training both the newly added classifier layers and the last layers of the base model. This allows us to "fine-tune" the higher-order feature representations in the base model in order to make them more relevant for the specific task.
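
A minimal sketch of this partial unfreezing, assuming the Hugging Face transformers library and BERT-style parameter names (the exact names depend on the checkpoint):

```python
from transformers import AutoModelForQuestionAnswering

# The QA head ("qa_outputs") is newly added on top of the pre-trained encoder.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

for name, param in model.named_parameters():
    # Train only the QA head and the last two encoder layers; freeze everything else.
    param.requires_grad = (
        name.startswith("qa_outputs")
        or ".encoder.layer.10." in name
        or ".encoder.layer.11." in name
    )

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{trainable} trainable parameters")
```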

BERT

In 2018, another team from Google presented Bidirectional Encoder Representations from Transformers (BERT), a transformer-based machine learning technique for NLP that enables generic pre-training of a model on a large unlabeled document corpus and fine-tuning of the pre-trained model for a specific NLP task (e.g., question answering, summarization, sentiment analysis) with a smaller labeled training set, using transfer learning. A pre-trained language model is the machine equivalent of a ‘well-read’ human being. It is fed a large number of unannotated documents (for example, the complete Wikipedia), which allows the model to learn how various words are used and how the language is written in general. Such a pre-trained model may then be fine-tuned for various NLP tasks, just like an athlete may be trained for a specific sport after general physical preparation. The BERTBASE language model is trained on English Wikipedia and BookCorpus.

The generic pre-training of a language model doesn’t require any labeled training set, dictionaries, ontologies, or any other kind of structured resources. It just needs a huge corpus of plain text documents and a lot of processing power for unsupervised training on the predefined tasks. BERT was pre-trained on two tasks: language modeling (15% of tokens were masked, and BERT was trained to predict them from context) and next-sentence prediction (BERT was trained to predict whether a chosen next sentence was probable or not, given the first sentence), using English Wikipedia and BookCorpus as the training set. As a result of the training process, BERT learns contextual embeddings for words. Pre-training lasts several days on high-end virtual machines and costs several thousand dollars if performed on rented computing infrastructure. However, you don't have to pre-train a BERT model on your own. Many variants of pre-trained BERT models, with different trade-offs between the size and the quality of the model, trained on various text corpora and in various languages, are available online for free.
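
The masked language modeling task can be illustrated with the Hugging Face fill-mask pipeline, assuming the bert-base-uncased checkpoint, whose mask token is [MASK]:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from its context, which is exactly the task it was pre-trained on.
for prediction in fill_mask("Associates are paid every other [MASK]."):
    print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')
```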

BERT Fine-Tuned for Extractive QA

A pre-trained BERT language model may be fine-tuned for extractive QA using supervised learning and a training set of question-answer pairs. Creating such a training set can be expensive, but fortunately, a team of researchers from Stanford University published the Stanford Question Answering Dataset (SQuAD) in 2016. It contains 107,785 question-answer pairs generated by crowd workers on a set of 536 Wikipedia articles. Fine-tuning a BERT language model for extractive QA using SQuAD takes a few hours.
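
For illustration, the dataset can be inspected with the Hugging Face datasets library (an assumption for this sketch, not part of the original SQuAD release):

```python
from datasets import load_dataset

squad = load_dataset("squad")
example = squad["train"][0]

print(example["question"])          # the question posed by a crowd worker
print(example["context"][:200])     # the beginning of the Wikipedia passage used as context
print(example["answers"])           # the answer text and its start offset in the context
```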

Fine-tuned models, like general-purpose BERT models, can be saved and shared on the Internet. The Hugging Face model hub is the most popular platform for model-sharing, with tens of thousands of models of different sizes, for different languages and use cases, including several variants of BERT language models fine-tuned for extractive QA on SQuAD dataset.

Productization

During the process of making predictions (inference), an extractive QA machine learning model identifies the span of text from the context that is most likely to be the answer to the posed question. The latency and scalability of inference depend on the model's size and architecture, as well as on the provided computing resources. While a single user may experience only a moderate delay (1-2 seconds) when using this type of model on a general-purpose server, such a setup may not be able to handle multiple users making requests simultaneously.

Specialized hardware based on GPUs or TPUs suits ML inference much better than general-purpose processing units (CPUs). Cloud platforms like Amazon AWS and Microsoft Azure offer cost-effective and feature-rich options for building and maintaining machine learning infrastructure, so it can be cheaper to use these services than to build everything in-house. These vendors often offer innovative solutions, like virtual servers that are charged only for active processing time, and they have access to specialized talent in areas like scalability and security.

Tuesday, March 29, 2016

O*Net vs ESCO

Here's a brief comparative analysis of the two most important public taxonomies related to skills and occupations.


Summary

O*Net:
  • Aligned with US standards;
  • RDBMS compatible;
  • Interests profiler tool and other tools that may be embedded in a custom solution;
  • API to open training, certification and job opportunities in the US;
  • Covers general skills, abilities, interests, work values, work styles, tools and technologies and relates them with occupations.

ESCO:
  • Aligned with EU standards;
  • Covers skills through a deep and wide hierarchy of skills;
  • Doesn't cover abilities, interests, work values and work styles;
  • Multilingual.

Description

O*Net: The O*NET (Occupational Information Network) program is the US's primary source of occupational information. Central to the project is the O*NET database, containing information on hundreds of standardized and occupation-specific descriptors. The database, which is available to the public at no cost, is continually updated by surveying a broad range of workers from each occupation. Information from this database forms the heart of O*NET OnLine, an interactive application for exploring and searching occupations. The database also provides the basis for Career Exploration Tools, a set of valuable assessment instruments for workers and students looking to find or change careers. O*NET is being developed under the sponsorship of the US Department of Labor/Employment and Training Administration (USDOL/ETA) through a grant to the North Carolina Department of Commerce.

ESCO: ESCO is the multilingual classification of European Skills, Competences, Qualifications and Occupations, and is part of the Europe 2020 strategy. The ESCO classification identifies and categorises skills, competences, qualifications and occupations relevant for the EU labour market and for education and training, and systematically shows the relationships between the different concepts. ESCO has been developed in an open IT format, is available for use free of charge by everyone and can be accessed via the ESCO portal. The first version of ESCO was published on 23 October 2013. This release marks the beginning of the pilot and testing phase, including the ESCO mapping pilot. By the end of 2016 the classification will be completely revised, and the final product will be launched as ESCO v1.

Data coverage

O*Net: 974 occupations from the Standard Occupational Classification (SOC) system used by US federal statistical agencies, described using 277 "descriptors" organized into "The Content Model". The model contains: required abilities (e.g. speech clarity, near/far vision), occupational interests, work values, work styles, basic skills (e.g. walking), cross-functional skills, domains of knowledge (e.g. psychology), items related to prior educational experience required to perform in a job, items related to experience requirements, items related to occupational requirements, items related to occupation-specific information, "job zones" that describe how much preparation the job requires, and tasks for each occupation. "Tools and technologies" are related to the United Nations Standard Products and Services Code (UNSPSC) taxonomy.

ESCO: 5380 occupations, 5737 skills (e.g. C# programming), 20 qualifications (just a proof of concept for a narrow domain at the moment, not useful) and their relations. Occupations, skills and qualifications are organized into hierarchies, e.g. "nurse" is part of the following occupation hierarchy: Technicians and associate professionals -> Health associate professionals -> Nursing and midwifery associate professionals -> Nursing associate professionals -> Nurse, medicine/surgery.

Download format

O*Net: The O*NET database is provided in five formats:
  • Microsoft Excel (XLSX)
  • Tab-delimited text files
  • SQL files for MySQL, PostgreSQL, or compatible relational databases
  • SQL files for Microsoft SQL Server
  • SQL files for Oracle Database

ESCO: The ESCO classification is currently available for download in three data formats:
  • SKOS/RDF format: full dataset with all concepts and relationships in all languages; works fine with the Virtuoso triplestore
  • CSV format: partial dataset with relationships or with concepts from one ESCO pillar in one language, e.g. for import into Microsoft Excel
  • XML format: partial dataset with relationships or with concepts from one ESCO pillar in one language

Additional tools

O*Net: RESTful web service API (XML response format) with endpoints for searching occupations by keyword, industry, bright outlook, etc. It also allows embedding the "O*NET Interest Profiler", which suggests careers based on work activity preferences (acquired through a questionnaire).

ESCO: none.

Usage scenario

O*Net: Each user gets a "job interest" profile based on:
  • explicit selection of "descriptors" (abilities, skills, values, etc.) from the GUI,
  • implicit detection of interests using the O*Net Interest Profiler,
  • implicit detection of "descriptors" using NLP against the textual profile.
Assuming that each job is assigned an occupation from the Standard Occupational Classification (SOC), "descriptors" are implicitly assigned to the job using data from the O*Net database. Jobs are recommended to users by comparing the descriptors assigned to the users with the descriptors assigned to the jobs' occupations.

ESCO: A set of skills is stored for each user, either by explicit selection from a hierarchy of skills in the GUI or by implicit detection/recommendation of skills using NLP against the textual profile. Each job is classified into an ESCO occupation, manually or using NLP. Jobs are recommended to users by comparing the users' skills with the skills required for the occupations assigned to the jobs.