Tuesday, November 8, 2022

Extractive Question Answering

 Abstract

Question answering (QA) is one of the most exciting applications of natural language processing (NLP). In an ideal scenario, a QA system should be able to answer questions directly from a collection of human-readable documents, such as Word, PDF, or HTML files, without time-consuming preprocessing steps like extracting question-answer pairs (FAQs). To provide an accurate answer, the system must first locate the most relevant passage in the provided corpus (using "smart search"), and then extract (extractive QA) or generate (generative QA) an answer using the selected passage as context.

Advances in deep neural network technology in the late 2010s, especially the "transformer" architecture, have made it possible to develop extractive QA systems. However, to make these systems production-ready, a machine learning practitioner must address scalability and latency.

In the 2020s, major vendors like Google, Amazon, and Microsoft have made significant improvements to their cloud-based machine learning infrastructure offerings. Platforms like Amazon SageMaker, part of the AWS suite, offer high-performance ML at scale and at reasonable prices. The time has come, even for small and midsize AI/NLP vendors, to use extractive QA in their products and platforms.


What is Extractive Question Answering?

Machine Reading Comprehension is the field of NLP where we teach machines to understand (or “understand”) unstructured text. One reading comprehension skill is the ability to locate a segment of text, or span, in the corresponding reading passage (“context”) that represents the answer to a posed question. This skill is known as extractive question answering (extractive QA) because the answer is extracted from the text rather than generated. For example, the span representing the answer to the question "When do associates get paid?" is highlighted in the following passage:

Associates are paid bi-weekly every other Friday. Each paycheck will include earnings for all work performed through the end of the previous payroll period, less applicable deductions required by law or authorized by you.

Since 2018, machine learning-based extractive QA systems have started to outperform humans on various benchmarks.

Here's how extractive QA fits into a typical NLP pipeline (a minimal code sketch follows the list):

  • The user enters an input, typically a question.
  • The system generates a search query based on the input and uses an underlying search engine (e.g., a web search engine like Google Search or an enterprise search engine like Solr) that employs an information retrieval strategy to find content related to the question.
  • The extractive QA component provides an exact answer to the user's question: using the matched content as context, it extracts the span that most likely answers the question.
  • The system may return the span as a response to the user on its own, or it may highlight the span in the relevant passage or summary, as in the previous example.
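
As an illustration, here is a minimal sketch of such a pipeline in Python, assuming the scikit-learn and Hugging Face transformers libraries are installed. The toy corpus is invented, TF-IDF stands in for the "smart search" step, and distilbert-base-cased-distilled-squad is just one of many publicly shared checkpoints fine-tuned for extractive QA:

    # A minimal sketch of the pipeline described above: TF-IDF "smart search"
    # over a toy corpus, followed by extractive QA on the best-matching passage.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from transformers import pipeline

    corpus = [
        "Associates are paid bi-weekly every other Friday. Each paycheck will "
        "include earnings for all work performed through the end of the previous "
        "payroll period, less applicable deductions required by law or authorized by you.",
        "Associates accrue paid time off at a rate of one day per month of service.",
    ]

    question = "When do associates get paid?"

    # Step 1: retrieve the most relevant passage (the "smart search" step).
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(corpus)
    query_vector = vectorizer.transform([question])
    best_idx = cosine_similarity(query_vector, doc_vectors).argmax()
    context = corpus[best_idx]

    # Step 2: extract the answer span from the retrieved passage.
    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    prediction = qa(question=question, context=context)
    print(prediction["answer"])   # e.g. "bi-weekly every other Friday"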

The Technology Behind Extractive QA

Extractive QA may be considered a classification task, where the system has to select the correct span from all possible spans in the passage. Machine learning models can be trained on manually created datasets containing passages, questions, and spans that represent the answers. These models can then predict (or "infer") the spans that represent answers to previously unseen questions, using previously unseen passages as context.

In the inference phase, an extractive QA ML model takes as input two strings: a question and a context (a reading passage). The model's output, or "prediction," is a span from the context that is the most likely answer to the question. Typically, state-of-the-art extractive QA models are based on a transformer deep learning language model that is pre-trained using unsupervised learning on a large text corpus (e.g., BERT-Base, pre-trained on English Wikipedia and BookCorpus). The model is then fine-tuned for extractive QA on a training dataset like SQuAD, using supervised learning and a technique called transfer learning. Let's examine the concepts mentioned in the previous sentence in more detail.
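
Here is a rough sketch of the inference step at a slightly lower level, assuming the Hugging Face transformers library and one of the publicly shared SQuAD-style checkpoints (deepset/bert-base-cased-squad2 is used here only as an example). The model scores every token as a possible start and end of the answer, and the sketch decodes the span greedily:

    # Sketch of the inference step: the model scores every token as a possible
    # start and end of the answer span, and we pick the most likely pair.
    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    model_name = "deepset/bert-base-cased-squad2"   # one of many SQuAD-style checkpoints
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    question = "When do associates get paid?"
    context = ("Associates are paid bi-weekly every other Friday. Each paycheck will "
               "include earnings for all work performed through the end of the previous "
               "payroll period, less applicable deductions required by law or authorized by you.")

    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Greedy decoding: most likely start token and most likely end token.
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    answer_tokens = inputs["input_ids"][0][start : end + 1]
    print(tokenizer.decode(answer_tokens))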

Transformers

NLP involves processing sequences of data. For example, a sentence is a sequence of words, and each word is a sequence of letters. These sequences can vary in length, and the order of the data matters. Until recently, ML practitioners used various flavors of Recurrent Neural Networks (RNNs) to train ML models that take sequential data as input. RNNs have two main limitations: training is hard to parallelize, and dependencies between distant elements of a sequence (e.g., between the words at the beginning and at the end of a long sentence) tend to vanish. In 2017, the Google Brain team introduced the “transformer”, a new deep learning architecture designed to process sequential input data. Unlike RNNs, transformers do not have a recurrent structure, so training can be parallelized. Parallelized processing allows transformers to unleash the power of multi-core processing systems and use much larger training sets than RNNs. The transformer keeps track of dependencies using a technique called "attention", which prevents dependencies between distant elements of a sequence from vanishing.
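
As an illustration of the attention idea, here is a minimal NumPy sketch of single-head scaled dot-product attention, without masking, batching, or the learned projection matrices a real transformer would use:

    # Minimal single-head scaled dot-product attention (no masking, no batching),
    # the core operation that lets a transformer relate any two positions directly.
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query to every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                          # weighted sum of values

    # Toy example: a "sentence" of 4 tokens, each embedded in 8 dimensions.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))
    # In a real transformer, Q, K, and V are learned linear projections of x.
    print(scaled_dot_product_attention(x, x, x).shape)   # (4, 8)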

Transfer Learning

Transfer learning is a machine learning technique that involves reusing elements of a pre-trained model in a new machine learning model. Typically, a pre-trained model learns general knowledge using unsupervised learning on a large, unlabeled training set and transfers this knowledge to a model that is trained on a labeled dataset to perform a specific task. This approach reduces the resources and the amount of labeled data required to train new models. Transfer learning means that training does not need to start from scratch for every new task. Training new machine learning models can be resource-intensive, so transfer learning saves both resources and time. The technique has been used successfully in the image processing and NLP domains.

When applied to deep neural networks like transformers, which have many layers, transfer learning typically involves unfreezing a few of the top layers of a frozen pre-trained base model and jointly training both the newly added classifier layers and the last layers of the base model. This allows us to "fine-tune" the higher-order feature representations in the base model in order to make them more relevant for the specific task.
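
A minimal PyTorch-style sketch of this freeze/unfreeze pattern, assuming a pre-trained BERT encoder from the Hugging Face transformers library and an invented two-class classification head (a real setup would also need data and a training loop):

    # Sketch of transfer learning with a pre-trained transformer: freeze the base
    # model, unfreeze its top layers, and add a new task-specific head.
    import torch.nn as nn
    from transformers import AutoModel

    base = AutoModel.from_pretrained("bert-base-uncased")   # pre-trained encoder

    # 1. Freeze all parameters of the pre-trained base model.
    for param in base.parameters():
        param.requires_grad = False

    # 2. Unfreeze the last few encoder layers for fine-tuning.
    for layer in base.encoder.layer[-2:]:
        for param in layer.parameters():
            param.requires_grad = True

    # 3. Add a new, randomly initialized task-specific head (here: 2 classes).
    classifier = nn.Linear(base.config.hidden_size, 2)

    # Only the unfrozen layers and the new head will be updated during training.
    trainable = [p for p in list(base.parameters()) + list(classifier.parameters())
                 if p.requires_grad]
    print(sum(p.numel() for p in trainable), "trainable parameters")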

BERT

In 2018, another team from Google presented Bidirectional Encoder Representations from Transformers (BERT), a transformer-based machine learning technique for NLP. BERT enables generic pre-training of a model on a large unlabeled document corpus, and fine-tuning of the pre-trained model for a specific NLP task (e.g., question answering, summarization, sentiment analysis…) with a smaller labeled training set, using transfer learning. A pre-trained language model is the machine equivalent of a ‘well-read’ human being. It is fed a large number of unannotated documents (for example, the complete Wikipedia), which allows the model to learn how various words are used and how the language is written in general. Such a pre-trained model may then be fine-tuned for various NLP tasks, just as an athlete may be trained for a specific sport after general physical preparation. The BERT-Base language model is trained on English Wikipedia and BookCorpus.

The generic pre-training of a language model doesn’t require any labeled training set, dictionaries, ontologies, or other structured resources. It just needs a huge corpus of plain text documents and a lot of processing power for unsupervised training on the predefined tasks. BERT was pre-trained on two tasks: masked language modeling (15% of tokens were masked, and BERT was trained to predict them from context) and next sentence prediction (BERT was trained to predict whether a chosen next sentence was probable or not, given the first sentence), using English Wikipedia and BookCorpus as the training set. As a result of the training process, BERT learns contextual embeddings for words. Pre-training lasts several days on high-end virtual machines and costs several thousand dollars if performed on rented computing infrastructure. However, you don't have to pre-train a BERT model on your own. Many variants of pre-trained BERT models, with different trade-offs between size and quality, trained on various text corpora and in various languages, are available online for free.
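
The masked-token objective is easy to demonstrate with the transformers fill-mask pipeline and a pre-trained BERT checkpoint; the example sentence below is invented:

    # Illustration of the masked language modeling objective BERT was pre-trained on:
    # the model predicts the token hidden behind [MASK] from its context.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    for prediction in fill_mask("Associates are paid every other [MASK]."):
        print(f'{prediction["token_str"]:>10}  {prediction["score"]:.3f}')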

BERT Fine-Tuned for Extractive QA

A pre-trained BERT language model may be fine-tuned for extractive QA using supervised learning and a training set of question-answer pairs. Creating such a training set can be expensive, but fortunately, a team of researchers from Stanford University published the Stanford Question Answering Dataset (SQuAD) in 2016. It contains 107,785 question-answer pairs generated by crowd workers on a set of 536 Wikipedia articles. Fine-tuning a BERT language model for extractive QA using SQuAD takes a few hours.
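
For the curious, here is a condensed sketch of what such fine-tuning can look like with the Hugging Face datasets and transformers libraries. The hyperparameters are illustrative, and long contexts are simply truncated rather than split into overlapping windows as a production setup would do:

    # Condensed sketch of fine-tuning a pre-trained BERT checkpoint for extractive QA
    # on SQuAD with the Hugging Face datasets / transformers libraries.
    from datasets import load_dataset
    from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                              Trainer, TrainingArguments, default_data_collator)

    squad = load_dataset("squad")                  # SQuAD 1.1: train + validation splits
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def preprocess(examples):
        # Tokenize question + context together; keep character offsets so that the
        # answer's character positions can be mapped to token positions (the labels).
        inputs = tokenizer(examples["question"], examples["context"],
                           max_length=384, truncation="only_second",
                           padding="max_length", return_offsets_mapping=True)
        start_positions, end_positions = [], []
        for i, offsets in enumerate(inputs["offset_mapping"]):
            answer = examples["answers"][i]
            start_char = answer["answer_start"][0]
            end_char = start_char + len(answer["text"][0])
            sequence_ids = inputs.sequence_ids(i)          # 0 = question, 1 = context
            context_start = sequence_ids.index(1)
            context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
            if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
                # The answer was truncated away: label the span as (0, 0), i.e. [CLS].
                start_positions.append(0)
                end_positions.append(0)
            else:
                idx = context_start
                while idx <= context_end and offsets[idx][0] <= start_char:
                    idx += 1
                start_positions.append(idx - 1)
                idx = context_end
                while idx >= context_start and offsets[idx][1] >= end_char:
                    idx -= 1
                end_positions.append(idx + 1)
        inputs["start_positions"] = start_positions
        inputs["end_positions"] = end_positions
        inputs.pop("offset_mapping")
        return inputs

    tokenized = squad.map(preprocess, batched=True,
                          remove_columns=squad["train"].column_names)

    model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
    args = TrainingArguments(output_dir="bert-finetuned-squad", learning_rate=3e-5,
                             num_train_epochs=2, per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized["train"],
                      eval_dataset=tokenized["validation"],
                      data_collator=default_data_collator,
                      tokenizer=tokenizer)
    trainer.train()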

Fine-tuned models, like general-purpose BERT models, can be saved and shared on the Internet. The Hugging Face model hub is the most popular platform for model sharing, with tens of thousands of models of different sizes, for different languages and use cases, including several variants of BERT language models fine-tuned for extractive QA on the SQuAD dataset.

Productization

During inference (the process of making predictions), an extractive QA machine learning model identifies the span of text in the context that is most likely to answer the posed question. The latency and scalability of inference depend on the model’s size and architecture, as well as on the available computing resources. While a single user may experience only a moderate delay (1-2 seconds) when using such a model on a general-purpose server, the server may not be able to handle multiple users making requests simultaneously.
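
A quick, unscientific way to get a feel for single-request latency on your own hardware, assuming the transformers pipeline used in the earlier sketches (numbers vary widely with the model and the machine):

    # Rough single-request latency measurement on CPU for an extractive QA model.
    import time
    from transformers import pipeline

    qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
    question = "When do associates get paid?"
    context = " ".join(["Associates are paid bi-weekly every other Friday."] * 20)

    qa(question=question, context=context)          # warm-up call (loading, caching)
    start = time.perf_counter()
    for _ in range(10):
        qa(question=question, context=context)
    print(f"average latency: {(time.perf_counter() - start) / 10:.2f} s")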

Specialized hardware based on GPUs or TPUs suits ML inference much better than general-purpose processors (CPUs). Cloud platforms like Amazon AWS and Microsoft Azure offer cost-effective and feature-rich options for building and maintaining machine learning infrastructure, including scalability and security, and using these services is often cheaper than building everything in-house. These vendors also offer innovative solutions, such as virtual servers that are charged only for active processing time, and they have access to specialized talent in areas like scalability and security.
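
As a closing illustration, here is a heavily simplified sketch of deploying a hub-hosted extractive QA model to a SageMaker real-time endpoint with the sagemaker Python SDK. The IAM role, instance type, and framework versions are placeholders that depend on your AWS account and SDK version:

    # Sketch: deploy a question-answering model from the Hugging Face hub to an
    # Amazon SageMaker real-time inference endpoint (sagemaker Python SDK).
    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel

    role = sagemaker.get_execution_role()   # IAM role; works inside SageMaker environments

    hub_config = {
        "HF_MODEL_ID": "distilbert-base-cased-distilled-squad",  # any QA model from the hub
        "HF_TASK": "question-answering",
    }

    # Framework versions are placeholders; they must match a container image
    # supported by your version of the sagemaker SDK.
    model = HuggingFaceModel(
        env=hub_config,
        role=role,
        transformers_version="4.26",
        pytorch_version="1.13",
        py_version="py39",
    )

    predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

    print(predictor.predict({
        "inputs": {
            "question": "When do associates get paid?",
            "context": "Associates are paid bi-weekly every other Friday.",
        }
    }))

    predictor.delete_endpoint()   # stop paying for the endpoint when done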