Word embeddings like Word2Vec (2013) and GloVe (2014) were, and for several tasks still are, the go-to models. They suit the purpose particularly well when you want to leverage weights learned over a large corpus of data, are constrained by a small training sample, and want to optimise your inference costs.
However, they cannot handle OOV (out-of-vocabulary) words, a limitation made more pronounced by misspellings, which are extremely common in the Conversational AI world. This makes them hard to use for intent classification in the conversational domain.
“pls, i need refend”
In practice, the last word of the above sentence is not in the vocabulary, so its embedding falls back to the unknown-word vector, typically a vector of 0’s.
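A minimal sketch of this failure mode, assuming a toy lookup table: the vocabulary, the 3-dimensional vectors and the helper names `tokenize` and `lookup` are made up for illustration and do not come from any real Word2Vec or GloVe model.

```python
# Toy static-embedding lookup. The vocabulary and vectors are invented;
# real Word2Vec/GloVe tables have hundreds of dimensions.
EMBEDDING_DIM = 3
TOY_VECTORS = {
    "pls": [0.2, 0.1, 0.7],
    "i": [0.5, 0.3, 0.1],
    "need": [0.4, 0.8, 0.2],
    "refund": [0.9, 0.1, 0.6],
}

def tokenize(text):
    """Lowercase, split on whitespace and strip simple punctuation."""
    stripped = (tok.strip(",.!?") for tok in text.lower().split())
    return [tok for tok in stripped if tok]

def lookup(tokens):
    """Map each token to its vector; OOV tokens fall back to all zeros."""
    zero = [0.0] * EMBEDDING_DIM
    return [TOY_VECTORS.get(tok, zero) for tok in tokens]

tokens = tokenize("pls, i need refend")
vectors = lookup(tokens)
for tok, vec in zip(tokens, vectors):
    print(tok, vec)
```

The misspelt "refend" ends up with the same all-zero vector as any other unknown word, so the one token that carries the intent contributes nothing to the sentence representation.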
Character-based embeddings like FastText handle these scenarios much better. We still use them wherever the task doesn’t require too much context.
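To illustrate why character n-grams help with misspellings, here is a simplified sketch of FastText-style sub-word extraction. It assumes n-gram lengths 3 to 6 and `<`/`>` as word boundary markers; the function name `char_ngrams` is hypothetical, and the real library additionally hashes n-grams into buckets and learns a vector for each.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers '<' and '>'."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

correct = char_ngrams("refund")
misspelt = char_ngrams("refend")
shared = correct & misspelt

print(sorted(shared))
print(f"{len(shared)} of {len(misspelt)} n-grams shared")
```

Because a word vector is built from its n-gram vectors, the n-grams that "refend" shares with "refund" keep the misspelt word away from the all-zero unknown vector.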
With contextualised sub-word embeddings like ELMo (Embeddings from Language Models), ULMFiT (Universal Language Model Fine-tuning) and transformer models coming into the picture, embeddings are now several times more powerful at encoding the meanings of words and sentences.
Sebastian Ruder wrote in “The Gradient” that NLP’s ImageNet moment has arrived. While we can go ahead and use BERT-based embeddings, which are now often the de facto standard for a whole lot of downstream tasks, they often fail to deliver results in a niche domain. Finetuning these embeddings on the specific domain is often required to achieve good results.
A comparison of base embeddings on the STS benchmark datasets from the Sentence-BERT paper by Reimers & Gurevych.
In a chatbot context, information retrieval forms a big part of what we do. While we leverage domain-specific models for client-agnostic intent categorisation, client-specific FAQs still need retrieval across hundreds of intents with very few samples.
Recasting this classification task as a similarity problem often has significant advantages, primarily because we now focus on creating the right embedding space rather than classifying across many classes. It also lets us increase the number of training samples, since pairing up the utterances yields a much bigger set to learn from.
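The pairing step can be sketched as follows. The intents and utterances in `FAQ_EXAMPLES` and the function `build_pairs` are hypothetical; the sketch assumes a plain binary label (1 for same intent, 0 for different) and takes every possible pair, whereas a real pipeline would likely subsample the negatives.

```python
from itertools import combinations

# Hypothetical few-shot FAQ data: intent name -> example utterances.
FAQ_EXAMPLES = {
    "refund_status": ["where is my refund", "refund not received yet"],
    "cancel_order": ["cancel my order", "i want to cancel the order"],
    "change_address": ["update my delivery address", "change shipping address"],
}

def build_pairs(examples):
    """Return (text_a, text_b, label) triples: 1 = same intent, 0 = different."""
    pairs = []
    # Positive pairs: every two utterances that share an intent.
    for utterances in examples.values():
        pairs += [(a, b, 1) for a, b in combinations(utterances, 2)]
    # Negative pairs: every two utterances drawn from different intents.
    for (_, utts_a), (_, utts_b) in combinations(examples.items(), 2):
        pairs += [(a, b, 0) for a in utts_a for b in utts_b]
    return pairs

pairs = build_pairs(FAQ_EXAMPLES)
n_samples = sum(len(u) for u in FAQ_EXAMPLES.values())
print(f"{n_samples} utterances -> {len(pairs)} labelled pairs")
```

The number of pairs grows quadratically with the number of utterances, which is where the bigger training set comes from.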
From an engineering perspective, when running these transformer models at scale, a similarity-based approach lends itself to much faster inference, as we can compute and store the embeddings beforehand.
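A rough sketch of that split between offline and online work. The encoder here is only a stand-in: `toy_embed` hashes character trigrams into a small vector so that the example runs without any model, and the names `FAQ`, `INDEX` and `predict` are assumptions for illustration. In a real system `toy_embed` would be replaced by the finetuned sentence encoder.

```python
import math
import zlib

DIM = 64

def toy_embed(text):
    """Stand-in for a sentence encoder: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        vec[zlib.crc32(padded[i:i + 3].encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    """Dot product equals cosine similarity because the vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))

# Offline: embed every FAQ example once and store the result.
FAQ = [
    ("refund_status", "where is my refund"),
    ("cancel_order", "cancel my order"),
    ("change_address", "change my delivery address"),
]
INDEX = [(intent, toy_embed(text)) for intent, text in FAQ]

def predict(query, top_k=1):
    """Online: one encoder call for the query, then cheap vector comparisons."""
    q = toy_embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [intent for intent, _ in ranked[:top_k]]

print(predict("pls, i need refend"))
```

At inference time only the incoming utterance goes through the encoder; everything else is a lookup plus a handful of dot products.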
BERT models predict using a cross-encoder approach, where both sentences of a pair are passed to the network together and the target is predicted. A cross-encoder doesn’t produce generalised sentence embeddings. A bi-encoder, on the other hand, is trained as a siamese network, where the two sentences are passed independently and a fixed-size sentence embedding can be derived for each.
Typically, a bi-encoder approach, which generates standardised embeddings, is preferable to a cross-encoder approach, which increases inference times and latencies. The Sentence-BERT paper by Reimers & Gurevych, which came out in 2019, compares the pros and cons of the two approaches and reports a huge reduction in the effort of finding the most similar pair in a collection of 10,000 sentences: from about 65 hours with BERT to about 5 seconds using SBERT encodings.
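The gap comes down to how many forward passes each approach needs. A back-of-the-envelope sketch, using the collection size of 10,000 sentences from the Sentence-BERT paper (the function names are ours):

```python
def cross_encoder_passes(n):
    """Every unordered sentence pair needs its own forward pass."""
    return n * (n - 1) // 2

def bi_encoder_passes(n):
    """Each sentence is encoded once; pairs are then compared by cosine similarity."""
    return n

n = 10_000
print(cross_encoder_passes(n))  # 49995000
print(bi_encoder_passes(n))     # 10000
```

The cross-encoder cost grows quadratically with the collection size, while the bi-encoder cost grows linearly and leaves only cheap vector comparisons for the pairwise step.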
In the Sentence Encoders on STILTs (Supplementary Training on Intermediate Labeled-data Tasks) paper by Phang et al., the authors supplemented language model-style pretraining with further training on data-rich supervised tasks to show significant gains on most of the GLUE (General Language Understanding Evaluation) tasks. The paper claims that in data-constrained regimes, STILTs yielded up to 10-point score improvements on some intermediate/target task pairs.
We leveraged textual similarity datasets for task-specific finetuning and then further finetuned only some layers using our in-domain few-shot examples. Our experiments showed that this gives a big boost over other approaches.
One of our biggest challenges was that the training samples for our datasets were much cleaner than the actual user utterances and it was important for the model to generalise well enough to perform well with real-world constraints.
Our data consisted of FAQ data from a set of clients across different domains.
We first created some baselines with sparse-encoding approaches. Our simple TF-IDF based approaches gave us Top-1 accuracies ranging from 40.7 to 48.5 across different sets of client data. Using BM25, we got a Top-1 accuracy in the region of 43 and a Top-5 of 61.
Some of our experiments and their results on our datasets are shown below.
We experimented with simple embedding-based models like USE (Universal Sentence Encoder) to compare against finetuned transformer models. The following subset of experiments was considered for comparison.
- Finetuning the last few layers of a BERT-based transformer model (which wasn’t trained on any textual similarity task) with in-domain few-shot examples, using a classifier head to classify similar vs dissimilar
- Using embeddings from a BERT-based transformer model that was trained for a textual similarity task on open-source datasets, and leveraging them in a similarity-based model
- Using embeddings from BERT-based transformer models that were first trained on an intermediate textual similarity task on open-source datasets, and then had their last few layers finetuned with in-domain few-shot examples
All accuracies were computed on a set of manually annotated actual user utterances which were mapped to intents.
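For reference, a Top-k accuracy of this kind can be computed as in the sketch below. The ranked predictions and gold intents are invented for illustration, and `top_k_accuracy` is our own helper name rather than a library function.

```python
def top_k_accuracy(ranked_predictions, gold_intents, k):
    """Fraction of utterances whose gold intent is among the top-k predictions."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_predictions, gold_intents)
    )
    return hits / len(gold_intents)

# Hypothetical model output: ranked intents for three annotated utterances.
ranked_predictions = [
    ["refund_status", "cancel_order", "change_address"],
    ["change_address", "cancel_order", "refund_status"],
    ["refund_status", "change_address", "cancel_order"],
]
gold_intents = ["refund_status", "cancel_order", "cancel_order"]

print(top_k_accuracy(ranked_predictions, gold_intents, k=1))
print(top_k_accuracy(ranked_predictions, gold_intents, k=2))
```

Top-1 counts only exact first-place hits, while Top-5 also credits the model when the right intent appears anywhere in the first five suggestions.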
As seen from the table, we got the best accuracies when we took a BERT-style transformer model that had already been further trained on a textual similarity task from open-source datasets and finetuned it with few-shot examples.
Hard-sampling was tried for the final fine-tuning, using both word overlaps and distance-based methods. However, it didn’t provide an appreciable boost on the datasets we used.
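As an illustration of the word-overlap variant, the sketch below picks as hard negatives the utterances from other intents that share the most words with the anchor. It assumes Jaccard overlap as the measure; the data and the names `jaccard` and `hard_negatives` are hypothetical and not necessarily what we ran. A distance-based variant would swap `jaccard` for a similarity between embeddings.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two utterances."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def hard_negatives(anchor, anchor_intent, examples, n=1):
    """Pick the n utterances from other intents that overlap most with the anchor."""
    candidates = [
        (jaccard(anchor, text), text)
        for intent, texts in examples.items() if intent != anchor_intent
        for text in texts
    ]
    candidates.sort(reverse=True)
    return [text for _, text in candidates[:n]]

EXAMPLES = {
    "refund_status": ["where is my refund"],
    "order_status": ["where is my order", "track my package"],
    "cancel_order": ["cancel the subscription"],
}
print(hard_negatives("where is my refund", "refund_status", EXAMPLES))
```

Pairs like "where is my refund" and "where is my order" look alike on the surface but belong to different intents, which is what makes them useful negatives in principle.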
Retrieval + re-ranking based approaches are also very common. However, a BM25 kind of approach, which is based on lexical overlap, has its own drawbacks when there is very little word overlap between the train set and the actual user utterances, which contain misspellings. On our dataset, BM25 as a retrieval method caused the accuracy to drop. Embedding methods for retrieval proved much better, without affecting the latency too much.
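The lexical-overlap problem is easy to reproduce with a small BM25 scorer. This is a minimal sketch of Okapi BM25 with whitespace tokenisation, assuming the commonly used defaults `k1=1.5` and `b=0.75` and an IDF variant that stays non-negative; the documents and queries are invented, and this is not our production retriever.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

DOCS = ["where is my refund", "cancel my order", "change my delivery address"]
print(bm25_scores("i need a refund", DOCS))
print(bm25_scores("i need a refend", DOCS))
```

With the correct spelling only the refund document scores above zero. With "refend" no query term matches any document, so every score is zero and the retriever has nothing to rank on.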
Post retrieval, there are options for re-ranking, like using cross-encoders. The result set is re-ranked using more accurate but slower cross-encoders. However, directly using the re-ranker might not give much benefit unless the cross-encoder is also fine-tuned for the task. In our experiments, using cross-encoders trained on open-source textual similarity datasets, without fine-tuning on our samples, didn’t give us any accuracy boost.
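The two-stage flow can be sketched with pluggable scorers. Both scorers below are toy stand-ins chosen so that the example runs without any model: the fast one counts shared words and the slow one counts shared character bigrams. In practice they would be a bi-encoder similarity and a cross-encoder score respectively, and all names here are hypothetical.

```python
def retrieve_then_rerank(query, candidates, cheap_score, expensive_score, top_n=2):
    """Shortlist with a fast scorer, then reorder the shortlist with a slow one."""
    shortlist = sorted(
        candidates, key=lambda c: cheap_score(query, c), reverse=True
    )[:top_n]
    return sorted(shortlist, key=lambda c: expensive_score(query, c), reverse=True)

def word_overlap(query, candidate):
    """Toy fast scorer: number of shared words."""
    return len(set(query.split()) & set(candidate.split()))

def bigram_overlap(query, candidate):
    """Toy slow scorer: number of shared character bigrams."""
    def grams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    return len(grams(query) & grams(candidate))

CANDIDATES = ["cancel my order", "where is my refund", "change my delivery address"]
result = retrieve_then_rerank(
    "i need my refend", CANDIDATES, word_overlap, bigram_overlap, top_n=2
)
print(result)
```

The expensive scorer only ever sees `top_n` candidates, which keeps latency bounded, but it can only reorder what the first stage retrieved. A re-ranker that is not tuned for the task therefore has limited room to help.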
Since we intend to use re-ranking in a more discriminative fashion, it might make sense to use more fine-grained labels than just 0 and 1.
Our team continues to experiment with multiple approaches – those that give us good accuracies and performance along with better generalisation.
As part of the future scope, we will continue to research better techniques to incorporate newer utterances for automatic model retraining. Some of our experiments will focus on models based on recent advances in unsupervised learning, like SimCSE and TSDAE.