Discover how we are revolutionizing sentiment analysis by incorporating the game-changing AdapterFusion technique, overcoming catastrophic forgetting and enabling efficient multi-task learning. Learn about adapters' lightweight architecture and their superior performance in our detailed case study.
Imagine struggling to efficiently train a language model on multiple tasks, only to find it forgets previous learnings as new information is acquired. This is the pain point that countless researchers and developers have faced while working on sentiment analysis models.
The inability to retain and share knowledge from multiple tasks has hindered progress and led to the dreaded phenomenon known as catastrophic forgetting.
Enter the game-changing solution: Adapters. Thanks to their lightweight architecture, these components are transforming the landscape of sentiment analysis: they can be trained faster and with fewer GPU resources than a fully fine-tuned large language model, without degrading performance.
The secret to adapters' success lies in their ability to overcome transfer learning challenges, combining knowledge from diverse datasets non-destructively and without compromising performance.
Welcome to the next generation of sentiment analysis, driven by the cutting-edge AdapterFusion technique. It's time to leave catastrophic forgetting behind and embrace a future with intelligent, versatile models capable of effectively learning and sharing knowledge from multiple tasks.
In this blog post, we explore the potential of adapter architectures for sentiment analysis tasks, specifically using the AdapterFusion technique.
We begin by providing a background on adapters and the Transformer model, highlighting their advantages over traditional fine-tuning, with a focus on their lightweight architecture.
Then, we walk through a detailed example of using adapters for sentiment analysis, evaluating both single adapter and fusion models.
Our results show that the two models outperform the one currently in production, but surprisingly, the fusion model does not significantly outperform the single adapter. However, further experimentation is required to draw a more robust conclusion.
Adapters are essentially lightweight layers added to BERT models, which provide two remarkable advantages:
While having significantly fewer trainable parameters, adapters perform on par with fully fine-tuned models, making them an efficient choice.
Adapters excel in transfer learning, as they can learn from different datasets and effectively combine this knowledge using a non-destructive task composition that employs dynamic attention mechanisms.
By leveraging adapters, we can create models that share knowledge from various tasks while avoiding catastrophic forgetting, ensuring a reliable and stable learning process.
Project Progress and Milestones
We started this project on February 1st, with the objective of completing the sentiment and emotion model by April.
The model's design allows for ongoing enhancements based on internal assessments and valuable client feedback, ensuring that it consistently meets and exceeds expectations.
Adapters Architecture: Unleashing the Power
Adapters are a powerful architectural technique for transformer models that allow for efficient fine-tuning without the need to retrain the entire model.
By inserting small, trainable modules between the layers of a pre-trained model, adapters enable customization for specific tasks while reducing the number of trainable parameters, leading to faster training times and lower memory requirements.
This makes adapters an attractive option for natural language processing tasks, such as sentiment and emotion prediction, where quick adaptation and high performance are essential.
Adapter layers consist of a down-projection matrix, a nonlinearity function, an up-projection matrix, and a residual connection.
A key hyperparameter is the bottleneck dimension m, which appears in the definition of reduction_factor. This factor is the ratio between the hidden dimension d of a model layer and the bottleneck dimension: reduction_factor = d / m.
Since m << d, the bottleneck architecture significantly reduces the number of parameters and acts as an encoder that compresses its input and retains only the most useful information.
In practice, adapters train the equivalent of 0.5% to 8% of the original model's parameters, which saves storage space and reduces training time.
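To make the bottleneck structure concrete, here is a minimal pure-Python sketch of a single adapter layer. It is an illustration under stated assumptions, not the implementation used in this project: the class name BottleneckAdapter, the ReLU nonlinearity, the zero-initialised up-projection, and the reduction factor of 16 are all choices made for the example. The hidden dimension of 768 matches XLM-RoBERTa-Base.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Toy adapter: down-projection, ReLU, up-projection, residual connection."""

    def __init__(self, hidden_dim, reduction_factor, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # Bottleneck dimension m = d / reduction_factor.
        self.bottleneck_dim = hidden_dim // reduction_factor
        scale = 1.0 / math.sqrt(hidden_dim)
        self.w_down = [[rng.uniform(-scale, scale) for _ in range(hidden_dim)]
                       for _ in range(self.bottleneck_dim)]
        self.b_down = [0.0] * self.bottleneck_dim
        # Assumption: zero-initialised up-projection, so the adapter starts
        # out as the identity function thanks to the residual connection.
        self.w_up = [[0.0] * self.bottleneck_dim for _ in range(hidden_dim)]
        self.b_up = [0.0] * hidden_dim

    def forward(self, h):
        down = [z + b for z, b in zip(matvec(self.w_down, h), self.b_down)]
        activated = [max(0.0, z) for z in down]  # ReLU nonlinearity
        up = [z + b for z, b in zip(matvec(self.w_up, activated), self.b_up)]
        return [x + u for x, u in zip(h, up)]  # residual connection

    def num_parameters(self):
        d, m = self.hidden_dim, self.bottleneck_dim
        return 2 * d * m + d + m  # two matrices plus two bias vectors


adapter = BottleneckAdapter(hidden_dim=768, reduction_factor=16)
print(adapter.bottleneck_dim)    # 48
print(adapter.num_parameters())  # 74544
```

With these illustrative settings, one adapter holds about 75 thousand parameters, compared with roughly 590 thousand for a single 768 x 768 weight matrix of the base model.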
AdapterFusion: Knowledge Extraction and Composition
AdapterFusion combines several single adapters, each trained independently on a specific task or dataset. The resulting adapter is a fusion of single adapters. This process involves two stages:
Knowledge Extraction: Learning task-specific parameters for each adapter independently.
Knowledge Composition: Efficiently combining knowledge in a non-destructive manner.
This method allows adapters to avoid catastrophic forgetting more effectively than sequential fine-tuning on multiple tasks.
In simple terms, Contextual Activation is a technique used during the knowledge composition stage to improve the performance of the model by combining the knowledge from multiple trained adapters.
This is done by using an attention mechanism, first introduced in the paper Attention Is All You Need by A. Vaswani et al., 2017.
This helps the model focus on the most useful information for a given input. This approach makes it possible to train adapters on sentiment and emotion datasets more effectively than traditional transfer learning methods.
AdapterFusion learns a parameterized mixer of available trained adapters and identifies the most useful adapters for a given input.
The AdapterFusion parameters Ψ consist of Key, Value, and Query matrices at each layer l, denoted K_l, V_l, and Q_l. By using the output z_{l,t} of each adapter t as input to the Key and Value matrices, AdapterFusion learns contextual activation.
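The mixing step can be sketched for a single token at a single layer as follows. This is a simplified illustration rather than the library's code: the function name adapter_fusion, the toy two-dimensional vectors, and the identity matrices are hypothetical, and we assume the Query is computed from the layer's hidden state while the Keys and Values are computed from the adapter outputs.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adapter_fusion(h, adapter_outputs, query, key, value):
    """Mix the outputs of several adapters for one token at one layer.

    h               -- hidden state of the layer (query input), length d
    adapter_outputs -- list of adapter outputs z_t, each of length d
    query/key/value -- d x d matrices (the fusion parameters of this layer)
    """
    q = matvec(query, h)
    keys = [matvec(key, z) for z in adapter_outputs]
    values = [matvec(value, z) for z in adapter_outputs]
    # One score per adapter: dot product of the query with that adapter's key.
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value-transformed adapter outputs.
    fused = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(h))]
    return fused, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [1.0, 0.0]
outputs = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]  # three hypothetical adapters
fused, weights = adapter_fusion(h, outputs, identity, identity, identity)
print(weights)  # the first adapter, most aligned with h, gets the largest weight
```

Because the weights depend on the hidden state, a different input can activate a different adapter, which is what makes the composition contextual.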
Ultimately, this powerful approach enables the effective training of adapters on sentiment and emotion datasets while overcoming the limitations of traditional transfer learning techniques.
In this section, we will discuss the main points regarding the implementation of adapters and their application in our use case.
Tasks to Perform
Predicting sentiments and emotions is a complex challenge due to the intricate nature of natural languages. For example, a single word can express a wide range of sentiments and emotions, and changing or adding words may alter the meaning and interpretation.
Fortunately, both tasks share many common aspects, such as the use of adapter-transformer models and the configuration of hyperparameters during training, even though the prediction tasks differ. Sentiment analysis is a multi-class classification problem, while emotion analysis is a multi-label classification problem.
It is crucial to collect feedback and take advantage of the modularity of adapters to improve the model for each client. The expected predictions may depend entirely on client decisions, which requires fine-tuning on data as close as possible to theirs. For example, there are two ways to predict sentiment:
Speaker-based sentiment: "I really love making fun of people with my friends" would be categorized as Positive.
Moral sentiment: "I really love making fun of people with my friends" would be categorized as Negative.
Review analysis will generally be based on the first type of sentiment, while a bank, for instance, will opt for an analysis based on moral criteria.
The desired output for sentiment prediction is a float number in the range of [-1; 1] (-1 for negative, 0 for neutral, and 1 for positive). However, the datasets used to train the models are generally made for multi-class classification problems, as the target consists of discrete labels (binary, ternary, or five-point-scale labels).
The goal is to implement a multi-class training procedure that can learn from datasets with an arbitrary number of labels and return a continuous output in the range of [-1; 1]. This can be easily achieved with an affine transformation.
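One possible way to apply such an affine transformation is sketched below. It rests on an assumption that the text does not spell out: we first reduce the predicted class probabilities to an expected label index, then rescale that index linearly to [-1; 1]. The function name sentiment_score is hypothetical.

```python
def sentiment_score(probabilities):
    """Map class probabilities over ordered labels to a score in [-1, 1].

    probabilities[i] is the probability of the i-th label, with labels ordered
    from most negative (index 0) to most positive (index n - 1).
    """
    n = len(probabilities)
    if n < 2:
        raise ValueError("need at least two ordered labels")
    # Assumption: summarise the distribution by its expected label index.
    expected_index = sum(i * p for i, p in enumerate(probabilities))
    # Affine map: index 0 -> -1, index n - 1 -> +1.
    return 2.0 * expected_index / (n - 1) - 1.0


print(sentiment_score([0.0, 1.0]))                 # 1.0
print(sentiment_score([0.0, 1.0, 0.0]))            # 0.0
print(sentiment_score([1.0, 0.0, 0.0, 0.0, 0.0]))  # -1.0
```

The same function handles binary, ternary, and five-point labels, since only the number of labels n changes the scaling.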
The final output for emotion prediction will be emotions identified from a list of predefined emotions. The goal is to output a probability for each of the following 28 possible emotions:
The choice of these emotions was directly inspired by the goemotions dataset, which is a human-annotated dataset of 58k Reddit comments in the English language. It features 27 emotions, including 12 positive, 11 negative, 4 ambiguous emotion categories, and 1 additional class: “neutral”.
Presentation of Datasets
The idea is to use a variety of training datasets for our model to acquire maximum knowledge and generalize as much as possible, i.e., predict well on unseen input data.
Likewise, in terms of assessment, the more datasets, the better. In particular, before putting a model into production, it is recommended to assess it on client data (possibly quite noisy) to better anticipate how the model will perform.
Here are some interesting public datasets for sentiment analysis containing a reasonably high number of samples:
Most of these datasets are in English and correspond to reviews. As a result, our model may be more likely to classify reviews than other types of data, which works fine for us at Lettria since reviews are what we mostly work on predicting. However, it can be limiting for other types of data.
The majority of these datasets have binary labels, which can be problematic for predicting positive, neutral, and negative classes. On the other hand, 5-label datasets contain much more useful information for classifying sentiments into three classes, even though training may take longer due to increased complexity.
Among these datasets, Amazon-Reviews-Multi has been chosen. It includes 200,000 samples with reviews containing at least 20 characters and has 5 labels.
Dataset to Assess Sentiment Predictions
Our models will be assessed on an anonymized French dataset containing about 661 samples. The labels have been carefully and manually defined, and inputs that are too ambiguous have been eliminated.
As observed in the pie chart, the dataset remains quite imbalanced due to the small number of neutral labels.
The public dataset go_emotions available on HuggingFace in English will be used.
It corresponds to a human-annotated dataset of 58k Reddit comments extracted from popular English-language subreddits and labeled with the 28 emotion categories defined above.
It contains 211,225 annotations, provided by 82 annotators, of 58,011 text examples found on Reddit. Each input text was annotated by several annotators to reach a consensus given the subjectivity of the annotations.
To predict emotions in French, English, and Spanish, the original English dataset was translated into French and Spanish using the DeepL API.
Settings and Configurations
The adapter-transformer architecture will be inserted into a multilingual pre-trained transformer-based model, XLM-RoBERTa-Base, to handle English, French, and Spanish (and other languages in the future). It has been pre-trained with a masked language modeling objective and contains about 280 million parameters.
Compared to the full base model, the adapters have far fewer trainable parameters: only 1.5 million, i.e., about 0.5%.
The idea is to monitor the model's metrics as closely as possible to understand its performance. The use of MLFlow will be particularly helpful in obtaining a thorough assessment of the model's performance.
Before optimizing the model, the datasets were preprocessed. This includes:
Removal of undesired patterns unsupported by data manipulation libraries
Translating the datasets to obtain French, English, and Spanish versions
Standardizing the dataset labels to handle datasets with integer labels starting from 0
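As an illustration of the label standardization step, the sketch below remaps arbitrary ordered labels to consecutive integers starting from 0. The helper name standardize_labels and the star-rating example are hypothetical and not taken from our pipeline.

```python
def standardize_labels(labels):
    """Remap arbitrary ordered labels to consecutive integers starting at 0."""
    ordered = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


stars = [5, 1, 3, 3, 2, 4]  # hypothetical 1-to-5 star ratings
encoded, mapping = standardize_labels(stars)
print(encoded)  # [4, 0, 2, 2, 1, 3]
print(mapping)  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
```

Note that this mapping is built from the labels actually present, so a label value missing from the data would shift the indices; on a real dataset it is safer to pass the full list of expected labels.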
Once the datasets are ready for use, optimization consists in finding appropriate configurations for:
lr: the learning rate
distribution_weights_loss: weights to take into account dataset imbalance
loss: the choice of the loss function
activation_function: it is possible to use softmax and treat emotion analysis as a multi-class problem instead of a multi-label one
Cross Entropy Loss will be used as the loss function for sentiment analysis, since it’s a multi-class classification problem. For emotion analysis, binary cross-entropy loss can be used since it is a multi-label classification problem.
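The difference between the two losses can be seen in a small pure-Python sketch. It is a simplified illustration for a single example, without class weights, and the function names are our own; in practice a deep learning framework's built-in losses would be used.

```python
import math


def cross_entropy(logits, target_index):
    """Multi-class loss: softmax over all classes, exactly one correct class."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logits, targets):
    """Multi-label loss: independent sigmoid per label, averaged over labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))],
        # where s is the sigmoid function.
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Sentiment: three mutually exclusive classes, the true class is index 2.
print(cross_entropy([0.1, 0.2, 2.5], target_index=2))

# Emotion: several labels can be active at once (here the first and third).
print(binary_cross_entropy([3.0, -2.0, 1.5], targets=[1, 0, 1]))
```

With softmax, raising the score of one class lowers the probability of the others, whereas with independent sigmoids each emotion is scored on its own, which is why several emotions can be predicted for the same text.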
To evaluate our model's performance on multi-class and multi-label classification problems, we can use tools like confusion matrices and sklearn's classification reports. These provide detailed results for each class, helping us understand how well the model performs. Since our test dataset is imbalanced, accuracy might not be the best metric. Instead, we'll focus on the F1-Score, which takes into account both precision and recall and is well-suited for imbalanced datasets.
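To see why accuracy can mislead on an imbalanced test set, consider the following sketch. The labels and predictions are made up for the illustration, and the helper functions are simplified stand-ins for what sklearn's metrics compute, with the F1-score of a class set to 0 when it has no true positives, false positives, or false negatives.

```python
def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced test set: 8 positive, 1 neutral, 1 negative.
y_true = ["pos"] * 8 + ["neu", "neg"]
y_pred = ["pos"] * 10  # a model that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.8
print(macro_f1(y_true, y_pred))  # roughly 0.30
```

The majority-class model looks good on accuracy but poor on macro F1, because the two minority classes each contribute an F1-score of 0 to the average.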
Experiment: AdapterFusion for Combining Languages
According to research by J. Pfeiffer et al., AdapterFusion is an appealing feature of adapter-based models. We'll explore whether it can improve our model's performance. Previous experiments have shown that the Amazon-Reviews-Multi dataset is particularly useful for training single adapters. As a result, all the models in this experiment will be trained on this multilingual dataset.
We'll train our models on 90,000 samples in French, English, and Spanish, using a batch_size of 8 and a lr of 1e-4. The first model (SingleAdapterModel) will be trained on a combined dataset of all three languages. The second model (AdapterFusionModel) will fuse three single adapters, each trained on a separate language dataset, and then fine-tune the fused model on the combined dataset.
It's important to note that we can't conclusively say whether fusion is better than single adapters based solely on this experiment. A more thorough comparison would involve averaging the results over multiple trainings with different configurations and displaying the standard deviations.
Summary of Results
The results of the experiment show that both the SingleAdapterModel and AdapterFusionModel outperform the current model in production in terms of F1-score and accuracy. The SingleAdapterModel achieved an accuracy of 0.871 and a macro average F1-score of 0.664. In comparison, the AdapterFusionModel had an accuracy of 0.861 and a macro average F1-score of 0.674.
While both models show improvements over the current production model, the differences between them are not significant enough to conclusively determine the superiority of one approach over the other. In particular, the fusion model does not significantly outperform the single adapter.
Since we're using a multilingual base model (XLM-RoBERTa-Base), it's possible that training on a multilingual dataset doesn't cause catastrophic interference issues. According to the paper by J. Pfeiffer et al., fusion models perform better than single adapters on low-resource datasets (those with limited training data), while single adapters perform slightly better on high-resource datasets (those with ample data available).
Given that the performance of both approaches is comparable, the single adapter has practical advantages. The fusion model requires more storage space: the ONNX model takes up 1.13GB compared to 1.04GB for the single adapter. Additionally, it might be more convenient to enhance a single model rather than a fusion one.
Ultimately, it's important to remember that one experiment is not enough to draw definitive conclusions.
However, what's evident is that both models tested significantly outperform the current production model with the given configurations.
Romain, a dedicated Data Scientist at Lettria, is passionate about Computer Science and Optimization. In addition to his professional pursuits, he is an enthusiastic member of a rock band and has a love for golf.