Introduction
As a major player in knowledge extraction, Lettria develops tools that process unstructured data to extract relevant information. With Text-to-Graph, the data contained in written documents is transformed into graphs, which then serve as the knowledge base for a GraphRAG solution. While building these tools, Lettria was confronted with a major challenge in natural language processing (NLP): lexical ambiguity. To address it, the company adopted a machine learning approach and launched a comprehensive semantic annotation project for written corpora. The quality of the annotations was paramount, but the choice of model to train, the evaluation methods, and the analysis of results were equally crucial. This article walks through those steps.
Model training
Training an NLP model for disambiguation involves several stages and requires careful consideration of the model type, training data, and optimization techniques. Various models can be employed for NLP tasks, ranging from traditional machine learning methods like support vector machines (SVMs) to deep learning models built on neural networks, such as transformers.
Model selection
Lettria used AutoLettria, a platform module specifically designed for training NLP models. Central to our approach is XLM-RoBERTa, a transformer (a type of neural network architecture). XLM-RoBERTa is a high-performance, multilingual version of RoBERTa (Robustly Optimized BERT Pretraining Approach). Its ability to handle multiple languages aligns with Lettria's multilingual objectives.
Another advantage of XLM-RoBERTa is that it is pre-trained on a vast amount of text in 100 languages using a masked language modeling (MLM) objective. Its architecture allows it to capture contextual information and semantic nuances across different languages, making it highly effective for tasks requiring nuanced understanding, such as word sense disambiguation (WSD).
Training preparation
Once the annotated data is ready, we fine-tune the XLM-RoBERTa model. Fine-tuning involves adjusting the pre-trained model weights on our specific dataset, enabling it to learn the disambiguation task. The training process includes several epochs, where the model iteratively adjusts its parameters to minimize the loss function.
This step is followed by hyperparameter optimization. Key hyperparameters such as the learning rate, the batch size, and the number of training epochs are tuned to improve model performance. Techniques like grid search or random search are used to identify the best combination of hyperparameters.
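To make the difference between the two search strategies concrete, here is a minimal sketch in plain Python. The search space, its values, and the `toy_validation_score` function are all assumptions made for illustration; in a real run the scoring function would fine-tune the model with the given configuration and return a validation metric.

```python
import itertools
import random

# Hypothetical search space: the values are illustrative only.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [16, 32],
    "num_epochs": [3, 5],
}


def grid_search(space):
    """Yield every combination of hyperparameter values (exhaustive)."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def random_search(space, n_trials, seed=0):
    """Yield n_trials configurations, sampling each hyperparameter independently."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {name: rng.choice(values) for name, values in space.items()}


def best_config(configs, score_fn):
    """Return the (config, score) pair with the highest validation score."""
    return max(((c, score_fn(c)) for c in configs), key=lambda pair: pair[1])


def toy_validation_score(config):
    """Stand-in for 'fine-tune with this config, then score on the validation set'."""
    lr_penalty = abs(config["learning_rate"] - 2e-5) * 1e4
    batch_penalty = abs(config["batch_size"] - 32) / 100
    return -(lr_penalty + batch_penalty)


if __name__ == "__main__":
    best, score = best_config(grid_search(SEARCH_SPACE), toy_validation_score)
    print("grid search best:", best, score)
    best, score = best_config(random_search(SEARCH_SPACE, n_trials=5), toy_validation_score)
    print("random search best:", best, score)
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added hyperparameter. Random search caps the number of trials up front, at the price of possibly missing the best combination.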
Evaluation methods
Once training is complete, it's time for evaluation. We used two methods to evaluate the model: in addition to the “classic” evaluation, we created a dedicated test corpus.
The classic method is a kind of self-evaluation. The annotated data is split at random: 70% is used for training, 15% for validation (used to tune the hyperparameters mentioned above), and the remaining 15% serves as the test set. The trained model annotates the test set, and its predictions are compared to the human annotations. This method provides an initial measure of the model's performance.
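The 70/15/15 split described above can be sketched as follows. This is a minimal illustration, not the AutoLettria implementation: the function name `split_dataset`, the fixed seed, and the integer stand-ins for annotated examples are assumptions.

```python
import random


def split_dataset(examples, train=0.70, validation=0.15, seed=42):
    """Shuffle the examples and split them into train/validation/test lists.

    The test set receives whatever remains after the train and validation
    shares (15% with the default ratios).
    """
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


if __name__ == "__main__":
    annotated = list(range(100))  # stand-in for 100 annotated sentences
    train_set, val_set, test_set = split_dataset(annotated)
    print(len(train_set), len(val_set), len(test_set))
```

Keeping the test set untouched during training and hyperparameter tuning is what makes the comparison with human annotations meaningful.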
For a more robust evaluation, we created specific test sets with carefully annotated examples. This method involved selecting sentences containing highly polysemous words (words with multiple meanings) to test the model's ability to disambiguate accurately. Two datasets were generated for each part of speech, one in English and one in French. Lettria's linguistic team selected sentences from the OSCAR corpus, ensuring a balanced representation of all meanings for each polysemous word. These sentences were then annotated during dedicated annotation sessions.
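At Lettria this selection was done by linguists, but the idea of a balanced representation of meanings can be illustrated with a short sketch. It assumes each candidate sentence already carries a provisional sense label; the `balanced_sample` function, the sense names, and the example sentences are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(candidates, per_sense, seed=0):
    """Pick up to `per_sense` sentences for every (word, sense) pair.

    `candidates` is an iterable of (word, sense, sentence) tuples.
    """
    by_sense = defaultdict(list)
    for word, sense, sentence in candidates:
        by_sense[(word, sense)].append(sentence)

    rng = random.Random(seed)
    selected = []
    for (word, sense), sentences in sorted(by_sense.items()):
        # Frequent senses are capped at `per_sense`; rare ones keep all they have.
        k = min(per_sense, len(sentences))
        for sentence in rng.sample(sentences, k):
            selected.append((word, sense, sentence))
    return selected


if __name__ == "__main__":
    # Hypothetical candidates: the "finance" sense is over-represented.
    candidates = [
        ("bank", "finance", "She opened an account at the bank."),
        ("bank", "finance", "The bank raised its interest rates."),
        ("bank", "finance", "He works for an investment bank."),
        ("bank", "river", "They had a picnic on the bank of the river."),
        ("bank", "river", "The boat drifted toward the muddy bank."),
    ]
    sample = balanced_sample(candidates, per_sense=2)
    print(Counter(sense for _, sense, _ in sample))
```

Without such balancing, a test set drawn at random would mostly contain the dominant sense of each word, and a model that always predicts that sense would look better than it is.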
Results analysis
The performance of the model was evaluated using several metrics, each providing insights into different aspects of its accuracy and reliability:
- Accuracy: The overall proportion of correct predictions made by the model. While useful, accuracy can be misleading in cases of class imbalance.
- Precision: The ratio of true positive predictions to the total positive predictions made by the model. Precision is crucial when false positives are costly.
- Recall: The ratio of true positive predictions to the total actual positive instances in the data. Recall is important when the cost of false negatives is high.
- F1-Score: The harmonic mean of precision and recall, offering a balanced measure that is particularly useful in the presence of class imbalance, as is the case in our project.
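The four metrics above can be computed from gold and predicted labels as in the sketch below. The function names and the toy labels are assumptions, and averaging F1 over classes (macro-averaging) is one common choice for multi-class tasks, not necessarily the one used in this project.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Proportion of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        # Undefined ratios (no predictions or no gold instances) are set to 0.
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(gold, predicted)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


if __name__ == "__main__":
    # Imbalanced toy data: a model that always answers "finance".
    gold = ["finance"] * 8 + ["river"] * 2
    predicted = ["finance"] * 10
    print("accuracy:", accuracy(gold, predicted))
    print("macro F1:", macro_f1(gold, predicted))
```

On this toy data the always-"finance" model reaches 80% accuracy while its macro F1 is only about 0.44, which is why F1 is the more informative figure when some senses are much rarer than others.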
Results vary with the metric, the grammatical category, and the evaluation method, but they are always above 50% and reach an F1-score of 73.40% for verb disambiguation in the last training run. These figures are promising, especially considering the number of categories. The progress made in improving the quality of the training data suggests that even better results could be achieved with further data annotation.
Conclusion
Lettria's approach to semantic disambiguation through machine learning is a significant step forward in the development of its tools. By emphasizing high-quality annotated data and robust evaluation methods, we develop high-performance models capable of accurately understanding and processing the intricacies of natural languages. Disambiguation is a crucial component of Lettria's ultimate goal: by improving graphs and ontologies, it contributes to the success of GraphRAG and of knowledge extraction in general, allowing companies to obtain key information for their business.