Introduction
As part of the development of its Graph-RAG project, Lettria was confronted with a problem specific to natural languages: disambiguation. The inherent ambiguity in natural language lexicons must be addressed to represent information accurately. Among various disambiguation methods, supervised learning approaches have proven effective. These approaches require training a language model with annotated data. Therefore, Lettria has implemented a semantic annotation project to enhance its unstructured data processing capabilities. Let's explore each step of the journey: creating semantic categories, selecting corpora, training annotators, organizing annotation sessions, and implementing annotation quality assessment tools.
Preparatory work for annotation
Semantic categorization
Training a language model for disambiguation requires annotated data, that is, text corpora in which word meanings are marked. For example, in the sentence "We saw a bat fly away", the word "bat" would be tagged with the Animal category or one of its subclasses, whereas in "The player holds his bat properly", it would be tagged with the Sports_instrument category. Using context-based statistics, the model can then predict which of a word's possible meanings applies in a given sentence.
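To make the idea of context-based statistics concrete, here is a minimal sketch of a toy disambiguator. It is not Lettria's model: the training sentences, the function names `train` and `predict`, and the simple co-occurrence scoring are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotated examples: (sentence tokens, target word, category).
TRAINING = [
    ("we saw a bat fly away".split(), "bat", "Animal"),
    ("a bat hung in the cave".split(), "bat", "Animal"),
    ("the player holds his bat properly".split(), "bat", "Sports_instrument"),
    ("he swung the bat at the ball".split(), "bat", "Sports_instrument"),
]


def train(examples):
    """Count how often each context word co-occurs with each (word, category)."""
    counts = defaultdict(Counter)
    for tokens, target, category in examples:
        for token in tokens:
            if token != target:
                counts[(target, category)][token] += 1
    return counts


def predict(counts, tokens, target):
    """Pick the category whose training contexts overlap most with this one."""
    scores = {
        category: sum(context[t] for t in tokens if t != target)
        for (word, category), context in counts.items()
        if word == target
    }
    # A word never seen in training has no candidate category.
    return max(scores, key=scores.get) if scores else None


counts = train(TRAINING)
print(predict(counts, "the player dropped his bat".split(), "bat"))
```

With so little data, frequent words such as "the" weigh heavily on the score. A real model would rely on far more annotated sentences and richer context features, which is why the volume and consistency of annotation matter.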
The first step was to create the semantic categories to be assigned to words during annotation. Lettria developed lists of semantic categories for three parts of the lexicon: around 500 for verbs, a thousand for nouns, and a hundred for adjectives. The categories are valid across all languages, as they abstractly represent the concepts denoted by words. The creation of these semantic categories, organized in graphs, and some of the challenges encountered are described in detail in a separate article.
Notably, semantic category graphs were occasionally modified during the initial annotation campaigns to add missing categories, remove redundant ones, or refine distinctions. However, such modifications must become rarer over time, as they require re-annotating previously tagged data and disrupt annotators' knowledge of the categories.
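The way categories and their subclasses relate to each other can be sketched as follows. The category names other than Animal and Sports_instrument are hypothetical, and the sketch assumes each category has a single parent, which may be simpler than the actual graphs.

```python
# Hypothetical fragment of a noun category graph, stored as child -> parent.
PARENT = {
    "Animal": "Living_being",
    "Mammal": "Animal",
    "Sports_instrument": "Instrument",
    "Instrument": "Artifact",
}


def ancestors(category):
    """Return the chain of parents from the category up to its root."""
    chain = []
    while category in PARENT:
        category = PARENT[category]
        chain.append(category)
    return chain


def is_a(category, candidate_ancestor):
    """True if the category equals or descends from candidate_ancestor."""
    return category == candidate_ancestor or candidate_ancestor in ancestors(category)


print(ancestors("Mammal"))
print(is_a("Sports_instrument", "Animal"))
```

Under this representation, removing or merging a category changes the ancestor chain of every descendant, which illustrates why late modifications force re-annotation of data already tagged with the affected categories.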
Corpus selection
Selecting the corpora to be annotated is the next step. A machine learning model performs best on data similar to its training data. Therefore, it is advisable to choose corpora based on current or anticipated customer data.
Several criteria must be considered, including:
- Balance between written and transcribed oral data.
- Text type: journalistic, social media posts, lists, product descriptions, etc.
- Lexicon type, particularly specialized terminology such as finance, health, or e-commerce.
- Text quality in terms of syntax, spelling, and punctuation.
Lettria's customers have diverse activities and resources, so various corpus types were chosen for annotation, including:
- Journalistic texts from various sources and the content of wikiHow, offering both high writing quality and a wide lexical range.
- The OSCAR corpus, containing unrestricted web texts and providing diverse vocabulary, styles, and quality.
- Product data sheets from a merchant site, enabling testing of disambiguation on specific vocabulary and styles.
- To a lesser extent, transcriptions of oral conversations, addressing the peculiarities of this speech type. Although most of Lettria's customer data is written, we may indeed process transcribed oral data such as customer service calls.
Each corpus type was annotated in English and French, with each part of speech annotated separately.
Annotation training
Lettria assembled a team of 10-15 annotators led by the head of the linguistic team, the project’s “super annotator”. The team underwent extensive training to master all semantic categories and consolidate the linguistic knowledge necessary for high-quality annotation. Topics included, for example, the different types of pronominal verbs in French, factive structures, and the passive voice. All these subjects are crucial for correctly identifying the semantic subject of verbs and selecting the appropriate category.
To assess their knowledge, annotators were also invited to take a monthly quiz. The quiz has formative value: it is a good way to recognize one's mistakes and remember the right answer. It also shows which cases were least well understood, so that these subjects can be revisited in additional training at the start of annotation sessions.
Annotation in practice
The annotation platform
The annotation was conducted on the Lettria platform provided for this purpose, also used for other classification work carried out internally or by our clients. This platform has been designed to make annotation as fluid and efficient as possible. Thanks to preparatory syntactic labeling, words belonging to the desired grammatical category (verb, noun or adjective) were underlined. A simple click on each word displayed the list of available semantic categories.
A dedicated field made it possible to search for a specific category by typing its name or, when the annotator was unsure which category was most suitable or did not know its name, by typing associated words. When preparing a campaign on the platform, comments can be attached to each label. Here, these comments contained definitions, examples of words that could belong to the category, and sample sentences.
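A search of this kind, by category name or by associated words, could be sketched as below. The `Label` structure, the sample labels, and the matching rules are assumptions for illustration and do not describe the platform's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    """A category with the comments attached to it (hypothetical structure)."""
    name: str
    definition: str = ""
    example_words: list = field(default_factory=list)


# Hypothetical labels with their attached definitions and example words.
LABELS = [
    Label("Animal", "A living non-human creature", ["bat", "dog", "sparrow"]),
    Label("Sports_instrument", "An object used to play a sport", ["bat", "racket"]),
    Label("Vehicle", "A means of transport", ["car", "bicycle"]),
]


def search(labels, query):
    """Return matching label names: name matches first, then word matches."""
    q = query.strip().lower()
    if not q:
        return []
    by_name = [label.name for label in labels if q in label.name.lower()]
    by_word = [
        label.name
        for label in labels
        if label.name not in by_name
        and any(q == word.lower() for word in label.example_words)
    ]
    return by_name + by_word


print(search(LABELS, "sport"))
print(search(LABELS, "bat"))
```

Searching on "bat" returns both candidate categories, which is the situation where the attached definitions and sample sentences help the annotator decide.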
Annotation sessions
The team annotated during weekly one-hour sessions. Each corpus gave rise to one or more campaigns of a set number of sentences, which were automatically distributed among the annotators as annotation progressed. The duration of each campaign was set according to the timetable of Lettria's objectives. The corpora of the two languages were sometimes annotated in parallel and sometimes successively, but always for the same grammatical category during the same period, so that annotators' knowledge of the categories was not disrupted.
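Distributing sentences on demand, as annotation progresses, can be sketched with a simple queue. The `CampaignQueue` class and the annotator names are hypothetical. The `required` parameter anticipates the double annotation described under Quality assessment, and the sketch assumes that a sentence handed to an annotator is always completed.

```python
from collections import deque


class CampaignQueue:
    """Hand out sentences on demand until each has enough annotators."""

    def __init__(self, sentence_ids, required=2):
        self.required = required
        self.pending = deque(sentence_ids)
        self.seen_by = {sid: set() for sid in sentence_ids}

    def next_for(self, annotator):
        """Return the next sentence this annotator has not seen, or None."""
        for _ in range(len(self.pending)):
            sid = self.pending.popleft()
            if annotator not in self.seen_by[sid]:
                self.seen_by[sid].add(annotator)
                if len(self.seen_by[sid]) < self.required:
                    self.pending.append(sid)  # still needs another annotator
                return sid
            self.pending.append(sid)  # already seen: rotate and keep looking
        return None


queue = CampaignQueue(["s1", "s2"])
print(queue.next_for("ana"), queue.next_for("ben"))
```

Because sentences are pulled rather than assigned in advance, faster annotators naturally receive more sentences, and no annotator receives the same sentence twice.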
While each annotator could annotate independently in the dedicated space on the Lettria platform, it was important for everyone to annotate at the same time, in the same room and/or in the same online meeting. This enabled each one to exchange views with other annotators and benefit from immediate answers from the super annotator when in doubt. This collaborative environment enhanced the learning process.
After the first campaigns, we implemented pre-annotation. The data from the first campaigns was used to train the model, whose predictions were then applied to subsequent campaigns for the same part of speech. Thus, for each word concerned, a semantic category was proposed, which the annotator could validate or replace with one of their choice.
The risk here was that annotators might settle for the proposal when a better category could have been chosen. This risk is offset by several points:
- We have spotted very few cases of categories that are both wrong and plausible.
- The model could in some cases choose a more relevant category than the one the annotator would have chosen.
- In any case, it is undeniable that this pre-annotation increased speed and therefore enabled more data to be annotated, which had an inevitably positive impact on model training.
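The validate-or-replace workflow of pre-annotation can be sketched as follows. The function names, the dictionary layout, and the stand-in predictor are hypothetical. The acceptance rate is an added illustration of how one might monitor whether proposals are being accepted too readily, not a metric reported in this project.

```python
def pre_annotate(words, predict):
    """Attach the model's proposed category to each word to annotate."""
    return [{"word": w, "proposed": predict(w), "final": None} for w in words]


def review(item, replacement=None):
    """Validate the proposal, or replace it with the annotator's category."""
    item["final"] = replacement if replacement is not None else item["proposed"]
    item["accepted"] = item["final"] == item["proposed"]
    return item


def acceptance_rate(items):
    """Share of reviewed items where the proposal was kept unchanged."""
    reviewed = [i for i in items if i["final"] is not None]
    if not reviewed:
        return 0.0
    return sum(i["accepted"] for i in reviewed) / len(reviewed)


# Stand-in for a trained model: a fixed lookup table (assumption).
GUESSES = {"bat": "Animal", "racket": "Sports_instrument"}
items = pre_annotate(["bat", "racket"], GUESSES.get)
review(items[0], "Sports_instrument")  # annotator replaces the proposal
review(items[1])                       # annotator validates the proposal
print(acceptance_rate(items))
```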
Quality assessment
Naturally, for the model to disambiguate correctly, the data must be correctly annotated. Inconsistent or incorrect annotations lead to biased or inefficient models. This means that annotators had to select the same categories for similar cases. The aim was to give the machine consistent information, not to let the machine determine the best category from divergent information.
To ensure data quality, double annotation was implemented: each sentence had to be annotated by two different people. At the end of each session, the sentences that had not been annotated in the same way were reviewed. The two annotators concerned tried to reach agreement; failing that, the super annotator made the final decision. Of course, two people could still make the same annotation mistake, but the risk is lower than if each sentence were annotated by a single person.
This correction stage was carried out directly in the platform, in a dedicated view listing the annotated sentences. The level of consensus between annotators was displayed for each sentence, making it easy to spot those that did not reach 100% agreement. Corrections were made directly using the Edit button.
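The agreement check behind double annotation can be sketched as below. The data layout (one dictionary per annotator, mapping a word position to a category) and the function names are assumptions; the platform's own consensus computation is not described in detail.

```python
def sentence_agreement(first, second):
    """Share of annotated words given the same category by both annotators."""
    positions = set(first) | set(second)
    if not positions:
        return 1.0
    # A word tagged by only one annotator counts as a disagreement.
    same = sum(1 for p in positions if first.get(p) == second.get(p))
    return same / len(positions)


def to_review(campaign):
    """Return ids of sentences whose two annotations do not fully agree."""
    return [
        sid for sid, (first, second) in campaign.items()
        if sentence_agreement(first, second) < 1.0
    ]


# Hypothetical campaign: sentence id -> (annotator 1 tags, annotator 2 tags).
CAMPAIGN = {
    "s1": ({3: "Animal"}, {3: "Animal"}),
    "s2": ({1: "Person", 4: "Sports_instrument"}, {1: "Person", 4: "Animal"}),
}

print(to_review(CAMPAIGN))
```

Sentences returned by `to_review` are the ones the two annotators, and if necessary the super annotator, would need to discuss.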
Conclusion
To carry out its natural language disambiguation project, Lettria established a comprehensive corpus annotation process, an important body of work that precedes training the model on the data and evaluating the results. Each phase, from creating semantic categories to verifying annotation quality, required precise decisions and expertise in the subtleties of language and in learning models. This attention to quality is essential to Lettria's approach. Successful disambiguation is a major asset for building better knowledge graphs, particularly for GraphRAG, and thus for offering powerful knowledge extraction solutions.