Lettria's semantic categorization project as part of the development of an NLP solution

Lettria's textual data management platform is a natural language processing (NLP) tool built on several years of research and development in algorithms, data science, software engineering and linguistics. Linguistics is at the heart of this work, as natural languages are the raw material of information. An essential step in the formalization of textual data, semantic categorization aims to identify the type of meaning of each word in a given utterance, the word being, in theory, the smallest autonomous meaningful unit.

The creation of semantic categories thus contributes to the structuring of data, but it also helps address a major problem in the processing of natural languages: ambiguity. This work is not without obstacles. Phrasemes are one of them, as are the syntactic and semantic particularities of verbs, especially pronominal verbs in French. All these pitfalls have led Lettria to make decisive choices, always with the objective of obtaining the most relevant results.


Structuring information

Structuring information requires identifying the entities and events evoked in a text, as well as their properties and relationships. To do this, Lettria has created several hundred semantic categories organized within graphs. Each category groups words of the same part of speech that share common semantic features, regardless of language. Each category has a definition, and each word in a text must correspond to a single category. For example, the category aircraft is defined as "air transport vehicle" and includes the English words plane and helicopter as well as the French words avion and hélicoptère. The category danger is defined as "dangerous thing; that which threatens something or somebody" and includes the English words danger, peril and risk and the French words danger, péril and risque.
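As a rough sketch (not Lettria's actual implementation), such a category can be pictured as a record holding a definition and per-language word sets; the `Category` class and `category_of` helper below are invented for illustration, using the aircraft example above:

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    definition: str
    # language code -> words belonging to the category in that language
    words: dict[str, set[str]] = field(default_factory=dict)

aircraft = Category(
    name="aircraft",
    definition="air transport vehicle",
    words={"en": {"plane", "helicopter"}, "fr": {"avion", "hélicoptère"}},
)

def category_of(word: str, lang: str, categories: list[Category]):
    """Return the single category a word belongs to, or None if unlisted."""
    for cat in categories:
        if word in cat.words.get(lang, set()):
            return cat.name
    return None
```

The point of the structure is the constraint stated above: a word in context resolves to exactly one category, whatever its language.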

In general, nominal categories stand in the same relationships to each other as nouns do, in terms of hypernymy and hyponymy. Howard Jackson defines hyponymy as follows:

"Hyponymy refers to the semantic relation of inclusion: the meaning of a (more specific) lexeme is included in that of another (more general) lexeme. Thus nova is a hyponym of star: star is the superordinate term, and nova is the subordinate term in this semantic relation. Hyponymy is often important in displaying the semantic relations between items within a lexical field." (Howard JACKSON, Words and their Meaning, p. 92, Routledge, 1988)

Hypernymy is therefore the inverse of hyponymy. These relations can be found in the structure of our graphs: the category mammal has a more general meaning than the category carnivore, whose meaning is itself more general than that of the category dog.

Similarly, the category skin_illness has a more precise meaning than the category illness.

It is therefore logical that nominal categories designating concrete entities are arranged hierarchically in a taxonomic representation going from the most general to the most particular.

This organization facilitates both the creation and the use of categories. The concept of taxonomy, typically associated with attempts to describe the living world, is explained by Jacques Ruffié:

"Classification, or taxonomy, is therefore the process of gathering, through successive levels, living organisms that have more and more common characteristics. In doing so, we go from higher classification levels (phyla, classes) to lower classification levels (orders, families, species)." (Jacques RUFFIÉ, De la biologie à la culture, p. 18, Flammarion, 1976)

Thus, the category plant, which contains the words plant and vegetal, has several subcategories, including tree, which contains the words tree and shrub. The category tree itself has several subcategories, including conifer, which contains words like cedar, larch and fir.
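The hyponym-to-hypernym edges described so far can be sketched as a simple child-to-parent mapping; walking the mapping upward yields a category's increasingly general ancestors. This is a toy illustration built from the examples in the text, not the structure of Lettria's graphs:

```python
# Toy taxonomy: each category points to its hypernym (parent).
HYPERNYM = {
    "conifer": "tree",
    "tree": "plant",
    "dog": "carnivore",
    "carnivore": "mammal",
    "skin_illness": "illness",
}

def ancestors(category: str) -> list[str]:
    """List a category's hypernyms, from most specific to most general."""
    chain = []
    while category in HYPERNYM:
        category = HYPERNYM[category]
        chain.append(category)
    return chain
```

For instance, `ancestors("conifer")` walks up through tree to plant, mirroring the hierarchy just described.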

This type of organization is valid for realities other than living beings, as the linguist and lexicographer Alain Rey has shown:

"These logical classes, whose constitution is studied by taxonomy, manifest themselves well in the field of hierarchical and systematic classifications of the natural sciences (nomenclatures), but also in the field of hierarchies of objects spontaneously observed and named within a culture (folk taxonomies)." (Alain REY, La Terminologie, p. 35, P. U. F., coll. "Que sais-je", 1979)

For example, in our graph, the category drink sits above the category alcoholic_drink, which itself sits above several categories including wine and beer. Such an organization facilitates both the creation of graphs and their consultation. As we shall see, for certain nouns designating abstract entities and for verbal categories, the hierarchy is based on another type of organization.

This raises the question of the granularity of the categories, i.e. their level of precision. Lettria has chosen a general representation intended to cover all knowledge domains without privileging any of them, while keeping the possibility of offering complementary, more detailed graphs for specialized domains such as medicine or agronomy. This is why some categories group a word's hyponyms together with the word itself. For example, the category dog is a final category – it has no subcategories – which contains the word dog and its synonyms mutt and bow-wow, but also poodle, greyhound and corgi, although these are hyponyms of dog.

Moreover, such an organization is necessarily subjective, as a word can have several hypernyms. Even the classification of living beings is not consensual and is constantly evolving with scientific discoveries. We have therefore retained the most salient semantic features and the representations most commonly accepted by non-specialist speakers. Mother is generally considered a hyponym of woman, but we preferred to include it in the category family_member ("person designated by their family or marital relationship") along with uncle and cousin. Bell pepper is included in the category vegetable although, from a scientific point of view, it is a fruit. This is a so-called naive or popular, non-scientific representation of the world, and it is exactly what we are looking for when formalizing the information conveyed by everyday language, outside any scientific field.

Lettria’s database, which associates thousands of words with their possible categories, is therefore both a linguistic and an encyclopedic database offering a structured representation of the realities and concepts designated by words. Eventually, the categories will be enriched with ontological properties that will make it possible, among other things, to assign them a certain number of attributes (for example, a person has a date of birth, a city has geographical coordinates), but also to link them by relations other than the taxonomic hierarchy (a person having a city of birth, the corresponding categories will be linked by a dedicated relation). This enrichment will not only refine our structuring work, but also offer Lettria's clients a solid foundation for the creation of their own ontologies.


The creation of categories is also of great use in Lettria’s disambiguation work. Ambiguity has been defined as follows:

"Ambiguity is a property of natural languages due to the absence of a one-to-one correspondence between form and meaning. It can be defined semantically, close to immediate perception: any statement that can be interpreted in several ways is ambiguous." (Michel ARRIVÉ, Françoise GADET & Michel GALMICHE, La Grammaire d’aujourd’hui, p. 60, Flammarion, 1986)

At the lexical level, disambiguation is the task of resolving ambiguity in the meaning of words. It concerns both homographs (such as yard in It’s one yard long and Children play in the yard) and polysemous words (the different meanings of head in He has a hat on his head and She’s the head of the government, for example). While human beings can disambiguate in many cases thanks to context, a machine cannot unless it has been trained beforehand.

To build its NLP model, Lettria chose to tackle this problem with machine learning, a field of artificial intelligence that aims to create models whose performance improves with the data they process. Annotation campaigns were therefore set up: Lettria's teams annotated textual data of various origins by assigning each word its appropriate category. After several months of training, the machine now recognizes the correct category most of the time thanks to context analysis.
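To make the intuition concrete, here is a deliberately naive context-overlap disambiguator. This is not Lettria's trained model — just the core idea that surrounding words select the category; the cue sets and category names are invented for illustration:

```python
# Invented cue words per candidate category for the ambiguous noun "head".
CUES = {
    "body_part": {"hat", "hair", "neck"},
    "leader": {"government", "company", "state"},
}

def disambiguate(context: set[str], candidates: list[str]) -> str:
    """Pick the candidate category whose cue words overlap most with the context."""
    return max(candidates, key=lambda cat: len(CUES.get(cat, set()) & context))
```

A real system learns these associations from annotated data rather than from hand-written cue lists, but the decision it makes is of the same kind: context words vote for a category.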

Note that semantic ambiguity also concerns the names of the categories themselves. A category is often named after one of its most representative words. Since the aim is to create a univocal model, a name must be used for a single category, and each name must be as unambiguous as possible: even though each category comes with a definition, an ambiguous name can be a source of error during annotation. A monosemic word, or a word with a clearly delimited meaning, is therefore preferred. In some cases, a paraphrase is used instead: the category defined as "character of a person who lacks moral values" is thus named moral_lowness.

The case of phrases

Lexical fixation

A major obstacle to semantic categorization is the impossibility of assigning meaning to words that appear in certain fixed sequences: the meaning of a sequence of words is not always the combination of the meanings of the units that compose it. When it’s raining cats and dogs, you don’t see animals falling. Pulling someone’s leg doesn’t require touching them. And when you say "break a leg" to someone, you are not urging them to hurt themselves. To this is added the existence of fossil words, which can never, or can no longer, be used freely. Fro in to and fro, nother in a whole nother, or the French fur in au fur et à mesure ("as one goes along") are typical examples.

This is the phenomenon of lexical fixation: elements of discourse lose their semantic autonomy within certain combinations. These combinations are grouped under different names, among which we find phrases (unités phraséologiques in French), and their definition, like their delimitation, varies from one author to another:

"The set of fixed expressions is heterogeneous, and several distinct types may be observed." (Rosamund MOON, ‘There is reason in the roasting of eggs': a consideration of fixed expressions in native-speaker dictionaries, in EURALEX '92: Proceedings I-II : papers submitted to the 5th EURALEX International Congress on lexicography in Tampere, Finland (Studia translatologica), p. 494, Oxford University Press, 1992)

But whether they are open compound words (rec room) or figurative expressions (to beat around the bush), the problem is the same: linguists cannot semantically categorize each element. The difficulty is even greater in English, with its many phrasal verbs: units composed of a verb and a grammatical word (often an adverb or a preposition) whose meaning differs from that of the verb in free use. To cry up, for example, is used transitively in the sense of "to praise", a sense that cry does not have on its own and that is not the combination of the senses of cry and up.

Tokenization and phraseological categorization

Lettria tackles this problem at two stages of data processing. The first is tokenization, which consists of segmenting each sentence into tokens, i.e. manipulable linguistic units such as words and punctuation marks, but also entities such as proper nouns (Statue of Liberty) or dates. At this stage, a certain number of polylexical sequences are identified as semantically indecomposable minimal units. Bell pepper and de facto, although composed of several character sequences separated by spaces, are lemmatized, treated as minimal elements and categorized as such. But these are only short monosemic sequences, identified as words in the commonly accepted sense of the term.
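A minimal sketch of this kind of multiword-aware tokenization, assuming a pre-listed lexicon of polylexical units (the `MULTIWORD` set below is invented for illustration, and punctuation handling is omitted):

```python
# Invented lexicon of two-word sequences to be treated as single tokens.
MULTIWORD = {("bell", "pepper"), ("de", "facto")}

def tokenize(sentence: str) -> list[str]:
    """Split on whitespace, merging listed polylexical sequences into one token."""
    words = sentence.lower().split()
    tokens, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in MULTIWORD:
            tokens.append(words[i] + " " + words[i + 1])  # one indecomposable unit
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

`tokenize("Add one bell pepper")` thus yields three tokens, with bell pepper kept whole so it can be categorized as a single unit downstream.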

A later stage recognizes phrases at a higher level of discourse, previously listed by Lettria as phraseological sequences that can be decomposed but whose combination has a non-predictable meaning. It is at this stage that locutions, expressions and phrasal verbs are identified and associated with one or more possible categories, as these sequences may be polysemous. During disambiguation, the listed units are identified and categorized by the machine in the same way as monolexical units. The sequences lose life and cry up (and all their conjugated forms) will therefore be assigned the categories die and praise respectively. When several categories are possible, disambiguation works just as for monolexical units, through contextual analysis. For example, the verb deliver will receive the category give_(sthg) or express_(sthg) depending on whether its object designates something material or a piece of information.
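The listing stage can be pictured as a lexicon mapping each phraseological unit to its candidate categories: a single candidate is assigned directly, while several candidates are left to contextual disambiguation. A sketch, with category names taken from the examples above:

```python
# Invented fragment of a phraseological lexicon: unit -> candidate categories.
PHRASE_CATEGORIES = {
    "lose life": ["die"],
    "cry up": ["praise"],
    "deliver": ["give_(sthg)", "express_(sthg)"],
}

def candidate_categories(unit: str) -> list[str]:
    """Candidate categories for a listed unit; empty if the unit is not listed."""
    return PHRASE_CATEGORIES.get(unit, [])

def needs_disambiguation(unit: str) -> bool:
    """True when more than one category is possible and context must decide."""
    return len(candidate_categories(unit)) > 1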

Literal meaning vs figurative meaning

But if some phrases can receive several categories, there are also sequences that may or may not be phrases: like monolexical units, polylexical units can have a literal and a figurative meaning. Piece of cake can thus be a phrase with a figurative meaning, "something easily done" (Merriam-Webster.com, 2022), and correspond to the category simple_thing: Operating this device is a piece of cake. But, of course, it can also be used literally: I'll have a piece of cake for dessert. In that case it is not possible to assign a category to the whole sequence, since the semantic unit is each individual word.

The preferred solution is based on frequency. When the literal meaning is possible but a priori much less frequent, the figurative meaning is chosen. This is why to give the cold shoulder is considered a phrase receiving the category neglect_(sthg). When the figurative meaning is less frequent, or when the two are more or less equally frequent, the literal meaning is privileged, as for piece of cake. The sequence is then considered a free combination of autonomous elements, each of which receives its own category.
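The frequency rule can be sketched as a simple threshold test; the ratio and the corpus counts in the usage note are invented for illustration:

```python
def treat_as_phrase(figurative_count: int, literal_count: int,
                    ratio: float = 2.0) -> bool:
    """Keep the figurative (phrase) reading only if it is much more frequent
    than the literal one; otherwise fall back to the literal reading."""
    return figurative_count >= ratio * literal_count
```

With made-up counts, `treat_as_phrase(90, 10)` would keep to give the cold shoulder as a phrase, while `treat_as_phrase(40, 60)` would leave piece of cake to its literal, word-by-word reading.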

Categorizing verbs

A non-taxonomic organization

The semantic categorization of verbs raises several problems, starting with their organization within a graph. Some verb meanings can be more precise than others: get_old could be a subcategory of get_changed, for example. But in the vast majority of cases, verbs do not stand in hypernymy and hyponymy relationships the way nouns do. Verbs with a meaning close to that of reflect, such as think or brainstorm, carry nuances that do not call for a different level of precision. Therefore, in the graph devoted to verbal categories, only the final categories – those without subcategories – can receive verbal meanings.

In this graph, the intermediate categories between the initial node and the final categories serve to ensure overall coherence and to make it easier to locate the right category while annotating, which a simple list would not allow. The main branches correspond to the different types of actions or states expressed by verbs: movement, possession, change, communication, etc. For example, the POSSESSING branch contains the HAVE and HAVE_NOT sub-branches. HAVE contains the categories have_(sthg) and obtain_(sthg), while HAVE_NOT contains have_not_(sthg) and lose_(sthg).

The argument structure 

In addition to this thematic organization, the verbal categories graph and the categories themselves are created according to the argument structure of predicates. For discourse to be intelligible, words impose certain syntactic and/or semantic constraints, and verbs are the most typical example. A conjugated verb must be accompanied by other linguistic units (usually noun phrases) that indicate which entities participate in the process. These units are called arguments (actants in French).

Verbs are therefore at the heart of textual data structuring: they provide information not only about the events and states they designate, but also about the entities designated by other words in the sentence. For example, in Paul sold his car to Sarah, we know from the verb sell that Sarah now owns the car and that Paul no longer does. For the model to capture this information, the categories must take into account two properties related to the argument structure of verbs: verbal valency and semantic roles.

Verbal valency

A concept developed by Lucien Tesnière (Éléments de syntaxe structurale, Klincksieck, 1959), verbal valency is the number of arguments a verb with a given meaning requires in discourse. Impersonal verbs aside, this number varies from one to three. Verbs with one argument are intransitive verbs (I sleep): the single argument is the subject. Verbs with two arguments are transitive verbs with one object, whether direct (I eat an apple) or indirect (This sweater belongs to Paul): the first argument is the subject and the second is the object. Three-argument verbs are transitive verbs with two objects (I gave her a book or I gave a book to her), which are respectively the second and third arguments.
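Valency can be recorded per verb sense; the toy table below simply encodes the examples from this paragraph (the sense labels are invented):

```python
# Toy valency table: (verb, sense label) -> number of required arguments.
VALENCY = {
    ("sleep", "rest"): 1,        # I sleep
    ("eat", "ingest"): 2,        # I eat an apple
    ("belong", "be_owned"): 2,   # This sweater belongs to Paul
    ("give", "transfer"): 3,     # I gave her a book
}
```

Keying on the (verb, sense) pair rather than the verb alone matters, because — as the next paragraph shows — the same verb can take different valencies in different uses.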

However, many verbs alternate between a one-argument and a two-argument valency. Move, for example, can be used both intransitively and transitively: I move slowly / I move the chair. In the first case, it is the subject itself that moves, while in the second it is the object that is moved. Although it is the same verb, the information concerning the arguments differs completely according to the valency, and this must be taken into account in the representation of the information.

This is why the graph has a first division between an ACT branch and an INTERACT branch. Generally speaking, verbs with one argument fall on the ACT side and verbs with two arguments on the INTERACT side. This is the case for verbs of movement and verbs of change (The pages have turned yellow / The sun has turned the pages yellow), which creates parallel categories in both parts of the graph. For some types of verbs, the difference is between two and three arguments, respectively on the ACT and INTERACT sides. This is the case for verbs of possession: own (Paul owns a car) is on the ACT side while give (Paul gives his car to Sarah) is on the INTERACT side.
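A hedged sketch of the ACT/INTERACT split: the threshold depends on the verb type, mirroring the movement and possession examples above (the `branch` helper and its type labels are invented, not Lettria's API):

```python
def branch(verb_type: str, n_arguments: int) -> str:
    """ACT vs INTERACT, with a higher argument threshold for possession verbs."""
    threshold = 3 if verb_type == "possession" else 2
    return "INTERACT" if n_arguments >= threshold else "ACT"
```

So The pages have turned yellow (one argument) lands on the ACT side and The sun has turned the pages yellow (two) on the INTERACT side, while for possession, own (two arguments) stays on ACT and give (three) goes to INTERACT.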

Some verbs can take several valencies without any real difference in meaning, especially verbs that can be used absolutely, i.e. those whose object can be omitted without making the sentence ungrammatical. This is the case of write in Tom writes his name and Tom writes badly, or eat in Sarah eats an apple and Sarah eats too fast. In such cases, there is no need to separate them into different categories: the meaning of the verb is the same and the information about the subject does not differ.

Semantic roles 

In addition to verbal valency, there is the question of the semantic roles assigned to the arguments. A semantic role is the part each argument plays in the process expressed by the verb. A noun with a given grammatical function (subject or object) can have different semantic roles depending on the verb that governs it: agent, patient, beneficiary, goal, place, etc. The list is not exhaustive and varies according to the approach.

However, within the same semantic field, and therefore within the same branch of the graph, two verbs of identical valency may not assign the same roles to their arguments. Among verbs of possession, take offer in They offered a protective helmet to the team and equip in They equipped the team with a protective helmet. In the first sentence, the beneficiary is the indirect object, whereas in the second it is the direct object. These verbs must therefore be assigned different categories, because they generate different semantic structures, and this has consequences for the structuring of information. Thus, offer is found in the category give_(sthg) while equip is found in provide_(sbody). Note that the names of these categories reflect this difference, the direct object being more often a thing in the first case and a human being in the second.

This phenomenon can also occur with a single verb. This is the case of supply: This producer supplies local restaurants with wine / This producer supplies wine to local restaurants. In the first example the beneficiary is the direct object, in the second the indirect one. The verb supply therefore appears in both categories, provide_(sbody) and give_(sthg).
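One way to picture this, purely as an illustration, is a role frame per (verb, category) pair recording which grammatical function carries which semantic role. The frames below encode the two supply examples, with deliberately simplified role and function labels:

```python
# Invented role frames: grammatical function -> semantic role, per verb sense.
ROLE_FRAMES = {
    # This producer supplies local restaurants with wine
    ("supply", "provide_(sbody)"): {
        "subject": "agent", "direct_object": "beneficiary", "oblique": "theme",
    },
    # This producer supplies wine to local restaurants
    ("supply", "give_(sthg)"): {
        "subject": "agent", "direct_object": "theme", "indirect_object": "beneficiary",
    },
}

def beneficiary_function(verb: str, category: str) -> str:
    """Which grammatical function carries the beneficiary role for this sense."""
    frame = ROLE_FRAMES[(verb, category)]
    return next(f for f, role in frame.items() if role == "beneficiary")
```

The same surface verb thus yields two different mappings, which is exactly why it must sit in two categories.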

French pronominals 

Transitive form vs pronominal form

French presents a particular difficulty in the categorization of pronominal verbs, which are traditionally divided into three groups whose properties lead to differences in annotation. Essentially pronominal uses (sometimes called pronominaux lexicalisés, "lexicalized pronominals", in French), which accept only a pronominal object in the same person as the subject, are generally unproblematic. This is the case for verbs that exist only in the pronominal form, such as s’abstenir ("to abstain"), s’évanouir ("to faint") or se repentir ("to repent"), and for certain meanings of verbs that otherwise exist in non-pronominal uses. Thus, in the sentence Je m’attends à ce qu’il pleuve demain ("I expect it to rain tomorrow"), s’attendre does not correspond to any transitive use of attendre. The pronoun se is considered semantically empty and has no semantic role. This use therefore belongs in the category expect alongside non-pronominal verbs such as pressentir and pronostiquer.

When a transitive verb is used in the pronominal form without a change of meaning, it belongs to the same category. In Elle maquille sa sœur and Elle se maquille, the patient is always the object: sa sœur in the first sentence and the pronoun se in the second. But some cases are more problematic. In the sentence Je me déplace lentement, should we consider that se déplacer belongs to the category move_in_space ("to go to another place") with intransitive verbs such as marcher ("to walk"), or should we place it in the category move_(sthg)_in_space like the transitive use of déplacer (I move the chair slowly)? The second option was chosen, considering that the subject (Je) acts on itself (me) as in the previous example.

Two criteria are therefore essential to judge that a pronominal form behaves like a transitive form. First, the verb must exist with the same meaning in the transitive form. This was not the case with s’attendre: one cannot say *J’attends à ce qu’il pleuve demain. It is, on the other hand, the case of déplacer in our last example: Je déplace le fauteuil lentement. Second, the subject must be the agent of the action. This is the case in Je me déplace lentement, but not in a sentence like La pièce s’est réchauffée rapidement ("The room warmed up quickly"). Se réchauffer therefore finds its place in the category get_hot and not in heat_(sthg). Note that the usual lexicographical analysis corroborates this choice, as se réchauffer has its own definition in dictionaries: "Devenir plus chaud. La température se réchauffe." ("To become warmer. The temperature warms up.", Le Petit Robert de la langue française, édition numérique, Éditions Le Robert, 2022).
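The two criteria can be read as a small decision rule. In the sketch below, the boolean judgments are taken from the analysis in the text — in practice they come from linguistic description, not automatic computation:

```python
# Judgments from the text: does a transitive use with the same meaning exist,
# and is the subject the agent of the action?
JUDGMENTS = {
    "s'attendre":    {"transitive_same_sense": False, "subject_is_agent": True},
    "se déplacer":   {"transitive_same_sense": True,  "subject_is_agent": True},
    "se réchauffer": {"transitive_same_sense": True,  "subject_is_agent": False},
}

def same_category_as_transitive(verb: str) -> bool:
    """A pronominal form joins its transitive form's category only if
    both criteria hold."""
    j = JUDGMENTS[verb]
    return j["transitive_same_sense"] and j["subject_is_agent"]
```

Only se déplacer passes both tests, which is why it joins move_(sthg)_in_space while se réchauffer goes to get_hot.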

Accidental passive pronominals

This last case, sometimes considered a passive pronominal, should be distinguished from what we will call accidental passive pronominals. In the sentences Les robes vertes se vendent beaucoup cette année ("Green dresses are selling a lot this year") and Cette machine se pilote facilement ("This machine is easily operated"), the verbs do not meet the criteria required for inclusion in the same category as their transitive form. Yet they differ semantically from the previous example. In La pièce s'est réchauffée rapidement, it is difficult to recast the utterance so as to give la pièce the object function without slightly altering the meaning: On a réchauffé la pièce rapidement implies an agent performing a voluntary action, which the pronominal form did not. The transformation does work, however, with the other two examples. On vend beaucoup de robes vertes cette année and On pilote facilement cette machine do not change the meaning of the original sentences, which already implied an agent acting voluntarily: the people or organizations that sell green dresses sell a lot of them this year, and anyone who operates this machine operates it easily.

This is similar to the passive voice with omission of the agent: Les robes vertes sont beaucoup vendues cette année; Cette machine est facilement pilotée. Again, such a transformation is not possible with the first example: La pièce a été rapidement réchauffée implies an agent that the initial sentence does not. Moreover, in the case of accidental passive pronominals, the same idea can easily be expressed with an impersonal turn of phrase: Il se vend beaucoup de robes vertes cette année; Il est facile de piloter cette machine. The same attempt with the first example again necessarily implies an agent and changes the meaning of the sentence: Il a été rapide de chauffer la pièce.

In the sentences with vendre and piloter, we therefore classify the verbs in the same categories as their transitive uses, unlike the example with réchauffer. The transformations shown above all serve as criteria for correctly categorizing a pronominal verb with passive value. And the result matches our structuring needs: with accidental passive pronominals, we can formalize the action of one entity on another, while in the other case only one entity is involved and it undergoes a change of state.

The problems peculiar to the creation of semantic categories and, more generally, to the structuring of textual data thus require in-depth knowledge and analysis of natural languages and of the specific characteristics of each one. This is a necessity that Lettria takes fully into account in building its models, with a constant concern for efficiency and a single goal in mind: to offer its clients the best possible solution for extracting from their data the information relevant to the smooth operation and development of their business.

