Disambiguation as a challenge for GraphRAG

Explore the different aspects of ambiguity, an inherent problem in natural language processing, and the techniques of disambiguation used to solve it.


Introduction

Artificial intelligence applications, and GraphRAG in particular, often have to deal with natural language and its peculiarities: systems need to understand, interpret and generate human language reliably and accurately. However, this task runs up against the problem of ambiguity. Linguistic ambiguity is an inherent feature of natural languages: an expression, a word or a sentence can have several interpretations. The process of removing ambiguity from a text is called disambiguation; in the field of natural language processing (NLP), resolving the meaning of an ambiguous word is specifically referred to as word sense disambiguation. This article explores the different types of linguistic ambiguity, their consequences for NLP and the disambiguation techniques used to overcome this obstacle.

The different types of linguistic ambiguity

The three main types of linguistic ambiguity are: 

  • lexical ambiguity
  • syntactic ambiguity
  • referential ambiguity

Lexical ambiguity

Lexical ambiguity arises when a word can be understood in different ways. More precisely, it can be a polysemous word or a word belonging to a group of homophones or homographs. 

  • A polysemous word is a word with several meanings. An example is head, which designates the upper part of the human body, the mind, the front or top part of something, the chief of a group, etc. In this case, the ambiguity exists both orally and in writing.
  • Homophones are words that are pronounced the same. Examples include flower and flour. In this case, ambiguity arises only orally.
  • Homographs are words that have the same spelling. This is the case, for example, with bass, which refers to a musical instrument, and bass, which refers to a fish. In this case, the ambiguity exists only in writing.

Of course, when words are both homographs and homophones, ambiguity exists both orally and in writing. One example is the pair of words spelled bat, which designate an animal and a piece of sports equipment respectively.

Note that we do not use the word homonym here, whose meaning may vary according to different linguistic theories. According to some, homonyms are necessarily homophones, whether or not they are homographs. According to others, homographs that are not homophones are homonyms. Finally, according to some schools of thought, homonyms are only words that are both homophones and homographs.

Syntactic ambiguity

Syntactic ambiguity occurs when the structure of a sentence allows several interpretations. For example, the sentence I am looking at the man with a telescope could mean that the observer is using a telescope, or that the man being observed has a telescope. In this case, as in many others, the same sentence can correspond to several deep syntactic structures. In another example, the fear of the enemy, the enemy can be either the subject or the object of fear: either the enemy is afraid, or someone is afraid of the enemy.
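The idea that one sentence can correspond to several syntactic structures can be made concrete with a small parser. The following is a minimal sketch, assuming a hand-written toy grammar in Chomsky normal form; the grammar, the category names and the function count_parses are all hypothetical and only cover the telescope example. It uses the CKY algorithm to count how many distinct parse trees the grammar assigns to a sentence.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
# These rules are an assumption made for this sketch, not a real grammar of English.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),      # the prepositional phrase modifies the action
    ("NP", "NP", "PP"),      # the prepositional phrase modifies the noun
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
    ("V", "Aux", "VG"),      # "am" + "looking at"
    ("VG", "Ving", "Part"),  # "looking" + "at"
]

# Each word is mapped to its possible categories.
LEXICON = {
    "i": ["NP"], "am": ["Aux"], "looking": ["Ving"], "at": ["Part"],
    "the": ["Det"], "a": ["Det"], "man": ["N"], "telescope": ["N"],
    "with": ["P"],
}


def count_parses(sentence):
    """Return the number of parse trees rooted in S for the sentence (CKY)."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(start, end)][category] = number of ways to derive words[start:end]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1)][category] += 1
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            for split in range(start + 1, end):
                left, right = chart[(start, split)], chart[(split, end)]
                for parent, b, c in BINARY_RULES:
                    if b in left and c in right:
                        chart[(start, end)][parent] += left[b] * right[c]
    return chart[(0, n)]["S"]


print(count_parses("I am looking at the man with a telescope"))  # 2
print(count_parses("I am looking at the man"))                   # 1
```

The two parses of the first sentence differ only in where with a telescope attaches: to the verb phrase (the observer uses the telescope) or to the noun phrase (the man has the telescope). Nothing in the grammar can choose between them, which is exactly the ambiguity described above.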

In some cases, syntactic ambiguity may be linked to the fact that a word may belong to more than one grammatical category. This is the case in They are flying planes, where flying can be a verb with they as its subject, or a present participle used as an adjective qualifying planes. Two interpretations are then possible: they are planes that are flying, or they are people who are flying planes.

Referential ambiguity

The third type is referential ambiguity, sometimes known as pragmatic ambiguity. It is a matter not of semantics, which concerns the literal meaning of words, but of pragmatics, which deals with how meaning is interpreted and understood in context. It occurs when a word or group of words that is not lexically ambiguous can nevertheless designate several things. The competing referents may be elements of the sentence itself, particularly when it contains an anaphoric word such as a pronoun.

For example, in the sentence Mary told Anna that she was going to win, she can refer to Mary as well as Anna. Referential ambiguity can also arise from deictics, words used to designate a referent specific to the enunciation situation. If we look at the sentence I'll go there with this, we know that the interlocutor may misinterpret there and this if these words can designate more than one referent in the situation.


NLP and language ambiguity

Why is linguistic ambiguity a problem for NLP?

Linguistic ambiguity is a major challenge for NLP because it makes it difficult for systems to understand and analyze human language accurately. Whether for information extraction, text comprehension or natural language generation, NLP needs to represent information in a structured, unambiguous way. But when a word, sentence or expression has several possible interpretations, algorithms can struggle to determine the correct meaning. For example, in information retrieval, ambiguities can lead to irrelevant search results, affecting the quality of user interaction.

The case of lexical ambiguity

Among the various types of linguistic ambiguity, lexical ambiguity has a special status with regard to NLP. With other types of ambiguity, the sentence is not always enough to disambiguate. In a sentence such as Mary told Anna that she was going to win, it is impossible for either machine or human to disambiguate without more context. By contrast, in the case of polysemous words and of words that share a form, such as bat, the sentence in most cases allows a human reader to understand the meaning. It is therefore conceivable that a machine could do the same if trained to do so.

NLP's disambiguation solutions

Natural language processing, and in particular overcoming the obstacle of ambiguity, has been a long-standing problem in computer science. Numerous disambiguation techniques have been investigated in NLP, including:

  • Dictionary-based methods that use the knowledge encoded in lexical resources.
  • Supervised machine learning methods in which a classifier is trained for each distinct word on a corpus of manually annotated examples.
  • Fully unsupervised methods, often graph-based, that cluster word occurrences, thus inducing word meanings.
  • Semi-supervised methods that combine supervised and unsupervised approaches to leverage the advantages of each method while mitigating their respective drawbacks.
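To illustrate the first family in the list above, here is a minimal sketch of a dictionary-based method in the spirit of the simplified Lesk algorithm: each candidate sense is scored by the number of words its definition shares with the sentence. The sense inventory, the glosses, the stopword list and the function names are hypothetical, written for this example; a real system would draw on an actual lexical resource.

```python
# Words too common to be informative (a small hand-picked list for this sketch).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "with", "that",
             "or", "and", "at", "out", "is"}

# Hypothetical sense inventory: word -> {sense label: gloss}.
SENSE_INVENTORY = {
    "bat": {
        "animal": "nocturnal flying mammal with wings that lives in caves",
        "sports equipment": "wooden club used to hit the ball in baseball or cricket",
    },
    "bass": {
        "fish": "freshwater or sea fish caught by anglers in lakes and rivers",
        "instrument": "low pitched guitar or voice that plays the deepest notes in music",
    },
}


def content_words(text):
    """Lowercase the text, strip punctuation and drop stopwords."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - STOPWORDS


def simplified_lesk(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        sense: len(context & content_words(gloss))
        for sense, gloss in SENSE_INVENTORY[word].items()
    }
    # On a tie, max() keeps the first sense listed in the inventory.
    return max(scores, key=scores.get)


print(simplified_lesk("bat", "The player swung the bat and hit the ball"))
print(simplified_lesk("bat", "A bat flew out of the caves at night"))
```

The first call selects the sports equipment sense because hit and ball appear in its gloss; the second selects the animal sense through caves. The approach needs no training data, but it depends entirely on how well the dictionary definitions happen to overlap with the sentence.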

Among these methods, supervised learning approaches have been the most successful to date.
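As an illustration of the supervised family, the sketch below trains one classifier for a single ambiguous word, bass, on a handful of manually labeled sentences. It is a minimal example under stated assumptions: the class WordSenseClassifier, the training sentences and the sense labels are invented for this illustration, and a multinomial Naive Bayes model with add-one smoothing stands in for whatever model a production system would use.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class WordSenseClassifier:
    """Naive Bayes classifier for the senses of one ambiguous word."""

    def __init__(self):
        self.sense_counts = Counter()            # examples per sense
        self.word_counts = defaultdict(Counter)  # token counts per sense
        self.vocab = set()

    def train(self, examples):
        for text, sense in examples:
            self.sense_counts[sense] += 1
            for token in tokenize(text):
                self.word_counts[sense][token] += 1
                self.vocab.add(token)

    def predict(self, text):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(count / total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # tokens unseen in training are ignored
                    score += math.log((self.word_counts[sense][token] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Hypothetical annotated corpus for the word "bass".
TRAINING_DATA = [
    ("he plays bass in a jazz band", "music"),
    ("the bass guitar was too loud", "music"),
    ("she tuned the strings of her bass", "music"),
    ("we caught a large bass in the lake", "fish"),
    ("grilled bass with lemon for dinner", "fish"),
    ("the bass swam near the river bank", "fish"),
]

classifier = WordSenseClassifier()
classifier.train(TRAINING_DATA)
print(classifier.predict("the band needs a bass player"))  # music
print(classifier.predict("a bass from the lake"))          # fish
```

With six training sentences the model only reacts to a few cue words such as band or lake. In practice, the quality of a supervised approach depends mostly on the size and consistency of the annotated corpus, which is why data annotation is such a large part of this kind of project.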

Lettria's approach to disambiguation 

Based on this observation, Lettria has chosen to set up a lexical disambiguation project using a supervised machine learning method. The project began with data annotation, which included the creation of semantic categories, the selection of corpora, the training of a team of annotators and the running of annotation sessions on a dedicated platform. Then came the training and evaluation stage, involving the choice of a model, its preparation, the development of evaluation methods and the analysis of results according to different criteria.

For Lettria, disambiguation is not an end in itself, but rather one of the key elements in our solution for knowledge extraction from unstructured documents. For example, it improves the understanding of natural language questions and the search for the most relevant information, both of which are essential in this type of project. Thus, the good results of this work could perhaps offer us particularly exciting perspectives for information processing in general and for GraphRAG in particular.

Conclusion

As a player in the NLP field, Lettria has faced the problem of ambiguity, and in particular lexical ambiguity, a major obstacle to the formalization of information produced in natural language. By using a supervised machine learning method, we are now able to address this issue and use disambiguation to further improve solutions based on knowledge graph generation. This is one of the hallmarks of Lettria: combining the best of different approaches to overcome current GenAI limitations in knowledge extraction for the benefit of organizations.
