6 minute read
This article goes through the theory to demystify this lesser-known part of NLP. Then, in a second article, we will suggest tools to help you easily implement a Dependency Parser.
When we think about a word's neighbors, we can define the neighborhood by their position in the sentence, by their relation to other words (subject, determiner, etc.), called syntax, or by their similarity in meaning, called semantics. What interests us here is the syntactic neighborhood.
{{line}}
Vocabulary
First, let's define some vocabulary to make it clearer for everyone.
- Semantics is the linguistic and philosophical field that studies meaning and interpretation. It relies heavily on the links between words to understand a sentence, and it analyzes changes in meaning. In programming, semantics refers to the expected behavior of a program.
- Syntax is the linguistic field of grammar: the study of the rules governing word patterns in sentences. It exists in programming too, where syntax errors often lead to bugs, because the rules are much stricter than in spoken language.
{{line}}
What is a Dependency Parser?
A Dependency Tree is a structure that can be defined as a directed graph, with |V| nodes (vertices) corresponding to the words, and |A| arcs corresponding to the syntactic dependencies between them. We may also want to attach labels to the dependencies, called relations. These relations give details about the dependency type (e.g. subject, direct object, determiner...). You can find all the relations from Universal Dependencies by following this link: https://universaldependencies.org/u/dep/index.html
Example of a Dependency Tree: "What is a parser?"
In an arc h -> d, h is the head and d is the dependent. The head is the most important node in a phrase, while the Root is the most important node in the whole sentence: it is directly or indirectly the head of every other node.
A Dependency Parser simply transforms a sentence into a Dependency Tree.
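To make this concrete, here is a minimal sketch that inspects the dependency tree of our example sentence with spaCy, an off-the-shelf dependency parser (it assumes the `en_core_web_sm` model is installed; the exact arcs and labels depend on the model, so treat the output as illustrative).

```python
# A minimal sketch: inspecting a dependency tree produced by spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("What is a parser?")

for token in doc:
    # token.head is the head of the arc, token.dep_ its relation label;
    # the root of the sentence points to itself and carries the label "ROOT".
    print(f"{token.head.text:>8} --{token.dep_}--> {token.text}")
```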
{{line}}
Metrics: how to recognize a good parser?
An accurate Dependency Parser recognizes the dependencies and relations between words well. Two Metrics (scores) are useful for this:
- Unlabeled Attachment Score (UAS), the proportion of words whose head is correctly predicted;
- Labeled Attachment Score (LAS), the proportion of words whose head and relation are both correctly predicted.
LAS is always less than or equal to UAS, because an incorrect dependency lowers both UAS and LAS, while an incorrect relation (label) only lowers the LAS.
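To make the two metrics concrete, here is a minimal sketch that computes them. It assumes each parse is given as a list of (head index, relation) pairs, one per word, in sentence order; the example values are made up.

```python
# A minimal sketch of UAS/LAS over one sentence.
def attachment_scores(gold, predicted):
    assert len(gold) == len(predicted)
    total = len(gold)
    # UAS: only the head has to be right.
    uas_hits = sum(g_head == p_head
                   for (g_head, _), (p_head, _) in zip(gold, predicted))
    # LAS: both the head and the relation label have to be right.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / total, las_hits / total

# Hypothetical example: one wrong head (hurts UAS and LAS),
# one wrong label (hurts LAS only), so LAS <= UAS.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "punct")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj"), (3, "obl")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```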
{{line}}
Algorithm: how does it work?
As you might have guessed, we could build a Dependency Parser from rules developed by linguists. These parsers are called rationalist parsers. They are not efficient at all, since languages are very complex and change over time: any small change in the language would lead to tremendous changes in the parser. Machine Learning allows for the development of empirical parsers, which are data-driven: fed with many sentences, they can estimate the probabilities of dependencies and relations. Linguistic knowledge may be used, but it does not have the last word, which is a good point if you, like me, have forgotten your primary school lessons...
Several steps are needed to create a Dependency Parser. Our inputs are the words of the sentence with their properties (index, Part of Speech tag, Lemma, Features); then, we must calculate features for all possible arcs in the sentence. Thanks to these features, we compute a score for each possibility, and we finally decode scores with a decoder.
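As an illustration of these inputs, here is a minimal sketch of one possible token representation; the field names are mine, loosely inspired by the CoNLL-U convention, and the tags on the example sentence are only indicative.

```python
# A minimal sketch of the parser input: one record per word, with the
# properties mentioned above (index, form, lemma, POS tag, features).
from dataclasses import dataclass

@dataclass
class Token:
    index: int   # position in the sentence, 1-based (0 is reserved for an artificial root)
    form: str    # the word as it appears
    lemma: str   # dictionary form
    upos: str    # Universal part-of-speech tag
    feats: dict  # morphological features, e.g. {"Number": "Sing"}

sentence = [
    Token(1, "What", "what", "PRON", {}),
    Token(2, "is", "be", "AUX", {"Number": "Sing", "Person": "3"}),
    Token(3, "a", "a", "DET", {}),
    Token(4, "parser", "parser", "NOUN", {"Number": "Sing"}),
]
```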
Features and Score
Each word in the sentence has attributes, such as its Part-of-Speech tag or its lemma. You might already know them if you have read about NLP; if not, you can check them out here:
With these features, we train a Machine Learning regression model that returns a score for each candidate arc, which the decoder then exploits.
Feature selection is crucial, and some models allow us to bypass this step thanks to a deep learning component. This is the case for the algorithm we will present in the following section.
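To fix the idea, here is a minimal sketch of hand-crafted arc features scored by a toy linear model, reusing the Token records from the sketch above. The feature names and weights are illustrative assumptions; real parsers learn these weights from treebank data, or replace the hand-crafted features altogether with a neural network.

```python
# A minimal sketch: features for a candidate arc head -> dependent,
# scored by a toy linear model (weighted sum of active features).
def arc_features(sentence, head_idx, dep_idx):
    head, dep = sentence[head_idx - 1], sentence[dep_idx - 1]
    return {
        f"pos_pair={head.upos}+{dep.upos}": 1.0,
        f"distance={head_idx - dep_idx}": 1.0,
        f"dep_lemma={dep.lemma}": 1.0,
    }

def score_arc(features, weights):
    # Linear model: sum of weight * value for every active feature.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def score_all_arcs(sentence, weights):
    """Score every candidate arc (head, dependent) in the sentence.
    (In practice an artificial ROOT node, index 0, is also a candidate head.)"""
    n = len(sentence)
    return {
        (h, d): score_arc(arc_features(sentence, h, d), weights)
        for h in range(1, n + 1) for d in range(1, n + 1) if h != d
    }

# Toy usage with made-up weights, on the `sentence` defined earlier.
weights = {"pos_pair=AUX+PRON": 2.0, "pos_pair=NOUN+DET": 1.5}
scores = score_all_arcs(sentence, weights)
```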
Decoders
There are a lot of different decoders already developed. However, we can divide them into two categories: Transition-based decoders and Graph-based ones. Transition-based decoders are faster and need less memory to decode scores, but they are generally less accurate than Graph-based decoders. Different transition-based algorithms apply different sets of transitions, but they all follow the same principle. I will only go through Graph-based model principles in this article.
Graph-Based Decoders
A bit of graph theory is necessary to understand these algorithms.
A graph G = (V, A) is a set of vertices V (also called nodes), which represent the tokens, and of arcs (i, j) ∈ A with i, j ∈ V, which represent the dependencies between two words.
In a Graph-based dependency parser, graphs are directed, which means each arc has a direction, and there can be multiple arcs between the same pair of nodes: such a graph is called a multi-digraph.
Weighted Multi Directed Graph (G)
You can see that some arrows are thicker than others: this represents the weight of the arcs. The higher an arc's weight, the stronger the link between the two nodes. We can interpret this as the strength of the syntactic dependency for our parser. For example, C and A seem to be very dependent on B, but B does not seem very dependent on C and A.
Graph G is too connected. In order to get a Dependency Tree, we want:
- To link each word only with its dependents - not with all the words. The total number of arcs should be equal to the number of nodes minus 1 (|A| = |V|-1).
- To keep the same nodes (or tokens or words).
- To make it acyclic: we do not want a head to be dependent on one of its dependents (direct or indirect).
Fortunately, all of this already has a name: what we want is a Spanning Tree!
Example of Spanning Tree from the graph G
Another example of a Spanning Tree
If I have been clear about what a Spanning Tree is, you will have noticed that there are multiple possibilities, since we only have a few conditions to satisfy. Here comes the tricky part: we want the best one, of course, but how do we determine which one is the "best"?
We have 3 nodes here, and we want to keep them. However, we have 6 arcs and we want to keep only 2. The "best" Dependency Tree is the one that has the highest weights: this is called the Maximum Spanning Tree (MST).
Maximum Spanning Tree of G
Minimum Spanning Tree of G
This Maximum Spanning Tree gives us our Dependency Tree, which we will use to find the closest syntactic neighbors in the sentence.
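To close the loop, here is a minimal sketch of graph-based decoding: we build a directed graph whose edge weights play the role of arc scores, then extract the Maximum Spanning Tree (more precisely, a spanning arborescence) with networkx's implementation of Edmonds' algorithm, in the spirit of the classic Chu-Liu/Edmonds approach. The scores below are made up for illustration.

```python
# A minimal sketch of MST decoding over made-up arc scores.
import networkx as nx

scores = {
    ("ROOT", "is"): 10.0,
    ("is", "What"): 7.0,
    ("is", "parser"): 8.0,
    ("parser", "a"): 6.0,
    ("parser", "What"): 2.0,  # a competing, weaker arc
    ("What", "parser"): 1.0,  # would create a cycle with the arc above
}

# Build the weighted directed graph.
G = nx.DiGraph()
for (head, dependent), score in scores.items():
    G.add_edge(head, dependent, weight=score)

# Edmonds' algorithm: keeps exactly one incoming arc per word,
# no cycles, and maximizes the total weight.
mst = nx.maximum_spanning_arborescence(G, attr="weight")
for head, dependent in mst.edges():
    print(f"{head} -> {dependent}")
```

Each word ends up with exactly one head, the result is acyclic, and the total weight is as large as possible, which is exactly the Dependency Tree we were after.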
{{line}}
Conclusion
The insight given here is very light compared to the many existing algorithms. However, it should improve your intuition when developing the Bist-Parser, the implementation of which is detailed via this link.