Automatic spelling correction, despite being worked on since the 1970s, remains hard to solve in the absence of significant user data. Noisy text is problematic for many NLP tasks: it reduces the accuracy of machine-learning-based techniques and increases the number of Out-Of-Vocabulary (OOV) words that cannot be handled by popular techniques such as Word2Vec or GloVe. Hence, as part of our preprocessing pipeline, we have explored several approaches for proactive spelling correction of our input data in order to increase our accuracy on downstream tasks. We will focus here on tackling this problem through character-level embeddings. Our work, which supports training from scratch, usage and visualization, is available for download and experimentation by following this link.
Spelling errors can be split into three groups that require different approaches:
- Real Word Errors: ‘three’ spelled as ‘there’
Real word errors are the hardest kind to correct as they are context dependent: to correct them, we have to analyze the whole sentence to check for semantic consistency and grammar. Depending on the algorithm used, we risk wrongly correcting accurate sentences and adding significant processing time. The most common ways to handle this problem are seq2seq models or the use of statistical data such as n-gram tokens.
- Short forms: ‘you’ spelled as ‘u’
Short forms have the characteristic of sharing few letters with the intended word. As far as edit distance is concerned, ‘u’ is closer to ‘I’ than to ‘you’, so dictionary-based approaches are less effective.
- Non-word Errors: ‘fast’ spelled as ‘fsat’
Non-word errors are the most frequent and mostly arise from typos; these will be our focus here.
Spelling correction can be divided into three main tasks:
- Taking a sentence as input, determine which words are misspelled.
- From the misspelled word, find a list of candidates for replacement.
- Choose the appropriate candidate from the list (in the case of automatic correction).
Since we are only interested in non-word errors, determining which words are misspelled is simply a matter of checking them against a dictionary, though special care must be taken with proper nouns.
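The dictionary check described above can be sketched as follows, with a toy dictionary and a deliberately crude capitalization heuristic for proper nouns (both are illustrative assumptions, not our production rules):

```python
# Minimal sketch of non-word error detection: any token absent from the
# dictionary is flagged as misspelled. Proper nouns are handled here by
# skipping capitalized tokens; a real pipeline would use a gazetteer.
DICTIONARY = {"the", "fast", "car", "is", "red"}  # toy dictionary

def find_misspellings(sentence):
    flagged = []
    for token in sentence.split():
        word = token.strip(".,!?").lower()
        if token[0].isupper():       # crude proper-noun heuristic
            continue
        if word and word not in DICTIONARY:
            flagged.append(word)
    return flagged

print(find_misspellings("The car is fsat"))  # ['fsat']
```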
There are a few different ways to select candidates:
- The most naive approach is to compute your chosen edit distance between your word and every entry in your dictionary. While accurate, this approach is prohibitively expensive.
- Phonetic algorithms such as Soundex, Phonex or Metaphone. These algorithms encode any string into a short sequence, which allows string indexing by pronunciation: ‘hurry’ and ‘hirry’ will both return ‘H600’. By preprocessing your entire dictionary and indexing it by phonetic code, you can easily find phonetically similar candidates. This is fast at runtime, but it only corrects phonetic mistakes.
- Computing the list of possible misspellings (insertions, deletions, transpositions and replacements) of your word and matching it against your dictionary. While this is better than the naive approach, it is quite slow because the set of misspellings grows at a rate of 54 * length + 25. See the excellent article by Peter Norvig for more details.
- Symmetric spelling correction, which takes the previous idea and extends it by computing misspellings both for the dictionary and for the misspelled word. See SymSpell for more details. This technique is both accurate and blazing fast, but at the cost of significant precomputation and disk space, and it requires a frequency list for your dictionary.
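The candidate-enumeration idea (and the 54 * length + 25 growth rate) can be checked with a short sketch in the style of Peter Norvig's spelling corrector:

```python
# Generate all edit-distance-1 variants of a word: deletions,
# transpositions, replacements and insertions. For a word of length n
# this yields 54n + 25 variants before deduplication, which is why the
# approach slows down as words grow and edit distance increases.
import string

def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return deletes + transposes + replaces + inserts

variants = edits1("fast")
print(len(variants))       # 241, i.e. 54 * 4 + 25
print("fsat" in variants)  # True: 'fsat' is a transposition of 'fast'
```

Matching this set against the dictionary yields the candidate list.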
We have explored another approach to this problem: using word embeddings to represent words at the character level. Popular approaches for word embeddings such as Word2Vec seek to represent words by their semantics, as this proves very useful for a variety of tasks. However, by using character-based models it is possible to construct word embeddings based on word spelling. The idea is similar to phonetic algorithms, which aim to represent similar words under the same single value (a string or a floating-point number), but we extend the idea by using an n-dimensional embedding.
Our approach is as follows:
- Train a model to produce character-level word embeddings
- Vectorize our entire dictionary and build an index for efficient search
- Vectorize the misspelled word and look for its nearest neighbors
The model uses two layers of LSTM to build an embedding of a chosen size. Higher dimensions provide more accurate results at the cost of longer computation time; for our usage we have settled on dimension 150. The model is trained on tuples of words that are either two completely different words or a word and a misspelling of it. The training objective is to minimize the norm of the difference between the two embeddings for similar tuples and to maximize it for different words.
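The training objective amounts to a contrastive loss. A minimal sketch, with plain Python lists standing in for the 150-dimensional LSTM embeddings and an illustrative margin value:

```python
# Contrastive objective: pull embeddings of a (word, misspelling) pair
# together, push embeddings of two different words at least `margin` apart.
import math

def contrastive_loss(e1, e2, similar, margin=1.0):
    d = math.dist(e1, e2)              # norm of the embedding difference
    if similar:
        return d ** 2                  # minimize distance for similar pairs
    return max(0.0, margin - d) ** 2   # penalize different pairs that are too close

e_word  = [0.2, 0.9]    # toy 2-d vectors; the real model uses 150 dims
e_typo  = [0.25, 0.85]  # a misspelling should land near its word
e_other = [0.6, 0.4]    # an unrelated word

print(contrastive_loss(e_word, e_typo, similar=True))    # small: pair is already close
print(contrastive_loss(e_word, e_other, similar=False))  # positive: pair is too close
```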
Training data was generated from a dictionary of 600k words by producing several misspellings of edit distance 1 or 2 for each word. We assigned higher probabilities to misspellings involving characters near the original on an AZERTY keyboard (since we are working in French) so that the model would favor these. Due to the lack of a clear training metric it is difficult to know exactly when to stop training, but after going through a few million training examples we were satisfied with the results.
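The keyboard-weighted misspelling generation can be sketched like this; the adjacency map below is a small illustrative excerpt of an AZERTY layout, not our full table:

```python
# Generate an edit-distance-1 misspelling of a word, preferring
# replacement characters that sit near the original key on an AZERTY
# keyboard. Partial adjacency map, for illustration only.
import random

AZERTY_NEIGHBOURS = {
    "a": "zq", "z": "aes", "e": "zrd", "q": "asw", "s": "qzed",
}

def misspell(word, rng=random):
    i = rng.randrange(len(word))
    c = word[i]
    # Prefer a neighbouring key when we know one, else any letter.
    pool = AZERTY_NEIGHBOURS.get(c, "abcdefghijklmnopqrstuvwxyz")
    return word[:i] + rng.choice(pool) + word[i + 1:]

random.seed(0)
print([misspell("sage") for _ in range(3)])
```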
Once trained the model can be used to vectorize words. By projecting them on a 2D plane with PCA we can obtain the following visualization:
The original model we used was developed by IntuitionEngineering.
Success! Similar words appear grouped together in the vector space.
This is not very useful by itself, so now we vectorize our entire 600k-word dictionary. This process takes around 10 minutes on a modern CPU.
The final step is to build an index that allows efficient search for the nearest vectors. Since we don’t need 100% accuracy and are more concerned with speed, we use nmslib, a library specialized in Approximate Nearest Neighbor (ANN) search.
This library lets us search the index for the k nearest neighbors of a given vector. We can adjust k to balance accuracy against computing time. Once we have the list of nearest neighbors, we can filter further by edit distance to keep only relevant suggestions. This gives us the following output (in French):
```
python usage_correction.py 'langqge'
Edit distance 1: langqge : ['langage']
Edit distance 2: langqge : ['langages', 'lange', 'langé', 'langage']
```
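The lookup-and-filter step can be sketched with a brute-force nearest-neighbor scan standing in for nmslib's approximate (HNSW) index, and made-up 2-d "embeddings" standing in for the real 150-d vectors:

```python
# Stand-in for the ANN lookup: brute-force k-nearest-neighbour search
# over toy embeddings, followed by the edit-distance filter. In the real
# pipeline nmslib replaces the linear scan with an approximate index.
import math

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy dictionary embeddings (assumed, for illustration).
embeddings = {"langage": (0.1, 0.2), "langages": (0.12, 0.22),
              "lange": (0.15, 0.18), "maison": (0.9, 0.8)}

def correct(word, vec, k=3, max_dist=2):
    knn = sorted(embeddings, key=lambda w: math.dist(embeddings[w], vec))[:k]
    return [w for w in knn if levenshtein(word, w) <= max_dist]

print(correct("langqge", (0.1, 0.2)))  # ['langage', 'langages', 'lange']
```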
We tested our model on a corpus of around 4k sentences generated with our spelling-error generator. Our metric is accuracy, which we define as the percentage of times the correct word appears in our candidate list.
On the whole corpus, the chars2vec model gives us an accuracy of 85%, while a standard phonetic algorithm stands at around 40%. Combining the two approaches gives us 90% accuracy. The main bottleneck is words of 4 characters or fewer, which do not perform well.
This was to be expected as :
- Words of that size have a lot of neighbors within short edit distance.
- A typo in a 3-letter word is much more impactful than in a 10-letter one, which makes it much harder for the model to map such words next to each other in the vector space.
While our previous example directly gives us the desired answer, there are often several candidates at the same edit distance, and a choice has to be made. We have tried two different approaches:
- Probabilistic: use a frequency list built on a large corpus and choose the most frequent candidate. The obvious downside is that some infrequent words may never be chosen; still, while extremely simple, this approach provides good accuracy.
- Semantic: rank candidates by their semantic similarity to the surrounding words in the sentence, taking a pretrained skip-gram Word2Vec and computing the mean distance between the candidate’s embedding and those of the surrounding words (or the whole sentence). While computationally heavy, this gives very good results for words with strong semantic content, though it is not very effective for words such as ‘there’, ‘I’ or ‘are’, which can be used in any context.
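The semantic ranking can be sketched with a toy embedding table standing in for a pretrained skip-gram Word2Vec (all vectors and words below are made up for illustration):

```python
# Rank candidates by mean cosine similarity to the context words and
# keep the best-scoring one. Toy 2-d vectors replace real Word2Vec
# embeddings; context words missing from the table are simply skipped.
import math

VECS = {"coffee": (0.9, 0.1), "drink": (0.8, 0.3), "morning": (0.7, 0.2),
        "coffin": (0.1, 0.9), "hot": (0.75, 0.25)}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank(candidates, context):
    def score(cand):
        sims = [cosine(VECS[cand], VECS[w]) for w in context if w in VECS]
        return sum(sims) / len(sims)
    return max(candidates, key=score)

# 'coffe' could be 'coffee' or 'coffin'; the surrounding words decide.
print(rank(["coffee", "coffin"], ["hot", "morning", "drink"]))  # coffee
```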
Empirically, we found that the semantic approach works quite well for longer, less frequent words, while the probabilistic approach is better for shorter words. Combining the two approaches with handcrafted rules gives us around a 70% correction rate on our randomly generated corpus.
Future work: N-gram tokens
Instead of choosing between semantics and frequency, the ideal approach would be to combine both. This can be achieved by using frequency tables of word N-grams.
To build these tables, consecutive sequences of N words are counted and compiled over a very large corpus. The best resource available for this is Google Books Ngram, which was built by analyzing around 5 million books in various languages.
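Building and querying such a table can be sketched in a few lines; the miniature corpus below stands in for a resource like Google Books Ngram:

```python
# Count word bigrams over a (tiny, illustrative) corpus, then use the
# counts to choose among candidates given the preceding word.
from collections import Counter

corpus = "spelling correction is hard . spelling correction needs data".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # consecutive word pairs

def pick(prev_word, candidates):
    # Counter returns 0 for unseen pairs, so missing bigrams score lowest.
    return max(candidates, key=lambda c: bigrams[(prev_word, c)])

print(bigrams[("spelling", "correction")])            # 2
print(pick("spelling", ["correction", "corrector"]))  # correction
```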
‘Spelling Correction’ bi-gram occurrences
The raw data is available at this link, though due to the sheer amount of text it takes significant time to download and compile into a usable state.
Using character-level embeddings gives good overall performance and is particularly effective for longer words. With handcrafted preprocessing rules and adequate candidate selection, this provides a decent alternative to existing spelling-correction solutions.
Further work will include a proper benchmark against existing popular solutions and the use of n-gram tokens. In the meantime, this package is free to experiment with: char2vec.