Say Goodbye to Typos and Spelling Errors: Fix Them at Scale with Spark NLP and Python
Correcting typos and spelling errors is an important task in NLP pipelines. Being able to rely on clean data, free of spelling problems, can improve the performance of many machine learning models applied to it. In this post, we introduce how to perform spell checking with rule-based and machine learning based models in Spark NLP with Python.
Introduction
Spell checking is the process of identifying words in texts that have spelling errors or are misspelled. Text data originating from social media or extracted from images using Optical Character Recognition (OCR) usually contains typos, misspellings, spurious symbols, or other errors that can impact machine learning models trained on this data.
From a machine learning perspective, these spelling errors can degrade the performance of models applied to the data. For example, if the word John appears in the data both with the correct spelling and as J0hn (a zero character replacing the letter o), a model would treat them as two separate words, which could cause unexpected outcomes in its predictions.
This happens because words that should share the same sense (and the same weight/embedding) are treated as two different entities, increasing the complexity of the model and reducing its capability to learn the underlying representation of the words.
To tackle this problem, we can use spell checking and correction to preprocess the data before using it for model training. In the rest of this post, we will show how to run both rule-based and machine learning based spell checking with the Spark NLP library in Python.
If you are not familiar with Spark NLP, I suggest you review the documentation or this blog post. In particular, it is important to be familiar with the concepts of annotators and annotations.
Background
To fix spelling errors, the usual approach is to make modifications to the selected word and compare the results with existing words in a dictionary. If a modified version of the word is found in the dictionary, it becomes a candidate to replace the misspelled word.
The modifications that can be made to a word are (see the sketch after this list):
- Add one letter (insert)
- Delete one letter (delete)
- Replace one letter for another one (replace)
- Swap two adjacent letters (transpose)
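As an illustration of these four operations, here is a minimal Python sketch; the edits1 helper is hypothetical and written only for this post, it is not part of Spark NLP:

# Illustrative sketch of the four edit operations on a single word.
def edits1(word: str) -> set:
    """All strings that are one edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]                   # delete
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]                        # transpose
    replaces = [left + c + right[1:] for left, right in splits if right
                for c in letters]                                                     # replace
    inserts = [left + c + right for left, right in splits for c in letters]           # insert
    return set(deletes + transposes + replaces + inserts)

print("hello" in edits1("helo"))  # True: one insertion fixes the typo

Any of these candidate strings that also appears in the dictionary is a possible correction.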
Apart from the brute-force approach of checking every modification of the word against every word in the dictionary, there are two widely used algorithms to identify candidate words:
- Peter Norvig: creates a set of modified words by applying edits of up to edit distance two to the search word. For a word of length n, the number of distance-one modifications is n (deletes) + n-1 (transposes) + 26n (replaces) + 26(n+1) (insertions) = 54n + 25, which can get big. Any modified word present in the dictionary is added to the list of candidates, and the candidate with the highest probability (based on the training corpus) is selected.
- SymSpell: like Norvig's approach, but much faster at inference time, since the modified words are generated using only the delete operation. This makes the number of modified words much smaller (n instead of 54n + 25), so the comparison with the dictionary words is much faster. Also, since no insertions are needed, it is language agnostic. A sketch comparing the candidate counts follows this list.
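To make the difference concrete, the snippet below simply evaluates the two formulas above for a few word lengths; the helper functions are illustrative and not part of any library:

# Compare how many edit-distance-1 candidates each strategy generates.
def full_edit_count(n: int) -> int:
    # n deletes + (n - 1) transposes + 26n replaces + 26(n + 1) inserts = 54n + 25
    return n + (n - 1) + 26 * n + 26 * (n + 1)

def delete_only_count(n: int) -> int:
    # SymSpell-style: only deletions are generated
    return n

for n in (5, 10, 20):
    print(f"word length {n}: full edits = {full_edit_count(n)}, deletes only = {delete_only_count(n)}")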
More recently, machine learning and deep learning approaches have achieved excellent results, as they are capable of not only checking the candidate word but also using the context (the words around the search word). The capability of checking the context is crucial, as the correct word can differ depending on it. Let's illustrate this with an example. Consider the word siter. This word is not part of the English dictionary, so we can check which word was intended by making a single one-letter change:
- sister, by adding one “s” to the word
- site, by removing the “r”
- sites, by replacing the “r” with an “s”
All three of these words exist in the English dictionary and are candidates to correct the word “siter”; which one to choose depends on the context. For example, which one should we use in the sentence “I will call my siter.”? With that context, the answer is clear.
Spell checking in Spark NLP
In Spark NLP, there are three different annotators for spell checking and correction:
- The NorvigSweeting annotator is based on Peter Norvig's algorithm, with some modifications such as limiting vowel swapping and using Hamming distance as well as Levenshtein distance.
- The SymmetricDelete annotator is based on the SymSpell algorithm.
- The ContextSpellChecker annotator is a deep learning model that uses contextual information to both detect errors and produce the best corrections through a Viterbi decoder.
These annotators take token (word) annotations as input and output tokens with the corrections applied. Thus, to create the Spark NLP pipeline we need just three stages: a DocumentAssembler to transform raw texts into document annotations, a Tokenizer that splits the documents into tokens/words, and the spell checker. Let's look at an example.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    Tokenizer,
    ContextSpellCheckerModel,
    NorvigSweetingModel,
    SymmetricDeleteModel,
)
from pyspark.ml import Pipeline

# Start spark session
spark = sparknlp.start()

documentAssembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# The spell checker annotators (we define all three to compare them)
symspell = (
    SymmetricDeleteModel.pretrained("spellcheck_sd")
    .setInputCols(["token"])
    .setOutputCol("symspell")
)

norvig = (
    NorvigSweetingModel.pretrained("spellcheck_norvig")
    .setInputCols(["token"])
    .setOutputCol("norvig")
)

context = (
    ContextSpellCheckerModel.pretrained("spellcheck_dl")
    .setInputCols("token")
    .setOutputCol("context")
)

# Define the pipeline stages
pipeline = Pipeline().setStages(
    [documentAssembler, tokenizer, symspell, norvig, context]
)
We defined the necessary stages of the pipeline. Note that we used the Model version of the annotators (e.g., ContextSpellCheckerModel), which are pretrained models available on the NLP Models Hub and are automatically downloaded and instantiated. To make predictions with the pipeline, we first need to fit it to data and obtain a PipelineModel, which exposes the .transform() method. Since our pipeline contains only pretrained models, no training is performed during the fit operation; it is just a formality. Thus, let's create an empty data frame to fit the pipeline.
empty_df = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(empty_df)
And now we are ready to perform inference on Spark data frames.
example = spark.createDataFrame(
    [["Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste"]]
).toDF("text")

result = pipelineModel.transform(example)

result.selectExpr(
    "norvig.result as norvig",
    "symspell.result as symspell",
    "context.result as context",
).show(truncate=False)
Obtaining the following result:
+--------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|norvig                                                                                 |symspell                                                                             |context                                                                               |
+--------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
|[Please, allow, me, tao, introduce, myself, ,, I, am, a, man, of, wealth, und, taste]  |[Place, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, und, taste]  |[Please, allow, me, to, introduce, myself, ,, I, am, a, man, of, wealth, and, taste]  |
+--------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------+
We can see that the context model obtained the best result in this example. This is usually the case, since it is a more complex model that takes context into consideration.
But what happens if the pretrained spell checkers do not work well in your specific domain? In that case, you can train your own models specialized for that domain. Let's see how to do that.
One-liner alternative
In October 2022, John Snow Labs released the open-source johnsnowlabs library, which bundles all the company's products, open-source and licensed, under one common library. This simplifies the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). The new library is a wrapper around all of John Snow Labs' libraries and can be installed with pip:
pip install johnsnowlabs
Please check the official documentation for more examples and usage of this library. With this new library, we can run spell checking with one line of code. You can check the namespace reference for all the possible models; here we will show how to run the default ones.
# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

# Use Norvig model
nlp.load("en.spell.norvig").predict("Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste")

# Use Symmetric Delete model
nlp.load("en.spell.symmetric").predict("Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste")

# Use context aware model
nlp.load("en.spell.context").predict("Plaese alliow me tao introdduce myhelf, I am a man of wealth und tiaste")
NOTE: when using only the johnsnowlabs library, make sure you initialize the Spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are using only the open-source libraries, you can start the session with spark = nlp.start(nlp=False). The default parameters of the start function include the licensed Healthcare NLP library (nlp=True), but we can set it to False and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.
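For example, under the assumptions described in the note above, an open-source-only session can be started like this:

# Start a Spark session using only the open-source libraries,
# skipping the licensed Healthcare NLP library.
from johnsnowlabs import nlp

spark = nlp.start(nlp=False)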
Training Spell Checker models
To train a new model, all we need is a corpus of text data. The bigger the corpus, the more robust the model will be. The training process is slightly different for each annotator, and for training we use the Approach version of the annotators instead of the Model version.
from sparknlp.annotator import (
    ContextSpellCheckerApproach,
    NorvigSweetingApproach,
    SymmetricDeleteApproach,
)
NorvigSweeting and SymmetricDelete training
To train the NorvigSweetingApproach or SymmetricDeleteApproach annotators, we need to transform the corpus into a list of words, one per line, as in the following dictionary.txt:
...
gummy
gummic
gummier
gummiest
gummiferous
...
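One simple way to produce such a file is to extract the unique words from a raw text corpus. The following is only a rough sketch; the corpus.txt file name and the tokenization regex are illustrative assumptions:

# Build a simple word-per-line dictionary from a plain-text corpus.
import re

with open("corpus.txt", encoding="utf-8") as f:
    text = f.read().lower()

# Keep only alphabetic tokens and deduplicate them
words = sorted(set(re.findall(r"[a-z]+", text)))

with open("dictionary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))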
Then we can train the following pipeline:
documentAssembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

spellChecker = (
    NorvigSweetingApproach()  # Or SymmetricDeleteApproach()
    .setInputCols(["token"])
    .setOutputCol("spell")
    .setDictionary("dictionary.txt")
)

pipeline = Pipeline().setStages([documentAssembler, tokenizer, spellChecker])
As these models are trained solely from the dictionary, all we need to do is fit the pipeline on any data frame (or an empty one, as above):
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_df)
All the model parameters are fit using the provided dictionary.
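To check the result, we can apply the fitted pipeline to a small sample data frame and compare the original tokens with the corrections; the misspelled sample sentence below is just an illustration:

# Apply the dictionary-trained pipeline to a sample sentence
# and compare the tokens with their corrections.
sample_df = spark.createDataFrame([["the gummyest bear"]]).toDF("text")

result = model.transform(sample_df)
result.selectExpr("token.result as token", "spell.result as spell").show(truncate=False)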
Training a ContextSpellChecker model
For the ContextSpellChecker annotator, we use the corpus itself (not a list of words, but the full texts) to fit the model. We first instantiate the annotator with standard values for its parameters.
spellChecker = (
    ContextSpellCheckerApproach()
    .setInputCols("token")
    .setOutputCol("checked")
    .setBatchSize(8)
    .setEpochs(1)
    .setWordMaxDistance(3)          # Maximum edit distance to consider
    .setMaxWindowLen(3)             # Window size used to find context
    .setMinCount(3.0)               # Remove words that appear less frequently than this
    .setLanguageModelClasses(1650)  # Number of classes for which a TF graph is available
)
The setLanguageModelClasses parameter requires a little more explanation. To use Spark NLP pretrained models, we don't need to think about the deep learning engine that runs the inference, because the library wraps the models within the corresponding annotators. When training custom models, however, we may need to create new TensorFlow graphs to store the operations. If the provided graphs (which contain 1650 classes) are not compatible with your model, you can create a TensorFlow graph by following this notebook. After creating the graph, set the path to the folder where it is stored with the parameter .setGraphFolder("path/to/graph/folder") and adjust the number of classes accordingly.
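For instance, assuming you have generated a custom graph in a local folder, the existing annotator can be pointed to it like this (the folder path below is a placeholder):

# Point the annotator to the folder containing the custom TensorFlow graph.
spellChecker.setGraphFolder("path/to/graph/folder")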
After defining the parameters, we create the pipeline and fit it to the corpus.
pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellChecker])
Let's use a sample corpus from Arthur Conan Doyle's first Sherlock Holmes book as example data.
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/holmes.txt
corpus = spark.read.text("holmes.txt").toDF("text")
corpus.show(truncate=100)
+----------------------------------------------------------------------------------------------------+
|                                                                                                  text|
+----------------------------------------------------------------------------------------------------+
|THE ADVENTURES OF SHERLOCK HOLMESArthur Conan Doyle Table of contents A Scandal in Bohemia The Re...|
+----------------------------------------------------------------------------------------------------+
Then, all we need to do is:
model = pipeline.fit(corpus)
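Once trained, the fitted pipeline can be persisted and reloaded with the standard Spark ML serialization API, so training does not need to be repeated; the path below is a placeholder:

# Save the fitted pipeline to disk and load it back later.
from pyspark.ml import PipelineModel

model.write().overwrite().save("sherlock_spell_pipeline")
reloaded_model = PipelineModel.load("sherlock_spell_pipeline")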
Fast Inference with LightPipelines
LightPipeline is a Spark NLP pipeline class that can be used to make fast inference on Python base strings (or lists of strings) in small volumes. Usually, for fewer than fifty thousand sentences, it optimizes inference speed and is the common choice when serving Spark NLP solutions in API calls. This pipeline computes everything locally, but in parallel.
To create a LightPipeline, we can use the fitted pipeline directly:
from sparknlp.base import LightPipeline

lp = LightPipeline(model)
Then, we can use it to make predictions directly, either with the .annotate() method to obtain only the result item of the annotations or with the .fullAnnotate() method to retrieve the full annotation objects. For example, we can use the model we trained on the Sherlock Holmes domain.
The output of light pipelines is either a dictionary (if a single string is used) or a list of dictionaries (if a list of strings is used or if the .fullAnnotate() method was used). The keys of the dictionary are the names of the annotation columns (the output columns in the pipeline), and the values are the results/annotations.
# Return is a dictionary
res = lp.annotate("Sherlok Hlmes founds the solution to the mistrey")

for token, checked in zip(res["token"], res["checked"]):
    print(f"{token} => {checked}")
Sherlok => Sherlock
Hlmes => Holmes
founds => found
the => the
solution => solution
to => to
the => the
mistrey => mystery
The model learned something about the language of the book and was able to fix some of the errors.
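If we need more than the corrected strings, the .fullAnnotate() method returns Annotation objects that also carry the character offsets and other metadata of each token. A minimal sketch:

# .fullAnnotate() returns a list with one dictionary per input text;
# each value is a list of Annotation objects with offsets and metadata.
full = lp.fullAnnotate("Sherlok Hlmes founds the solution to the mistrey")[0]

for annotation in full["checked"]:
    print(annotation.begin, annotation.end, annotation.result)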
Conclusion
In this post, we introduced three models to perform spell checking and correction. The Peter Norvig and SymSpell algorithms are fast implementations that modify the misspelled word to find candidate words in a dictionary, while the deep learning ContextSpellChecker also takes the surrounding words into account as context and corrects words using a Viterbi decoder.
These annotators can be easily added to Spark NLP pipelines to make corrections at scale in the Spark ecosystem, allowing texts to be processed efficiently in a big data environment.
We also introduced how to use Spark NLP pipelines to train custom spell-checking models with any of the three implementations.
References
- NorvigSweeting documentation page
- SymmetricDelete documentation page
- ContextSpellChecker documentation page
- Applying Context Aware Spell Checking in Spark NLP
- Training a Contextual Spell Checker for Italian Language