The AI industry’s most widely used NLP library for Python and Java delivers a major new release of its software & models – improving accuracy and scale for named entity recognition, optical character recognition, text matching, and entity resolution.
We are very pleased to announce the immediate availability of Spark NLP 2.4. This is the library’s biggest release ever, with major accuracy & scalability improvements across the open source, enterprise, and healthcare editions.
The changes include improvements to the core architecture of the library, retraining of all pre-trained models from scratch, and a suite of new pre-trained models & deep-learning networks that leverage new academic research results from 2019. In most cases, this release is the first production-grade, scalable, and trainable implementation of this new research made available to the AI community.
Named entity recognition: Spark NLP 2.4 still makes half as many mistakes as spaCy 2.2
Named entity recognition (NER) for entities such as people, places, drugs, genes, and others from free text is one of the most widely used NLP tasks. Deep language models such as BERT, ELMo, and others have improved the achievable accuracy on NER over the past two years – and Spark NLP 2.4 now comes with several out-of-the-box pipelines that make the most of these innovations:
- Out-of-the-box Spark NLP models deliver an F1 score of 95.9% using BERT-large on the standard en_core_web benchmark – versus 88.3% delivered by the spaCy 2.2 BERT model.
- This means that Spark NLP models will make between one third and one half of the mistakes that the spaCy model is expected to make.
- Spark NLP includes five pre-trained NER models – enabling users to trade off accuracy for speed or memory. The least accurate Spark NLP model is still more accurate than all spaCy models, including the largest one.
- In addition, Spark NLP NER models are trainable – so that users can train & tune even more accurate models for their own domain-specific applications.
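The "one third to one half" figure follows directly from the F1 scores quoted above, treating 1 − F1 as a rough proxy for the error rate:

```python
# Illustrative arithmetic only, based on the F1 scores quoted above:
# 95.9% for the Spark NLP BERT-large model vs. 88.3% for spaCy 2.2.
sparknlp_f1 = 0.959
spacy_f1 = 0.883

sparknlp_error = 1 - sparknlp_f1   # 0.041
spacy_error = 1 - spacy_f1         # 0.117

ratio = sparknlp_error / spacy_error
print(f"Spark NLP makes about {ratio:.2f}x the mistakes of spaCy")
# → roughly 0.35, i.e. about one third of the mistakes
```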
Optical Character Recognition (OCR): Automated image enhancement & scalable pipelines
Spark OCR is now a separate library from Spark NLP – enabling users to configure optical character recognition pipelines that improve accuracy for specific document types.
Spark OCR is now used in production in various large-scale, high-compliance use cases to read clinical records, faxes, invoices, books, and other document types. This new release has enabled customers to reach and surpass the accuracy previously achieved by OCR industry leaders such as ABBYY, AWS, and Google Cloud – by implementing image processing algorithms, automating their selection and use, and enabling users to tune OCR pipelines for domain-specific document types.
Spark OCR is unique in its ability to scale OCR processing on any Spark cluster, unify image processing with downstream information extraction from text (using NLP techniques), and run on a customer’s infrastructure without sharing or sending documents to a cloud provider.
Context-Based Text Matching: Accurately extract facts from large documents
A common NLP use case is extracting structured data from large documents. Financial statements, medical records, and legal documents can often be hundreds of pages long. In such cases, finding a specific fact – like a date, a monetary value, or a name – can be challenging since a document can include hundreds of such values to choose from.
Spark NLP 2.4 includes a context-based text matcher which enables users to specify the context inside a document in which a match should be searched. The algorithm first finds the relevant context and then performs a deeper search for the requested fact within it.
Clinical entity resolution: Accurately map entities to large, hierarchical ontologies
Spark NLP for Healthcare already had the ability to map clinical entities to medical terminologies – such as drugs to RxNorm codes, procedures to ICD-10-PCS or CPT codes, and others. This release brings new pre-trained models with better accuracy:
- All entity resolution models have been re-trained from scratch for improved accuracy based on newer algorithms & deep learning networks
- Mapping clinical terms within a specific category (like cancer staging or body part) can now be tuned to be more accurate for specialty-specific use cases
- Models for larger terminologies, like SNOMED-CT, are now faster and require a smaller memory footprint to run
- All models have been re-trained to reflect the most recent medical terminologies
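To make the task concrete: entity resolution maps a free-text mention to a canonical code in a terminology. The toy sketch below uses simple string similarity over a made-up three-entry terminology (all codes and terms are illustrative, in the spirit of RxNorm) – real ontologies like SNOMED-CT contain hundreds of thousands of concepts, which is why Spark NLP uses trained deep-learning models rather than string matching.

```python
from difflib import SequenceMatcher

# Toy terminology: made-up code/term pairs for illustration only.
TERMINOLOGY = {
    "197361": "amlodipine 5 mg oral tablet",
    "198440": "acetaminophen 500 mg oral tablet",
    "312961": "simvastatin 20 mg oral tablet",
}

def resolve(mention):
    """Map a free-text drug mention to the closest terminology entry."""
    def score(term):
        return SequenceMatcher(None, mention.lower(), term.lower()).ratio()
    code, term = max(TERMINOLOGY.items(), key=lambda kv: score(kv[1]))
    return code, term

print(resolve("amlodipine 5mg tablet"))
# → ('197361', 'amlodipine 5 mg oral tablet')
```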
More new functionality
The Spark NLP 2.4 Release Notes list the entire set of new features, upgrades, and bug fixes within this major release. Major new features include:
- Document classification – supporting this common NLP task directly within Spark NLP
- Shared in-memory storage – efficient loading & reusing of large models & embeddings
- Recursive pipelines – enabling better support for multi-lingual & hierarchical pipelines
- Lazy annotators – enabling a small memory footprint when running very large pipelines
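The lazy-annotators bullet describes a standard lazy-initialization pattern: heavy resources are loaded only when an annotator is first used, so building a very large pipeline stays cheap. A generic sketch of that pattern (not the Spark NLP API; all names are illustrative):

```python
class LazyAnnotator:
    """Generic lazy-initialization sketch: the expensive model load
    is deferred until the annotator is first used."""

    load_count = 0  # counts loads, to show the model is loaded only once

    def __init__(self, name):
        self.name = name
        self._model = None

    def _load_model(self):
        # Stand-in for loading a large model into memory.
        LazyAnnotator.load_count += 1
        return f"<model weights for {self.name}>"

    def annotate(self, text):
        if self._model is None:          # load on first use only
            self._model = self._load_model()
        return f"{self.name}({text})"

ner = LazyAnnotator("ner")
assert LazyAnnotator.load_count == 0     # nothing loaded yet
ner.annotate("John lives in Boston")
ner.annotate("Another document")
assert LazyAnnotator.load_count == 1     # loaded exactly once
```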
“This release continues our years-long commitment to provide our customers and the AI community the world’s most accurate, fast, and scalable NLP library”, said Saif Addin-Ellafi, lead Spark NLP developer at John Snow Labs.