Support for OCR Pipelines in NLP Lab 5.7

10.01.2024

Dia Trambitas, Ph.D.

Ph.D. in Computer Science – Head of Product

NLP Lab 5.7 marks a significant update in the management of images containing text and scanned PDF documents with enhanced support for Visual OCR Pipelines. The Visual OCR Pipelines significantly enhance PDF and image document handling and ensure accurate, consistent, and precise text extraction. This improves the OCR results on imported PDF and/or image tasks for NER and text Classification projects.

This new release demonstrates our continuous commitment to delivering advanced tools for text analysis and document processing tailored to address our users’ evolving needs. Detailed descriptions of this new feature are provided below.

Supported Pipelines and Usage Instruction

NLP Lab version 5.7 introduces a set of seven Visual OCR Pipelines, each tailored for specific purposes, providing users with versatile options to address diverse document processing needs:

mixed_scanned_digital_pdf
mixed_scanned_digital_pdf_image_cleaner
mixed_scanned_digital_pdf_skew_correction
image_printed_transformer_extraction
pdf_printed_transformer_extraction
image_handwritten_transformer_extraction
pdf_handwritten_transformer_extraction

These pipelines offer a comprehensive set of tools, each optimized for specific scenarios, providing flexibility and precision in processing various document types in OCR projects within the NLP Lab.

User Guide

To enable OCR functionality and perform OCR tasks, follow these steps:

Go to the Task Import Page.
Activate the “OCR Document” Checkbox and deploy an OCR server.
Select OCR Pipeline.
Once the OCR server is ready, choose the desired OCR pipeline from the dropdown menu. This allows you to configure the processing settings for your OCR task.
Import your OCR files with support for both individual files and multiple files in a zipped format.

Note: Pipelines are automatically downloaded when selected from the import page, even if they haven’t been previously downloaded from the Models Hub. These steps streamline the process of enabling OCR, deploying a server, selecting a pipeline, and importing OCR files, facilitating efficient OCR document handling within the given context.

Current Limitations and Future Optimizations

Currently, task importing via dedicated OCR pipelines may take longer than using the default OCR import option. This aspect is earmarked for optimization in the subsequent releases.

These enhancements aim to improve the user experience in NLP Lab by making it more accessible and powerful. While implementing these improvements, the familiar user interface and core functionalities are retained to ensure a seamless transition for users.

Multi-page Visual NER improvements

Users can now effortlessly locate relevant pages of a multipage document task, simplifying the process of finding relevant information. This improvement allows annotators to swiftly navigate to the next relevant page for the visual Named Entity Recognition (NER) task, eliminating the need to visit each page manually.

Note: when using a section-based configuration project, if a section rule matches and, therefore, a section is identified as a positive match for a specific page of a multipage document or if a user manually assigns a specific page as a relevant section, that page will be designated/identified by the application as a relevant page and visually marked as such. All other pages are considered irrelevant.

Getting Started is Easy

The NLP Lab is a free text annotation tool that can be deployed in a couple of clicks on the AWS, Azure, or OCI Marketplaces or installed on-premise with a one-line Kubernetes script.
Get started here: https://nlp.johnsnowlabs.com/docs/en/alab/install

Start your journey with NLP Lab and experience the future of data analysis and model training today!

Try Generative AI Lab - the tool with Annotation Tool

See in action

Dia Trambitas, Ph.D.

Ph.D. in Computer Science – Head of Product

Our additional expert:

Dia Trambitas is an AI Product Manager with deep expertise in Natural Language Processing and applied Generative AI. At John Snow Labs, Dia has led the development of the Generative AI Lab — a no-code platform for data annotation and model training — as well as the Medical Chatbot, a secure and domain-specific conversational AI assistant tailored for clinical environments. With a strong focus on practical deployments of cutting-edge AI, she has worked at the intersection of healthcare and technology, driving product innovation that empowers users to harness large language models safely and effectively. Passionate about transforming unstructured data into actionable insights, Dia brings a strategic and user-centered approach to building AI tools that are both powerful and accessible.

LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models

Ali Tarik Mirik

Robustness in Models Recent developments in machine learning have revolutionized the field of Natural Language Processing in biomedical domains, healthcare NLP, and...