Watch the webinar

Visual Document Understanding with Multi-Modal Image & Text Mining in Spark OCR 3

The Transformer architecture has truly changed the way we analyze text in NLP. NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. When we use only an NLP approach for document understanding, we lose layout and style information, which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and the visual layout (via computer vision).

Spark OCR now provides multi-modal visual document understanding based on the LayoutLM architecture. It achieves new state-of-the-art accuracy on several downstream tasks, including form understanding (from 70.7 to 79.3), receipt understanding (from 94.0 to 95.2), and document image classification (from 93.1 to 94.4), making it applicable in demanding fields such as healthcare and fintech NLP.
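As a rough sketch of what this looks like in practice, the snippet below assembles a Spark OCR pipeline that converts scanned files to images, runs OCR to recover words together with their bounding boxes, and feeds both to a LayoutLM-based document classifier. The transformer and model names used here (`BinaryToImage`, `ImageToHocr`, `VisualDocumentClassifier`, and the pretrained model identifier) are illustrative assumptions, not verbatim Spark OCR API; check the Spark OCR documentation for the exact names in your version.

```python
# Sketch only: names below are assumptions for illustration,
# not guaranteed Spark OCR 3 API.
from pyspark.ml import Pipeline
from sparkocr.transformers import BinaryToImage, ImageToHocr, VisualDocumentClassifier

# Convert raw PDF/image bytes into page images Spark OCR can process.
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# OCR each page into hOCR, which keeps every recognized word
# together with its 2-D position on the page.
image_to_hocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr")

# LayoutLM-based classifier that jointly uses the OCR'd text and its
# layout coordinates (the pretrained model name is hypothetical).
doc_classifier = VisualDocumentClassifier \
    .pretrained("visual_document_classifier_layoutlm", "en") \
    .setInputCol("hocr") \
    .setOutputCol("label")

pipeline = Pipeline(stages=[binary_to_image, image_to_hocr, doc_classifier])

# Read scanned documents as binary files and classify each one.
scanned_docs = spark.read.format("binaryFile").load("path/to/scans/*.pdf")
result = pipeline.fit(scanned_docs).transform(scanned_docs)
result.select("path", "label").show(truncate=False)
```

The key design point is that the classifier's input is hOCR rather than plain text: by preserving bounding boxes through the pipeline, the LayoutLM model can exploit where a word appears on the page, not just what it says.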

About the speaker

Mykola Melnyk
Senior Scala, Python, and Spark software engineer

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.