This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.
Added to the ImageTextDetectorV2
- New parameter ‘mergeIntersects’: merge bounding boxes corresponding to detected text regions, when multiple bounding boxes that belong to the same text line overlap.
- New parameter ‘forceProcessing’: now you can force processing of the results to avoid repeating the computation of results in pipelines where the same results are consumed by different transformers.
- New feature: sizeThreshold parameter sets the expected size for the recognized text. From now on, text size will be automatically detected when sizeThreshold is set to -1.
Added to the ImageToTextV2
- New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
- New support for formatted output, and HOCR.
ocr.setOutputFormat(OcrOutputFormat.HOCR)
ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)
Support for Spark 3.2
- We added support for the latest Spark version, check the installation instructions below. Improved documentation on the website.
New Models
- ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
- ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
- ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2
New notebooks
+ SparkOcrImageToTextV2OutputFormats.ipynb, different output formats for ImageToTextV2
Get & Install it here
Try OCR tool for healthcare
See in action