OCR Service
Overview
The OCR service is a tool designed for extracting text content from PDF files. It utilizes Tesseract, Leptonica, PDFTron, PDFBox, and Ghostscript to perform various tasks, including removing invisible elements and watermarks, extracting images, stitching striped images, binarizing images, running OCR on the processed images, and writing the recognized text back to the original PDF. This service is particularly useful for obtaining machine-readable text from PDF documents.
Dependencies
Tesseract
Leptonica
PDFTron
PDFBox
Ghostscript
Functionality
1. Invisible Element and Watermark Removal
   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
2. Image Extraction
   Extracts all images from the PDF using PDFBox.
3. Striped Image Detection and Stitching
   Detects if images are striped and stitches them together using Ghostscript.
4. Image Processing
   - Convert to grayscale
   - Upscale to target DPI
   - Filter using Gauss kernel
   - Binarizes the resulting images using Leptonica and the Otsu thresholding algorithm.
   - Despeckle using various morphological operations
5. OCR Processing
   Runs Tesseract on the images to extract text.
6. Font Style Detection
   Detection of bold text using stroke width estimation.
7. Text Integration
   Draws the resulting text onto the original PDF using PDFBox.
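To make the binarization part of the image processing step more concrete, here is a minimal pure-Java sketch of global Otsu thresholding on 8-bit grayscale values. It is an illustration only: the service delegates this work to Leptonica, and the class and method names below (`OtsuSketch`, `otsuThreshold`, `binarize`) are hypothetical and do not appear in the code base.

```java
import java.util.Arrays;

// Hypothetical illustration; the real service uses Leptonica for this step.
final class OtsuSketch {

    /** Returns the threshold t (0..255) that maximizes the between-class variance. */
    static int otsuThreshold(int[] gray) {
        int[] hist = new int[256];
        for (int v : gray) {
            hist[v & 0xFF]++;
        }
        long total = gray.length;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long weightDark = 0;   // number of pixels with value <= t
        long sumDark = 0;      // sum of those pixel values
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            weightDark += hist[t];
            sumDark += (long) t * hist[t];
            if (weightDark == 0) {
                continue;      // no dark pixels yet
            }
            long weightLight = total - weightDark;
            if (weightLight == 0) {
                break;         // all pixels are already in the dark class
            }
            double meanDark = (double) sumDark / weightDark;
            double meanLight = (double) (sumAll - sumDark) / weightLight;
            double diff = meanDark - meanLight;
            double variance = (double) weightDark * weightLight * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Maps pixels <= threshold to 0 (black) and all others to 255 (white). */
    static int[] binarize(int[] gray, int threshold) {
        int[] out = new int[gray.length];
        for (int i = 0; i < gray.length; i++) {
            out[i] = (gray[i] & 0xFF) <= threshold ? 0 : 255;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pixels = {10, 10, 12, 11, 200, 205, 210, 199};
        int t = otsuThreshold(pixels);
        System.out.println("threshold = " + t);
        System.out.println(Arrays.toString(binarize(pixels, t)));
    }
}
```

The sketch picks one threshold for the whole image. Whether the service applies Otsu globally or per tile is not stated here, so treat that as an open detail.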
Steps 2 to 5 run in parallel and communicate via a blocking queue to limit RAM usage. Choosing the thread counts carefully therefore gives the best performance. For example, with 18 available cores, I achieved the highest performance with 2 image extraction threads, 2 Ghostscript processes, and 16 OCR threads.
Setting all thread counts to a practically unlimited value (1000+) gives comparable performance without laborious tuning, but potentially at the cost of much more RAM.
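The effect of the blocking queue can be sketched with a small producer/consumer example. This is a simplified stand-in, not the service's implementation: `PipelineSketch`, `run`, and the string placeholders for images and OCR results are all hypothetical, and only two stages are shown instead of the full chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical two-stage pipeline: one "extraction" producer, several "OCR" consumers.
final class PipelineSketch {

    // Unique sentinel instance telling a consumer to stop (compared by identity).
    private static final String POISON = new String("<end>");

    static List<String> run(List<String> images, int ocrThreads, int queueCapacity) {
        // The bounded queue is what limits memory: put() blocks while it is full.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(ocrThreads + 1);

        // Producer: stands in for image extraction.
        pool.execute(() -> {
            try {
                for (String image : images) {
                    queue.put(image);
                }
                for (int i = 0; i < ocrThreads; i++) {
                    queue.put(POISON); // one stop signal per consumer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: stand in for OCR threads.
        for (int i = 0; i < ocrThreads; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String image = queue.take(); // blocks while the queue is empty
                        if (image == POISON) {
                            return;
                        }
                        results.add("text(" + image + ")");
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("pipeline did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        List<String> out = run(Arrays.asList("img-1", "img-2", "img-3", "img-4"), 2, 2);
        System.out.println(out.size() + " images processed");
    }
}
```

With a small `queueCapacity`, a fast producer is throttled to the pace of the consumers, so only a few images are held in memory at once. A very large capacity or thread count removes that back-pressure, which matches the RAM trade-off described above.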
Installation
To run the OCR service, ensure that the following dependencies are installed:
- Ghostscript: Install using apt.
sudo apt install ghostscript
- Tesseract and Leptonica: Install using vcpkg with the commands below, and set the environment variable VCPKG_DYNAMIC_LIB to your vcpkg lib folder (e.g. ~/vcpkg/installed/x64-linux-dynamic/lib).
vcpkg install tesseract --triplet x64-linux-dynamic
vcpkg install leptonica --triplet x64-linux-dynamic
- Other dependencies are handled by the Gradle build:
gradle build
Configuration
Configuration settings are available in the OcrServiceSettings class.
These settings can be overridden using environment variables, e.g.
OCR_SERVICE_OCR_THREAD_COUNT=16
Possible configurations and their defaults include:
int ocrThreadCount = 4; // Number of OCR threads
int imageExtractThreadCount = 4; // Number of image extraction threads
int gsProcessCount = 4; // Number of Ghostscript processes
int dpi = 300; // Target DPI for binarized images
int psmOverride = -1; // Overrides the page segmentation mode if > 0
int minImageHeight = 20; // Minimum height for images to be processed
int minImageWidth = 20; // Minimum width for images to be processed
boolean debug = false; // If true, overlays OCR images with a grid and draws word bounding boxes
boolean removeWatermark; // If false, watermarks will not be removed
String languages = "deu+eng"; // Languages loaded into Tesseract, given as 3-char codes joined by '+'; additional languages must also be installed in the Docker environment
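The example `OCR_SERVICE_OCR_THREAD_COUNT` suggests that an environment variable name is the prefix `OCR_SERVICE_` followed by the field name in upper snake case. The sketch below shows that mapping and a fallback to the default value. It is an assumption about the naming scheme, confirmed only for `ocrThreadCount`; how the service actually binds these values is not described here, and `SettingsOverrideSketch`, `envName`, and `intSetting` are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the real binding is presumably done by the framework.
final class SettingsOverrideSketch {

    /** Converts a camelCase field name to the assumed variable name, e.g. ocrThreadCount -> OCR_SERVICE_OCR_THREAD_COUNT. */
    static String envName(String fieldName) {
        String snake = fieldName
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
                .toUpperCase(Locale.ROOT);
        return "OCR_SERVICE_" + snake;
    }

    /** Returns the override from env if present and a valid integer, otherwise the default. */
    static int intSetting(Map<String, String> env, String fieldName, int defaultValue) {
        String raw = env.get(envName(fieldName));
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue; // ignore malformed values in this sketch
        }
    }

    public static void main(String[] args) {
        int threads = intSetting(System.getenv(), "ocrThreadCount", 4);
        System.out.println("ocrThreadCount = " + threads);
    }
}
```

Passing the environment in as a `Map` keeps the lookup easy to try out without changing the real process environment.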
Integration
The OCR service communicates via RabbitMQ and uses the queues ocrQueue, ocrDLQ, and ocr_status_update_response_queue.
ocrQueue
This queue is used to start the OCR process. A DocumentRequest must be passed as the message. The service will then download the PDF from the provided cloud storage.
ocr_status_update_response_queue
This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total number of images may change if fewer images are found than initially assumed. This queue is also used to signal the end of processing.
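A consumer of these updates has to cope with a total that can shrink while processing is under way. The following sketch shows one way to track that on the receiving side. The message layout is not specified here, so the parameters `imagesProcessed` and `reportedTotal`, the finish signal, and the class `OcrProgressSketch` are all hypothetical.

```java
// Hypothetical client-side tracker; the real status message fields are not documented here.
final class OcrProgressSketch {
    private int processed;
    private int total;
    private boolean finished;

    OcrProgressSketch(int initialTotal) {
        this.total = initialTotal;
    }

    /** Applies one per-image update; the reported total may be lower than in earlier updates. */
    void onUpdate(int imagesProcessed, int reportedTotal) {
        this.processed = imagesProcessed;
        this.total = reportedTotal;
    }

    /** Marks the end of processing, as signalled on the same queue. */
    void onFinished() {
        this.finished = true;
    }

    boolean isFinished() {
        return finished;
    }

    /** Progress in the range 0.0 to 1.0, always computed against the latest total. */
    double fraction() {
        if (finished) {
            return 1.0;
        }
        if (total <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) processed / total);
    }

    public static void main(String[] args) {
        OcrProgressSketch progress = new OcrProgressSketch(10);
        progress.onUpdate(3, 10);
        progress.onUpdate(4, 8); // fewer images found than initially assumed
        System.out.println("progress = " + progress.fraction());
    }
}
```

Always using the most recently reported total avoids a progress value that stalls below 100% when the image count is corrected downwards.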
ocrDLQ
This queue is used to signal that an error has occurred during processing.