OCR Service
Overview
The OCR service is a tool designed for extracting text content from PDF files. It utilizes the Azure IDP endpoint for the extraction.
Dependencies
Leptonica
Ghostscript
PDFTron
PDFBox
Functionality
1. Invisible Element and Watermark Removal
   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
2. Image Extraction
   Extracts all images from the PDF using PDFBox.
3. Image Processing
   Renders all pages with images using Ghostscript and processes them using Leptonica.
4. OCR Processing
   Calls the Azure API in batches and receives the bounding box and content of each text element.
5. Font Style Detection
   Detects bold text using stroke width estimation.
6. Text Integration
   Draws the resulting text onto the original PDF using PDFTron.
Steps 2-5 are run in parallel.
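To make the idea behind the font style detection step more concrete, here is a minimal sketch of one simple stroke width estimator. It is not the service's implementation (which relies on Ghostscript and Leptonica). The class `StrokeWidth`, its methods, and the threshold value are hypothetical and chosen only for illustration. The sketch takes the median length of horizontal foreground runs in a binarized text image and compares it to the image height.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; names and threshold are assumptions.
public class StrokeWidth {

    // Median length of horizontal runs of foreground (true) pixels.
    static int medianHorizontalRun(boolean[][] bitmap) {
        List<Integer> runs = new ArrayList<>();
        for (boolean[] row : bitmap) {
            int run = 0;
            for (boolean pixel : row) {
                if (pixel) {
                    run++;
                } else if (run > 0) {
                    runs.add(run);
                    run = 0;
                }
            }
            if (run > 0) {
                runs.add(run); // run that touches the right edge
            }
        }
        if (runs.isEmpty()) {
            return 0;
        }
        Collections.sort(runs);
        return runs.get(runs.size() / 2);
    }

    // Flags text as bold when the stroke width relative to the image height
    // exceeds the given threshold (an arbitrary, untuned value in this sketch).
    static boolean looksBold(boolean[][] bitmap, double threshold) {
        if (bitmap.length == 0) {
            return false;
        }
        return (double) medianHorizontalRun(bitmap) / bitmap.length > threshold;
    }

    // Builds a test bitmap: a vertical bar of the given width on an empty background.
    static boolean[][] verticalBar(int height, int width, int barWidth) {
        boolean[][] bitmap = new boolean[height][width];
        for (boolean[] row : bitmap) {
            for (int x = 1; x <= barWidth && x < width; x++) {
                row[x] = true;
            }
        }
        return bitmap;
    }

    public static void main(String[] args) {
        System.out.println(looksBold(verticalBar(10, 8, 1), 0.2)); // false
        System.out.println(looksBold(verticalBar(10, 8, 3), 0.2)); // true
    }
}
```

A real detector would likely need to normalize for font size and resolution and to tune the threshold on sample documents.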
Installation
To run the OCR service, install the following native dependencies, then build with Gradle:
- Ghostscript:
  Install using apt.
    sudo apt install ghostscript
- Leptonica:
  Install using vcpkg with the command below, then set the environment variable VCPKG_DYNAMIC_LIB to your vcpkg lib folder (e.g. ~/vcpkg/installed/x64-linux-dynamic/lib).
    vcpkg install leptonica --triplet x64-linux-dynamic
- Other dependencies are handled by the Gradle build:
    gradle build
The Azure endpoint and key and the PDFTron license must be set using environment variables (PDFTRON_LICENSE, AZURE_KEY, AZURE_ENDPOINT).
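A quick way to catch a missing variable before starting the service is a small pre-flight check. The sketch below is not part of the service; the class `RequiredEnv` and its method are hypothetical. Only the three variable names come from this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check; not part of the OCR service itself.
public class RequiredEnv {

    // Variable names as listed in this README.
    static final List<String> REQUIRED =
            List.of("PDFTRON_LICENSE", "AZURE_KEY", "AZURE_ENDPOINT");

    // Returns the required variables that are absent or blank in the given environment.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> absent = missing(System.getenv());
        if (!absent.isEmpty()) {
            System.err.println("Missing environment variables: " + String.join(", ", absent));
            System.exit(1);
        }
        System.out.println("All required environment variables are set.");
    }
}
```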
Configuration
Configuration settings are available in the OcrServiceSettings class.
These settings can be overridden using environment variables, e.g.:
OCR_SERVICE_OCR_THREAD_COUNT=16
Possible configurations and their defaults include:
// Limits the number of concurrent calls to the Azure API. In my very rudimentary testing, Azure starts returning "too many requests" errors at around 80/s. Higher values greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // Writes the OCR layer visibly to the viewer doc PDF.
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, and key-value detection.
boolean drawTablesAsLines; // Writes the tables to the PDF as invisible lines.
boolean processAllPages; // If set, OCR is performed on every page, regardless of whether it contains images.
boolean fontStyleDetection; // Enables bold detection using Ghostscript and Leptonica.
String contentFormat; // Either markdown or text. For whatever reason, with markdown enabled, Azure does not write key-values.
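How `concurrency` and `batchSize` might interact can be sketched as follows. This is an illustration under assumptions, not the service's code. The class `BatchedCalls` and its methods are hypothetical, and the Azure call is replaced by a stand-in function. Pages are split into batches of at most `batchSize`, and a semaphore keeps at most `concurrency` calls in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical illustration of batching with a concurrency limit.
public class BatchedCalls {

    // Splits page numbers 1..pageCount into consecutive batches of at most batchSize pages.
    static List<List<Integer>> partition(int pageCount, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += batchSize) {
            List<Integer> batch = new ArrayList<>();
            for (int page = start; page < start + batchSize && page <= pageCount; page++) {
                batch.add(page);
            }
            batches.add(batch);
        }
        return batches;
    }

    // Runs `call` once per batch with at most `concurrency` calls in flight.
    // Results are returned in batch order.
    static <R> List<R> runAll(List<List<Integer>> batches, int concurrency,
                              Function<List<Integer>, R> call) throws Exception {
        Semaphore permits = new Semaphore(concurrency);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (List<Integer> batch : batches) {
                permits.acquire(); // blocks while `concurrency` calls are running
                futures.add(pool.submit(() -> {
                    try {
                        return call.apply(batch);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Integer>> batches = partition(300, 128);
        // Stand-in for the real Azure call: just report each batch's size.
        List<Integer> sizes = runAll(batches, 8, List::size);
        System.out.println(sizes); // [128, 128, 44]
    }
}
```

With the defaults listed above, a 300-page document would be sent as three calls, all of which fit within the limit of 8 concurrent calls.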
Integration
The OCR service communicates via RabbitMQ and uses the queues ocr_request_queue, ocr_response_queue, ocr_dead_letter_queue, and ocr_status_update_response_queue.
ocr_request_queue
This queue is used to start the OCR process; a DocumentRequest must be passed as the message. The service then downloads the PDF from the provided cloud storage.
ocr_response_queue
This queue is used to signal the end of processing.
ocr_dead_letter_queue
This queue is used to signal that an error has occurred during processing.
ocr_status_update_response_queue
This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total count may change when fewer images are found than initially assumed.
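A consumer of these status updates has to cope with a total that can shrink mid-run. The sketch below shows one way to track such progress. The class `OcrProgress` and its methods are hypothetical and do not reflect the actual message format, which is not described here.

```java
// Hypothetical progress tracker for status updates whose total may change.
public class OcrProgress {
    private int total;      // images expected; may shrink during processing
    private int processed;  // images finished so far

    public OcrProgress(int initialTotal) {
        if (initialTotal < 0) {
            throw new IllegalArgumentException("total must be >= 0");
        }
        this.total = initialTotal;
    }

    // Records one finished image.
    public synchronized void imageProcessed() {
        processed++;
        if (processed > total) {
            total = processed; // keep the invariant processed <= total
        }
    }

    // Applies a corrected total, e.g. when fewer images are found than first assumed.
    public synchronized void correctTotal(int newTotal) {
        total = Math.max(newTotal, processed);
    }

    public synchronized String summary() {
        return processed + "/" + total;
    }

    public static void main(String[] args) {
        OcrProgress progress = new OcrProgress(10);
        progress.imageProcessed();
        progress.imageProcessed();
        System.out.println(progress.summary()); // 2/10
        progress.correctTotal(7);
        System.out.println(progress.summary()); // 2/7
    }
}
```

Clamping the total to the processed count means a displayed percentage never exceeds 100%, even if a correction arrives late.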