2024-08-19 14:47:48 +02:00

OCR Service

Overview

The OCR service is a tool designed for extracting text content from PDF files. It utilizes the Azure IDP endpoint for the extraction.

Dependencies

Leptonica
Ghostscript
PDFTron
PDFBox

Functionality

  1. Invisible Element and Watermark Removal
    The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
  2. Image Extraction
    Extracts all images from the PDF using PDFBox.
  3. Image Processing
    Renders all pages with images using Ghostscript and processes them using Leptonica.
  4. OCR Processing
    Calls the Azure API in batches and receives the bounding box and content of each text element.
  5. Font Style Detection
    Detects bold text using stroke width estimation.
  6. Text Integration
    Draws the resulting text onto the original PDF using PDFTron.

Steps 2-5 are run in parallel.
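
To give an idea of what stroke width estimation in step 5 can look like, here is a minimal sketch. It is not the service's Leptonica-based implementation, which is not described here. It assumes a binarized glyph image and uses one simple estimator: for a long stroke, twice the ink area divided by the outline length approximates the stroke thickness. The names StrokeWidthSketch, estimateStrokeWidth and bar are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch, not the service's Leptonica-based code.
class StrokeWidthSketch {

    // Estimates the mean stroke width of a binary mask (true = ink).
    // For a stroke of width w and length L (L much larger than w), the area is w*L
    // and the outline is about 2*L, so 2*area/perimeter approximates w.
    static double estimateStrokeWidth(boolean[][] ink) {
        int area = 0;
        int perimeter = 0;
        int[][] neighbours = {{-1, 0}, {1, 0}, {0, -1}, {0, 1}};
        for (int y = 0; y < ink.length; y++) {
            for (int x = 0; x < ink[y].length; x++) {
                if (!ink[y][x]) {
                    continue;
                }
                area++;
                // Count every pixel edge that borders background or the image border.
                for (int[] n : neighbours) {
                    int ny = y + n[0];
                    int nx = x + n[1];
                    boolean outside = ny < 0 || ny >= ink.length || nx < 0 || nx >= ink[ny].length;
                    if (outside || !ink[ny][nx]) {
                        perimeter++;
                    }
                }
            }
        }
        return perimeter == 0 ? 0.0 : 2.0 * area / perimeter;
    }

    // Builds a solid horizontal bar, used here as a stand-in for a text stroke.
    static boolean[][] bar(int thickness, int length) {
        boolean[][] ink = new boolean[thickness][length];
        for (boolean[] row : ink) {
            Arrays.fill(row, true);
        }
        return ink;
    }

    public static void main(String[] args) {
        double regular = estimateStrokeWidth(bar(2, 40));
        double bold = estimateStrokeWidth(bar(4, 40));
        System.out.println("regular=" + regular + " bold=" + bold);
    }
}
```

A bold decision would then compare the estimate against a reference value for text of the same size. No threshold is shown because this document does not state one.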

Installation

To run the OCR service, install the following dependencies:

  1. Ghostscript:
    Install using apt.

    sudo apt install ghostscript
    
  2. Leptonica:
    Install using vcpkg with the command below, and set the environment variable VCPKG_DYNAMIC_LIB to your vcpkg lib folder (e.g. ~/vcpkg/installed/x64-linux-dynamic/lib).

    vcpkg install leptonica --triplet x64-linux-dynamic
    
  3. All other dependencies are handled by the Gradle build:

    gradle build
    

The Azure endpoint and key and the PDFTron license must be set using environment variables (PDFTRON_LICENSE, AZURE_KEY, AZURE_ENDPOINT).
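
A quick way to catch a missing variable before starting the service is a small pre-flight check. The sketch below is an illustration only and not part of the service. The class name RequiredEnvCheck and the method findMissing are hypothetical. Only the three variable names come from this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check, not part of the OCR service itself.
class RequiredEnvCheck {

    // Variable names as listed in this README.
    static final List<String> REQUIRED = List.of("PDFTRON_LICENSE", "AZURE_KEY", "AZURE_ENDPOINT");

    // Returns the required variables that are unset or blank in the given environment.
    static List<String> findMissing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = findMissing(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("All required environment variables are set.");
        } else {
            System.out.println("Missing environment variables: " + String.join(", ", missing));
        }
    }
}
```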

Configuration

Configuration settings are available in the OcrServiceSettings class.
These settings can be overridden using environment variables, for example:
OCR_SERVICE_OCR_THREAD_COUNT=16

Possible configurations and their defaults include:

// Limits the number of concurrent calls to the Azure API. In my very rudimentary testing, Azure starts throwing "too many requests" errors at around 80/s. Higher values greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // Writes the OCR layer visibly to the viewer document PDF.
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, and key-value detection.
boolean drawTablesAsLines; // Writes the tables to the PDF as invisible lines.
boolean processAllPages; // If set, OCR is performed on every page, regardless of whether it contains images.
boolean fontStyleDetection; // Enables bold detection using Ghostscript and Leptonica.
String contentFormat; // Either markdown or text. For whatever reason, Azure does not write key-values when markdown is enabled.
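
To show how concurrency and batchSize interact, here is a minimal sketch under stated assumptions. It is not the service's code. It splits the pages into batches of at most batchSize and submits one placeholder task per batch to a fixed thread pool, so that no more than concurrency calls run at once. The names BatchingSketch, toBatches and processAll are hypothetical, and the real Azure call is replaced by a stub that returns the batch size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of batching plus a concurrency limit.
class BatchingSketch {

    // Splits page numbers 1..pageCount into consecutive batches of at most batchSize pages.
    static List<List<Integer>> toBatches(int pageCount, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += batchSize) {
            List<Integer> batch = new ArrayList<>();
            for (int page = start; page < start + batchSize && page <= pageCount; page++) {
                batch.add(page);
            }
            batches.add(batch);
        }
        return batches;
    }

    // Runs one task per batch on a fixed pool, so at most `concurrency` tasks run at once.
    // Results are returned in batch order.
    static List<Integer> processAll(List<List<Integer>> batches, int concurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<Integer> batch : batches) {
                tasks.add(() -> batch.size()); // stub standing in for the real API call
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 300 pages with the default batchSize of 128 and concurrency of 8.
        System.out.println(processAll(toBatches(300, 128), 8)); // prints [128, 128, 44]
    }
}
```

With the defaults, a document needs more than 1024 pages (8 batches of 128) before the concurrency limit starts to matter for a single file.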

Integration

The OCR service communicates via RabbitMQ and uses the queues ocr_request_queue, ocr_response_queue, ocr_dead_letter_queue, and ocr_status_update_response_queue.

ocr_request_queue

This queue is used to start the OCR process. A DocumentRequest must be passed as the message. The service will then download the PDF from the provided cloud storage.

ocr_response_queue

This queue is used to signal the end of processing.

ocr_dead_letter_queue

This queue is used to signal that an error has occurred during processing.

ocr_status_update_response_queue

This queue is used by the OCR service to give updates about the progress of the ongoing OCR on an image-by-image basis. The total may change when fewer images are found than initially assumed.
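
A consumer of these status updates has to cope with a total that changes while processing is under way. The sketch below shows one way to track such progress. It is an assumption for illustration only: the actual message format on the queue is not described here, and the names OcrProgressSketch, imageProcessed and updateTotal are hypothetical.

```java
// Hypothetical progress tracker; the real status message format is not shown in this README.
class OcrProgressSketch {

    private int processed;
    private int total;

    OcrProgressSketch(int initialTotal) {
        this.total = initialTotal;
    }

    // Records one finished image and returns the current status.
    synchronized String imageProcessed() {
        processed++;
        return status();
    }

    // Adjusts the expected total, never below the number of images already processed.
    synchronized String updateTotal(int newTotal) {
        total = Math.max(newTotal, processed);
        return status();
    }

    synchronized String status() {
        return processed + "/" + total;
    }

    public static void main(String[] args) {
        OcrProgressSketch progress = new OcrProgressSketch(10);
        System.out.println(progress.imageProcessed()); // prints 1/10
        System.out.println(progress.updateTotal(4));   // prints 1/4
    }
}
```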