
# OCR Service

## Overview

The OCR service extracts text content from PDF files. It uses the Azure IDP endpoint for the extraction.

## Dependencies

- [Leptonica](http://leptonica.org/)
- [Ghostscript](https://www.ghostscript.com/)
- [PDFTron](https://apryse.com/)
- [PDFBox](https://pdfbox.apache.org/)

## Functionality

1. Invisible Element and Watermark Removal
   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
2. Image Extraction
   Extracts all images from the PDF using PDFBox.
3. Image Processing
   Renders all pages that contain images using Ghostscript and processes them using Leptonica.
4. OCR Processing
   Calls the Azure API in batches and receives the bounding box and content of each text element.
5. Font Style Detection
   Detects bold text using stroke width estimation.
6. Text Integration
   Draws the resulting text onto the original PDF using PDFTron.

Steps 2-5 are run in parallel.
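
As a rough illustration of how steps 2-5 could overlap, the sketch below chains the stages per page with `CompletableFuture` on a shared thread pool. All class, method, and stage names are hypothetical placeholders; the actual service may organize the work differently (for example, with queues between stages).

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical sketch: each page flows through stages 2-5 independently,
// so different pages can be in different stages at the same time.
final class PipelineSketch {

    // Placeholder stages; the real ones would call PDFBox,
    // Ghostscript/Leptonica, and the Azure API.
    static String extractImages(int page) { return "images(p" + page + ")"; }
    static String processImages(String in) { return "processed(" + in + ")"; }
    static String runOcr(String in) { return "ocr(" + in + ")"; }
    static String detectFontStyle(String in) { return "styled(" + in + ")"; }

    static List<String> run(List<Integer> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = pages.stream()
                    .map(page -> CompletableFuture
                            .supplyAsync(() -> extractImages(page), pool)
                            .thenApplyAsync(PipelineSketch::processImages, pool)
                            .thenApplyAsync(PipelineSketch::runOcr, pool)
                            .thenApplyAsync(PipelineSketch::detectFontStyle, pool))
                    .collect(Collectors.toList());
            // Wait for all pages; results keep the input page order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        run(List.of(1, 2, 3), 4).forEach(System.out::println);
    }
}
```

Steps 1 and 6 are left out of the sketch because they operate on the whole document before and after the parallel section.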

## Installation

To run the OCR service, install the following dependencies:

1. Ghostscript:
   Install using apt.

   ```bash
   sudo apt install ghostscript
   ```

2. Leptonica:
   Install using [vcpkg](https://github.com/microsoft/vcpkg) with the command below,
   and set the environment variable `VCPKG_DYNAMIC_LIB` to your vcpkg lib folder
   (e.g. `~/vcpkg/installed/x64-linux-dynamic/lib`).

   ```bash
   vcpkg install leptonica --triplet x64-linux-dynamic
   ```

3. Other dependencies are handled by the Gradle build:

   ```bash
   gradle build
   ```

The Azure endpoint, the Azure key, and the PDFTron license must be set using the environment variables `AZURE_ENDPOINT`, `AZURE_KEY`, and `PDFTRON_LICENSE`.
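
A minimal sketch of a fail-fast startup check for these variables is shown below. The class and method names are hypothetical and not part of the service; only the three variable names come from this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: verifies that the required variables are present
// before the service starts doing any work.
final class RequiredEnvCheck {

    static final List<String> REQUIRED =
            List.of("PDFTRON_LICENSE", "AZURE_KEY", "AZURE_ENDPOINT");

    // Returns the names of required variables that are missing or blank.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(System.getenv());
        if (!missing.isEmpty()) {
            throw new IllegalStateException("Missing environment variables: " + missing);
        }
        System.out.println("All required environment variables are set.");
    }
}
```

Passing the environment as a `Map` keeps the check easy to exercise without touching the real process environment.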

## Configuration

Configuration settings are available in the `OcrServiceSettings` class.
These settings can be overridden using environment variables, e.g.
`OCR_SERVICE_OCR_THREAD_COUNT=16`.

Possible configurations and their defaults include:

```java
// Limits the number of concurrent calls to the Azure API. In very rudimentary testing,
// Azure started throwing "too many requests" errors at around 80/s.
// Higher numbers greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
// Writes the OCR layer visibly to the viewer document PDF.
boolean debug;
// Enables table detection, paragraph classification, section detection, and key-value detection.
boolean idpEnabled;
// Writes the tables to the PDF as invisible lines.
boolean tableDetection;
// If set, OCR is performed on every page, regardless of whether it contains images.
boolean processAllPages;
// Enables bold detection using Ghostscript and Leptonica.
boolean fontStyleDetection;
// Either markdown or text. For reasons unknown, Azure does not write key-values when markdown is enabled.
String contentFormat;
```
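
To illustrate how `batchSize` and `concurrency` might interact, the following sketch splits page numbers into batches and uses a `Semaphore` to cap the number of in-flight calls. This is an assumption about the mechanism, not the service's actual code; `callAzure` is a hypothetical stand-in for the real API request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.stream.Collectors;

final class BatchingSketch {

    // Splits the page list into consecutive chunks of at most batchSize pages.
    static List<List<Integer>> partition(List<Integer> pages, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int i = 0; i < pages.size(); i += batchSize) {
            batches.add(pages.subList(i, Math.min(i + batchSize, pages.size())));
        }
        return batches;
    }

    // Hypothetical stand-in for one Azure request covering a batch of pages.
    static String callAzure(List<Integer> batch) {
        return "ocr result for pages " + batch;
    }

    // Submits every batch, but lets at most `concurrency` calls run at once.
    static List<String> processAll(List<Integer> pages, int batchSize, int concurrency) {
        Semaphore permits = new Semaphore(concurrency);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<CompletableFuture<String>> calls = partition(pages, batchSize).stream()
                    .map(batch -> CompletableFuture.supplyAsync(() -> {
                        permits.acquireUninterruptibly();
                        try {
                            return callAzure(batch);
                        } finally {
                            permits.release();
                        }
                    }, pool))
                    .collect(Collectors.toList());
            // Collect results in batch order once all calls have finished.
            return calls.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }
}
```

With the defaults above, a 300-page document would be split into three batches (128, 128, and 44 pages), and up to eight such calls could run at the same time.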

## Integration

The OCR service communicates via RabbitMQ and uses the queues `ocr_request_queue`, `ocr_response_queue`,
`ocr_dead_letter_queue`, and `ocr_status_update_response_queue`.

### ocr_request_queue

This queue is used to start the OCR process. A `DocumentRequest` must be passed as the message. The service then downloads the PDF from the provided cloud storage.

### ocr_response_queue

This queue is used to signal the end of processing.

### ocr_dead_letter_queue

This queue is used to signal that an error has occurred during processing.

### ocr_status_update_response_queue

This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total count may change if fewer images are found than initially assumed.
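
Because the total can shrink while processing is under way, a consumer of these updates should not cache the first total it sees. The sketch below shows one way to track progress under that assumption; the class, method, and parameter names are hypothetical and do not reflect the service's actual message schema.

```java
// Hypothetical consumer-side tracker for status updates.
final class ProgressTracker {
    private int finished;
    private int total;

    // Always takes the latest total, since it may shrink between updates.
    void apply(int imagesFinished, int totalImages) {
        this.total = Math.max(totalImages, 0);
        this.finished = Math.min(Math.max(imagesFinished, 0), this.total);
    }

    // Progress in percent; reports 0 until a positive total is known.
    int percent() {
        return total == 0 ? 0 : (int) (100L * finished / total);
    }
}
```

For example, after `apply(5, 20)` the tracker reports 25 percent, and a later `apply(5, 10)` raises that to 50 percent even though no additional image was processed.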