# OCR Service

## Overview

The OCR service extracts text content from PDF files using the Azure IDP endpoint.

## Dependencies

- [Leptonica](http://leptonica.org/)
- [Ghostscript](https://www.ghostscript.com/)
- [PDFTron](https://apryse.com/)
- [PDFBox](https://pdfbox.apache.org/)

## Functionality

1. Invisible Element and Watermark Removal

   The service uses PDFTron to attempt to remove invisible elements and watermarks from the PDF.

2. Image Extraction

   Extracts all images from the PDF using PDFBox.

3. Image Processing

   Renders all pages that contain images using Ghostscript and processes them using Leptonica.

4. OCR Processing

   Calls the Azure API in batches and receives the bounding box and content of each text element.

5. Font Style Detection

   Detects bold text using stroke width estimation.

6. Text Integration

   Draws the resulting text onto the original PDF using PDFTron.

Steps 2-5 run in parallel.
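
Step 5 does not say how the stroke width is estimated. As a rough illustration of the idea only, the sketch below takes a binarized glyph image and uses the mean length of horizontal foreground runs as a stand-in estimate. The class name, the `boolean[][]` input, and the threshold rule are all assumptions made for this example; they are not the service's Leptonica-based implementation.

```java
// Illustrative sketch only: the service's real estimator is not documented here.
final class StrokeWidthSketch {

    /** Mean length of horizontal runs of foreground (true) pixels; 0 if there are none. */
    static double meanHorizontalRun(boolean[][] ink) {
        long totalInk = 0;
        long runs = 0;
        for (boolean[] row : ink) {
            boolean inRun = false;
            for (boolean pixel : row) {
                if (pixel) {
                    totalInk++;
                    if (!inRun) {
                        runs++; // a new run starts at this pixel
                        inRun = true;
                    }
                } else {
                    inRun = false;
                }
            }
        }
        return runs == 0 ? 0.0 : (double) totalInk / runs;
    }

    /** Hypothetical rule: bold if the estimate exceeds a fraction of the image height. */
    static boolean looksBold(boolean[][] ink, double ratioThreshold) {
        if (ink.length == 0) {
            return false;
        }
        return meanHorizontalRun(ink) / ink.length > ratioThreshold;
    }
}
```

Horizontal strokes (as in "E" or "T") inflate this estimate, so a real detector would likely need something more robust; the sketch only shows why thicker strokes yield a larger value.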

## Installation

To run the OCR service, install the two native dependencies below; Gradle handles the rest:

1. Ghostscript:
Install using apt.

```bash
sudo apt install ghostscript
```

2. Leptonica:
Install using [vcpkg](https://github.com/microsoft/vcpkg) with the command below, then set the environment variable `VCPKG_DYNAMIC_LIB` to your vcpkg lib folder (e.g. `~/vcpkg/installed/x64-linux-dynamic/lib`).

```bash
vcpkg install leptonica --triplet x64-linux-dynamic
```

3. All other dependencies are handled by the Gradle build:

```bash
gradle build
```

The Azure endpoint and key and the PDFTron license must be set using the environment variables `AZURE_ENDPOINT`, `AZURE_KEY`, and `PDFTRON_LICENSE`.
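
A missing variable is easier to diagnose at startup than in the middle of a run. The following is a minimal sketch of such a check, assuming a hypothetical helper class `RequiredEnvSketch`; only the three variable names come from this README.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the service.
final class RequiredEnvSketch {

    // Variable names as listed in this README.
    static final List<String> REQUIRED =
            List.of("PDFTRON_LICENSE", "AZURE_KEY", "AZURE_ENDPOINT");

    /** Returns the required variables that are unset or blank in the given environment. */
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Calling `RequiredEnvSketch.missing(System.getenv())` before starting the service would return an empty list when everything is configured.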

## Configuration

Configuration settings are defined in the `OcrServiceSettings` class.
These settings can be overridden using environment variables, e.g.
`OCR_SERVICE_OCR_THREAD_COUNT=16`.
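
How `OcrServiceSettings` binds environment variables is not described here. The sketch below only illustrates the override-with-default behaviour the example above implies; the helper `EnvOverrideSketch.intOrDefault` is hypothetical and the fallback value passed in is up to the caller.

```java
import java.util.Map;

// Hypothetical helper illustrating "environment variable overrides default".
final class EnvOverrideSketch {

    /** Returns env[name] parsed as an int, or fallback when it is unset or not a number. */
    static int intOrDefault(Map<String, String> env, String name, int fallback) {
        String raw = env.get(name);
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            // Silently falling back is an assumption; failing fast is equally reasonable.
            return fallback;
        }
    }
}
```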

Available settings and their defaults include:

```java
// Limits the number of concurrent calls to the Azure API. In my very rudimentary testing,
// Azure starts returning "too many requests" errors at around 80 requests/s.
// Higher values greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // Writes the OCR layer visibly to the viewer document PDF.
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, and key-value detection.
boolean tableDetection; // Writes the detected tables to the PDF as invisible lines.
boolean processAllPages; // If set, OCR is performed on every page, regardless of whether it contains images.
boolean fontStyleDetection; // Enables bold detection using Ghostscript and Leptonica.
String contentFormat; // Either markdown or text. For whatever reason, Azure does not return key-values when markdown is enabled.
```
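
To make the interplay of `batchSize` and `concurrency` concrete, here is a minimal sketch, assuming a hypothetical class `BatchingSketch`: pages are split into consecutive batches, and a shared `Semaphore` caps the number of calls in flight. It is not the service's actual scheduling code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical illustration of batching and a concurrency cap.
final class BatchingSketch {

    /** Splits page numbers into consecutive batches of at most batchSize pages. */
    static List<List<Integer>> toBatches(List<Integer> pages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 0; start < pages.size(); start += batchSize) {
            int end = Math.min(start + batchSize, pages.size());
            batches.add(new ArrayList<>(pages.subList(start, end)));
        }
        return batches;
    }

    /** Runs one call while holding a permit, so at most the semaphore's permit count run at once. */
    static <T> T withPermit(Semaphore permits, Supplier<T> call) {
        permits.acquireUninterruptibly();
        try {
            return call.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }
}
```

With the defaults above, 300 pages would be split into three calls of 128, 128, and 44 pages, and a `new Semaphore(8)` shared by all worker threads would keep at most eight calls in flight.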

## Integration

The OCR service communicates via RabbitMQ and uses the queues `ocr_request_queue`, `ocr_response_queue`,
`ocr_dead_letter_queue`, and `ocr_status_update_response_queue`.

### ocr_request_queue

This queue is used to start the OCR process. A `DocumentRequest` must be passed as the message; the service then downloads the PDF from the provided cloud storage.

### ocr_response_queue

This queue is used to signal the end of processing.

### ocr_dead_letter_queue

This queue is used to signal that an error has occurred during processing.

### ocr_status_update_response_queue

The OCR service uses this queue to report the progress of the ongoing OCR on an image-by-image basis. The total count may change when fewer images are found than initially assumed.
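
Because the total can shrink mid-run, a consumer of these updates should not treat it as fixed. The format of the status messages is not described here, so the sketch below only shows consumer-side bookkeeping under that assumption; `OcrProgressSketch` and its methods are hypothetical names.

```java
// Hypothetical consumer-side progress tracker; not the service's message format.
final class OcrProgressSketch {
    private int processed;
    private int total;

    OcrProgressSketch(int initialTotal) {
        this.total = Math.max(0, initialTotal);
    }

    /** Records one finished image. */
    synchronized void imageProcessed() {
        processed++;
        total = Math.max(total, processed); // never report more than 100 %
    }

    /** Applies a corrected total, e.g. when fewer images were found than assumed. */
    synchronized void adjustTotal(int newTotal) {
        total = Math.max(newTotal, processed);
    }

    /** Completed fraction in [0, 1]; 1.0 when there is nothing to process. */
    synchronized double fraction() {
        return total == 0 ? 1.0 : (double) processed / total;
    }
}
```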