# OCR Service
## Overview
The OCR service extracts text content from PDF files. It uses Tesseract, Leptonica, PDFTron, PDFBox, and Ghostscript to remove invisible elements and watermarks, extract images, stitch striped images, binarize images, run OCR on the processed images, and write the recognized text back to the original PDF. This makes it particularly useful for obtaining machine-readable text from PDF documents.

## Dependencies
- [Tesseract](https://github.com/tesseract-ocr/tesseract)
- [Leptonica](http://leptonica.org/)
- [PDFTron](https://apryse.com/)
- [PDFBox](https://pdfbox.apache.org/)
- [Ghostscript](https://www.ghostscript.com/)

## Functionality
1. Invisible Element and Watermark Removal

   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.

2. Image Extraction

   Extracts all images from the PDF using PDFBox.

3. Striped Image Detection and Stitching

   Detects whether images are striped and stitches them together using Ghostscript.

4. Image Processing

   - Convert to grayscale
   - Upscale to target DPI
   - Filter using a Gaussian kernel
   - Binarize using Leptonica and the Otsu thresholding algorithm
   - Despeckle using various morphological operations

5. OCR Processing

   Runs Tesseract on the images to extract text.

6. Font Style Detection

   Detects bold text using stroke width estimation.

7. Text Integration

   Draws the resulting text onto the original PDF using PDFBox.

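To illustrate the binarization in step 4, here is a minimal plain-Java sketch of Otsu thresholding. It is not the service's code: the service delegates this step to Leptonica, and the class and method names (`OtsuSketch`, `otsuThreshold`, `isDark`) are made up for this example. It assumes the input is an array of 8-bit gray values in the range 0..255.

```java
/** Illustrative Otsu thresholding on 8-bit grayscale values (0..255). */
final class OtsuSketch {

    /** Returns the threshold t that maximizes the between-class variance. */
    static int otsuThreshold(int[] gray) {
        int[] histogram = new int[256];
        for (int value : gray) {
            histogram[value]++; // assumes every value is within 0..255
        }
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * histogram[i];
        }

        long darkCount = 0; // number of pixels with value <= t
        long darkSum = 0;   // sum of their gray values
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            darkCount += histogram[t];
            darkSum += (long) t * histogram[t];
            long lightCount = gray.length - darkCount;
            if (darkCount == 0) {
                continue; // no dark class yet
            }
            if (lightCount == 0) {
                break; // no light class left
            }
            double darkMean = (double) darkSum / darkCount;
            double lightMean = (double) (sumAll - darkSum) / lightCount;
            double diff = darkMean - lightMean;
            // Proportional to the between-class variance (constant factor dropped).
            double variance = (double) darkCount * lightCount * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Marks pixels at or below the Otsu threshold as dark (likely text). */
    static boolean[] isDark(int[] gray) {
        int threshold = otsuThreshold(gray);
        boolean[] dark = new boolean[gray.length];
        for (int i = 0; i < gray.length; i++) {
            dark[i] = gray[i] <= threshold;
        }
        return dark;
    }
}
```

The threshold is chosen globally from the histogram, which is why grayscale conversion and smoothing come first: noise and color channels would otherwise blur the separation between the two pixel classes.
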
Steps 2 to 5 run in parallel and communicate via a blocking queue to limit RAM usage.
Because of this, thread counts need careful tuning for the best performance.
For example, with 18 available cores, I achieved the highest performance with 2 image extraction threads, 2 Ghostscript processes, and 16 OCR threads.

Setting all thread counts to effectively unlimited values (1000+) gives comparable performance without laborious tuning, but at the cost of potentially much more RAM.

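The hand-off between stages can be sketched with a bounded `java.util.concurrent.BlockingQueue`. This is a simplified two-stage illustration, not the service's actual pipeline: the class name, the `END_MARKER` sentinel, and the use of strings in place of images are all assumptions made for the example. The point it shows is that `put` blocks once the queue is full, so the extraction side cannot run arbitrarily far ahead of the OCR side and fill up memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedPipelineSketch {
    private static final String END_MARKER = "__END__"; // hypothetical sentinel

    /** Pushes images through a bounded queue to a pool of worker threads. */
    static List<String> run(List<String> images, int queueCapacity, int ocrThreads) {
        if (ocrThreads < 1) {
            throw new IllegalArgumentException("need at least one OCR thread");
        }
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(queueCapacity);
        List<String> results = Collections.synchronizedList(new ArrayList<>());

        Thread extractor = new Thread(() -> {
            try {
                for (String image : images) {
                    queue.put(image); // blocks while the queue is full
                }
                for (int i = 0; i < ocrThreads; i++) {
                    queue.put(END_MARKER); // one stop signal per worker
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < ocrThreads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    for (String image = queue.take(); !END_MARKER.equals(image); image = queue.take()) {
                        results.add("text of " + image); // stand-in for the OCR call
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers.add(worker);
            worker.start();
        }
        extractor.start();

        try {
            extractor.join();
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
        return results;
    }
}
```

In a setup like this the queue capacity bounds how many extracted images wait in memory at once, while the thread counts decide which stage becomes the bottleneck.
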
## Installation
To run the OCR service, ensure that the following dependencies are installed:

1. Ghostscript: Install using apt.
   ```bash
   sudo apt install ghostscript
   ```
2. Tesseract and Leptonica: Install using [vcpkg](https://github.com/microsoft/vcpkg) with the commands below, then set the environment variable `VCPKG_DYNAMIC_LIB` to your vcpkg lib folder (e.g. `~/vcpkg/installed/x64-linux-dynamic/lib`).
   ```bash
   vcpkg install tesseract --triplet x64-linux-dynamic
   ```
   ```bash
   vcpkg install leptonica --triplet x64-linux-dynamic
   ```
3. Other dependencies are handled by the Gradle build.
   ```bash
   gradle build
   ```

## Configuration
Configuration settings are available in the `OcrServiceSettings` class.
These settings can be overridden using environment variables, e.g.
`OCR_SERVICE_OCR_THREAD_COUNT=16`

Possible configurations and their defaults include:

```java
int ocrThreadCount = 4;          // Number of OCR threads
int imageExtractThreadCount = 4; // Number of image extraction threads
int gsProcessCount = 4;          // Number of Ghostscript processes
int dpi = 300;                   // Target DPI for binarized images
int psmOverride = -1;            // Overrides the page segmentation mode if > 0
int minImageHeight = 20;         // Minimum height for images to be processed
int minImageWidth = 20;          // Minimum width for images to be processed
boolean debug = false;           // If true, overlays OCR images with a grid and draws word bounding boxes
boolean removeWatermark;         // If false, watermarks will not be removed
// Languages loaded into Tesseract as 3-char codes; additional languages
// must also be installed in the Docker environment
String languages = "deu+eng";
```
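How an environment variable maps onto a setting can be sketched as follows. This is only an illustration: the README does not say which binding mechanism the service uses, and the naming rule below (prefix `OCR_SERVICE_` plus the field name in upper snake case) is inferred from the single `OCR_SERVICE_OCR_THREAD_COUNT` example above. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Map;

final class EnvOverrideSketch {

    /** Turns "ocrThreadCount" into "OCR_SERVICE_OCR_THREAD_COUNT" (assumed naming rule). */
    static String envName(String fieldName) {
        String snake = fieldName.replaceAll("([a-z0-9])([A-Z])", "$1_$2");
        return "OCR_SERVICE_" + snake.toUpperCase(Locale.ROOT);
    }

    /** Returns the override from env if present, otherwise the default. */
    static int intSetting(Map<String, String> env, String fieldName, int defaultValue) {
        String raw = env.get(envName(fieldName));
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim()); // throws NumberFormatException on bad input
    }
}
```

With the real process environment this would be called as `EnvOverrideSketch.intSetting(System.getenv(), "ocrThreadCount", 4)`.
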
## Integration

The OCR service communicates via RabbitMQ and uses the queues `ocrQueue`, `ocrDLQ`, and `ocr_status_update_response_queue`.

### ocrQueue
This queue is used to start the OCR process. A `DocumentRequest` must be passed as the message; the service then downloads the PDF from the provided cloud storage.
### ocr_status_update_response_queue
This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total count may change when fewer images are found than initially assumed.
This queue is also used to signal the end of processing.
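Because the total can shrink during a run, a consumer of these updates should recompute progress from the newest message and not cache the first total it sees. The sketch below shows that idea only. The message format is not described in this README, so the two fields (`finishedImages`, `totalImages`) and the class name are assumptions for illustration.

```java
/** Consumer-side progress tracking that tolerates a changing total. */
final class OcrProgressSketch {
    private boolean received;
    private int finished;
    private int total;

    /** Records the latest update; the total always comes from the newest message. */
    synchronized void onUpdate(int finishedImages, int totalImages) {
        if (finishedImages < 0 || totalImages < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        received = true;
        total = totalImages;
        finished = Math.min(finishedImages, totalImages);
    }

    /** Returns progress in percent, or 0 before the first update arrives. */
    synchronized int percent() {
        if (!received) {
            return 0;
        }
        if (total == 0) {
            return 100; // nothing to process
        }
        return (int) (100L * finished / total);
    }
}
```
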
### ocrDLQ
This queue is used to signal that an error has occurred during processing.