OCR Service

Overview

The OCR service is a tool designed for extracting text content from PDF files. It utilizes Tesseract, Leptonica, PDFTron, PDFBox, and Ghostscript to perform various tasks, including removing invisible elements and watermarks, extracting images, stitching striped images, binarizing images, running OCR on the processed images, and writing the recognized text back to the original PDF. This service is particularly useful for obtaining machine-readable text from PDF documents.

Dependencies

Tesseract
Leptonica
PDFTron
PDFBox
Ghostscript

Functionality

1. Invisible Element and Watermark Removal
   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
2. Image Extraction
   Extracts all images from the PDF using PDFBox.
3. Striped Image Detection and Stitching
   Detects whether images are striped and stitches them together using Ghostscript.
4. Image Processing
   - Convert to grayscale
   - Upscale to target DPI
   - Filter using a Gauss kernel
   - Binarize the resulting images using Leptonica and the Otsu thresholding algorithm
   - Despeckle using various morphological operations
5. OCR Processing
   Runs Tesseract on the images to extract text.
6. Font Style Detection
   Detects bold text using stroke width estimation.
7. Text Integration
   Draws the resulting text onto the original PDF using PDFBox.
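
The binarization in the image processing step relies on Otsu thresholding. The following is a minimal, self-contained sketch of that algorithm on a 256-bin grayscale histogram. It is only an illustration of the technique: the service itself delegates this work to Leptonica, and the class and method names here (OtsuSketch, otsuThreshold) are hypothetical.

```java
/** Illustrative Otsu threshold on a grayscale histogram (not Leptonica's implementation). */
final class OtsuSketch {

    /**
     * Returns the threshold t in [0, 255] that maximizes the between-class variance.
     * Pixels with value <= t form the dark class, all others the light class.
     */
    static int otsuThreshold(int[] histogram) {
        if (histogram.length != 256) {
            throw new IllegalArgumentException("expected a 256-bin histogram");
        }
        long total = 0;
        long weightedSum = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            weightedSum += (long) i * histogram[i];
        }

        long darkCount = 0;
        long darkSum = 0;
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            darkCount += histogram[t];
            darkSum += (long) t * histogram[t];
            if (darkCount == 0) {
                continue; // no dark pixels yet, nothing to separate
            }
            long lightCount = total - darkCount;
            if (lightCount == 0) {
                break; // every pixel is already in the dark class
            }
            double darkMean = (double) darkSum / darkCount;
            double lightMean = (double) (weightedSum - darkSum) / lightCount;
            double diff = darkMean - lightMean;
            // Between-class variance, up to a constant factor of 1 / total^2.
            double variance = (double) darkCount * lightCount * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    public static void main(String[] args) {
        // Toy bimodal histogram: dark text pixels at 50, light background at 200.
        int[] histogram = new int[256];
        histogram[50] = 100;
        histogram[200] = 100;
        System.out.println("threshold = " + otsuThreshold(histogram));
    }
}
```

For the toy histogram in main, any threshold between the two peaks separates text from background; the sketch keeps the first maximum it finds.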

Steps 2 to 5 run in parallel and communicate via a blocking queue to limit RAM usage. Choosing the thread counts carefully therefore gives the best performance. For example, with 18 available cores, I achieved the highest performance with 2 image extraction threads, 2 Ghostscript processes, and 16 OCR threads.

Setting all thread counts to an effectively unlimited value (1000+) gives comparable performance without laborious tuning, but at the cost of potentially much higher RAM usage.
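
The way a blocking queue caps memory use can be sketched as follows. This is a simplified illustration, assuming a single producer and a pool of OCR workers; the class name, the queue capacity, and the end-of-stream marker are hypothetical and not taken from the service's code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative bounded producer/consumer pipeline (hypothetical, not the service's code). */
final class BoundedPipelineSketch {

    // End-of-stream marker, compared by identity; one is sent per worker.
    private static final int[] POISON = new int[0];

    /** Pushes imageCount dummy images through a bounded queue and returns how many were processed. */
    static int run(int imageCount, int ocrThreads, int queueCapacity) {
        BlockingQueue<int[]> queue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(ocrThreads + 1);

        // Producer: stands in for image extraction. put() blocks while the queue is full,
        // so at most queueCapacity images wait in memory at any time.
        pool.execute(() -> {
            try {
                for (int i = 0; i < imageCount; i++) {
                    queue.put(new int[] {i});
                }
                for (int i = 0; i < ocrThreads; i++) {
                    queue.put(POISON);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumers: stand in for OCR threads. take() blocks while the queue is empty.
        for (int w = 0; w < ocrThreads; w++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        int[] image = queue.take();
                        if (image == POISON) {
                            break;
                        }
                        processed.incrementAndGet(); // placeholder for the real OCR work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                throw new IllegalStateException("pipeline did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("processed = " + run(100, 4, 8));
    }
}
```

In this sketch the queue capacity, not the thread count, bounds how many images are held between stages, which is why a slow consumer throttles the producer instead of letting memory grow.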

Installation

To run the OCR service, ensure that the following dependencies are installed:

Ghostscript: Install using apt.

sudo apt install ghostscript

Tesseract and Leptonica: Install using vcpkg with the commands below, and set the environment variable VCPKG_DYNAMIC_LIB to your vcpkg lib folder (e.g. ~/vcpkg/installed/x64-linux-dynamic/lib).

vcpkg install tesseract --triplet x64-linux-dynamic

vcpkg install leptonica --triplet x64-linux-dynamic

All other dependencies are handled by the Gradle build:

gradle build

Configuration

Configuration settings are available in the OcrServiceSettings class.
These settings can be overridden using environment variables, for example:
OCR_SERVICE_OCR_THREAD_COUNT=16

Possible configurations and their defaults include:

int ocrThreadCount = 4; // Number of OCR threads
int imageExtractThreadCount = 4; // Number of image extraction threads
int gsProcessCount = 4; // Number of Ghostscript processes
int dpi = 300; // Target DPI for binarized images
int psmOverride = -1; // Overrides the page segmentation mode if > 0
int minImageHeight = 20; // Minimum height for images to be processed
int minImageWidth = 20; // Minimum width for images to be processed
boolean debug = false; // If true, overlays OCR images with a grid and draws word bounding boxes
boolean removeWatermark; // If false, watermarks will not be removed
String languages = "deu+eng"; // Languages loaded into Tesseract, given as 3-character codes; additional languages must also be installed in the Docker environment

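The override semantics described above (use the environment variable when it is set, otherwise fall back to the default) can be pictured with the following sketch. It is an assumption-laden illustration only: the helper intSetting and the class name are hypothetical, and the service may well bind its settings through a framework mechanism rather than code like this. Only the variable name OCR_SERVICE_OCR_THREAD_COUNT and the default of 4 come from this README.

```java
import java.util.Map;

/** Illustrative environment-variable override with a default (hypothetical helper). */
final class EnvOverrideSketch {

    /** Returns the integer value of the variable if it is set and non-blank, else the default. */
    static int intSetting(Map<String, String> env, String name, int defaultValue) {
        String raw = env.get(name);
        if (raw == null || raw.isBlank()) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // Falls back to the documented default of 4 when the variable is not set.
        int ocrThreadCount = intSetting(System.getenv(), "OCR_SERVICE_OCR_THREAD_COUNT", 4);
        System.out.println("ocrThreadCount = " + ocrThreadCount);
    }
}
```

Passing the environment in as a map keeps the lookup easy to exercise without touching the real process environment.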
Integration

The OCR-service communicates via RabbitMQ and uses the queues ocrQueue, ocrDLQ, and ocr_status_update_response_queue.

ocrQueue

This queue is used to start the OCR process. A DocumentRequest must be passed as the message. The service then downloads the PDF from the provided cloud storage.

ocr_status_update_response_queue

This queue is used by the OCR service to report the progress of an ongoing OCR run on an image-by-image basis. The total number of images may change if fewer images are found than initially assumed. This queue is also used to signal the end of processing.
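
A consumer of these updates has to cope with a total that can shrink during processing. The sketch below shows one way to track progress under that condition. The update fields (processed count, total count, finished flag) and the class name are hypothetical; the actual message format is not described here and should be taken from the service's code.

```java
/** Illustrative progress tracker for status updates (message fields are hypothetical). */
final class OcrProgressSketch {

    private int processedImages;
    private int totalImages;
    private boolean finished;

    /** Applies one status update; the total is taken from the latest update because it may shrink. */
    void onStatusUpdate(int processed, int total, boolean isFinished) {
        this.processedImages = processed;
        this.totalImages = total;
        this.finished = isFinished;
    }

    /** Returns progress in percent, clamped to [0, 100]. */
    int percent() {
        if (finished) {
            return 100;
        }
        if (totalImages <= 0) {
            return 0;
        }
        return Math.min(100, processedImages * 100 / totalImages);
    }

    public static void main(String[] args) {
        OcrProgressSketch progress = new OcrProgressSketch();
        progress.onStatusUpdate(3, 10, false);
        System.out.println(progress.percent() + "%");
        progress.onStatusUpdate(4, 8, false); // total corrected downwards
        System.out.println(progress.percent() + "%");
    }
}
```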

ocrDLQ

This queue is used to signal that an error occurred during processing.
