# OCR Service

## Overview

The OCR service extracts text content from PDF files. It uses the Azure IDP endpoint for the extraction.

## Dependencies

- [Leptonica](http://leptonica.org/)
- [Ghostscript](https://www.ghostscript.com/)
- [PDFTron](https://apryse.com/)
- [PDFBox](https://pdfbox.apache.org/)

## Functionality

1. Invisible Element and Watermark Removal

   The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.

2. Image Extraction

   Extracts all images from the PDF using PDFBox.

3. Image Processing

   Renders all pages with images using Ghostscript and processes them using Leptonica.

4. OCR Processing

   Calls the Azure API in batches and receives the bounding box and content of each piece of text.

5. Font Style Detection

   Detects bold text using stroke width estimation.

6. Text Integration

   Draws the resulting text onto the original PDF using PDFTron.

Steps 2-5 run in parallel.

## Installation

To build and run the OCR service, install the native dependencies and then build with Gradle:

1. Ghostscript: Install using apt.

   ```bash
   sudo apt install ghostscript
   ```

2. Leptonica: Install using [vcpkg](https://github.com/microsoft/vcpkg) with the command below, and set the environment variable `VCPKG_DYNAMIC_LIB` to your vcpkg lib folder (e.g. `~/vcpkg/installed/x64-linux-dynamic/lib`).

   ```bash
   vcpkg install leptonica --triplet x64-linux-dynamic
   ```

3. Other dependencies are handled by the Gradle build.

   ```bash
   gradle build
   ```

The Azure endpoint and key and the PDFTron license must be set using the environment variables `PDFTRON_LICENSE`, `AZURE_KEY`, and `AZURE_ENDPOINT`.

## Configuration

Configuration settings are available in the `OcrServiceSettings` class. These settings can be overridden using environment variables, e.g. `OCR_SERVICE_OCR_THREAD_COUNT=16`.

Possible configurations and their defaults include:

```java
// Limits the number of concurrent calls to the Azure API.
// In my very rudimentary testing, Azure starts throwing "too many requests" errors at around 80/s.
// Higher numbers greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // Writes the OCR layer visibly to the viewer doc PDF.
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, key-value detection.
boolean drawTablesAsLines; // Writes the tables to the PDF as invisible lines.
boolean processAllPages; // If set, OCR is performed on every page, regardless of whether it has images.
boolean fontStyleDetection; // Enables bold detection using Ghostscript and Leptonica.
String contentFormat; // Either markdown or text. For whatever reason, with markdown enabled, key-values are not written by Azure.
```

## Integration

The OCR service communicates via RabbitMQ and uses the queues `ocr_request_queue`, `ocr_response_queue`, `ocr_dead_letter_queue`, and `ocr_status_update_response_queue`.

### ocr_request_queue

This queue is used to start the OCR process. A `DocumentRequest` must be passed as the message. The service then downloads the PDF from the provided cloud storage.

### ocr_response_queue

This queue is used to signal the end of processing.

### ocr_dead_letter_queue

This queue is used to signal that an error has occurred during processing.

### ocr_status_update_response_queue

This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total may change when fewer images are found than initially assumed.
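
Because the total reported on `ocr_status_update_response_queue` can shrink mid-run, a consumer should not cache the first total it sees. The following is a minimal consumer-side sketch of that idea. The class `OcrProgressTracker` and the two counters passed to `update` are hypothetical: the actual payload of the status message is not documented here, so the sketch assumes it carries a processed-image count and a current total.

```java
import java.util.Locale;

/** Hypothetical consumer-side helper; not part of the OCR service itself. */
final class OcrProgressTracker {
    private int processed;
    private int total;

    /**
     * Applies one status update. The total is always taken from the latest
     * message, since the service may lower it while processing.
     * Both parameters are assumed fields of the status message.
     */
    void update(int processedImages, int totalImages) {
        if (processedImages < 0 || totalImages < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        this.total = totalImages;
        // Clamp so that a shrinking total never reports more than 100 %.
        this.processed = Math.min(processedImages, totalImages);
    }

    /** Progress in the range [0, 1]; 0 while no total is known. */
    double fraction() {
        return total == 0 ? 0.0 : (double) processed / total;
    }

    /** Human-readable summary for logs or a UI. */
    String describe() {
        return String.format(Locale.ROOT, "%d/%d images (%.0f%%)",
                processed, total, fraction() * 100);
    }
}
```

A consumer would call `update` once per received status message and read `fraction()` or `describe()` afterwards. Taking the total from every message, rather than only the first, keeps the displayed progress consistent when the service revises its estimate downwards.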