Merge branch 'RED-9353' into 'main'

RED-9353: initial implementation

See merge request fforesight/azure-ocr-service!1
Kilian Schüttler 2024-07-19 10:04:15 +02:00
commit 0b451d7155
95 changed files with 5833 additions and 75 deletions

.gitlab-ci.yml

@ -0,0 +1,25 @@
variables:
  # SONAR_PROJECT_KEY: 'ocr-service:ocr-service-server'
  GIT_SUBMODULE_STRATEGY: recursive
  GIT_SUBMODULE_FORCE_HTTPS: 'true'
include:
  - project: 'gitlab/gitlab'
    ref: 'main'
    file: 'ci-templates/gradle_java.yml'
deploy:
  stage: deploy
  tags:
    - dind
  script:
    - echo "Building with gradle version ${BUILDVERSION}"
    - gradle -Pversion=${BUILDVERSION} publish
    - gradle bootBuildImage --publishImage -PbuildbootDockerHostNetwork=true -Pversion=${BUILDVERSION}
    - echo "BUILDVERSION=$BUILDVERSION" >> version.env
  artifacts:
    reports:
      dotenv: version.env
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_COMMIT_BRANCH =~ /^release/
    - if: $CI_COMMIT_TAG

README.md

@ -1,93 +1,99 @@
# OCR Service
## Overview
The OCR service is a tool designed for extracting text content from PDF files. It utilizes the Azure IDP endpoint for the extraction.
## Dependencies
[Leptonica](http://leptonica.org/)
[Ghostscript](https://www.ghostscript.com/)
[PDFTron](https://apryse.com/)
[PDFBox](https://pdfbox.apache.org/)
## Functionality
1. Invisible Element and Watermark Removal
The service uses PDFTron to attempt the removal of invisible elements and watermarks from the PDF.
2. Image Extraction
Extracts all images from the PDF using PDFBox.
3. Image Processing
Renders all pages with images using Ghostscript and processes them using Leptonica.
4. OCR Processing
Calls the Azure API in batches and receives the bounding box and content of the recognized text.
5. Font Style Detection
Detects bold text using stroke width estimation.
6. Text Integration
Draws the resulting text onto the original PDF using PDFTron.
Steps 2-5 are run in parallel.
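
One possible reading of this is that pages move through steps 2-5 independently, so different steps are active for different pages at the same time, and step 6 only starts once every page is done. The following is a minimal sketch of that idea, not the service's actual implementation. The class `PipelineSketch`, its stage methods, and the string results are all hypothetical placeholders.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: the stage methods only tag a string so the flow is visible.
final class PipelineSketch {

    static String extractImage(int page) { return "page" + page; }            // stands in for step 2
    static String processImage(String image) { return image + "|processed"; } // stands in for step 3
    static String ocr(String image) { return image + "|text"; }               // stands in for step 4
    static String detectFontStyle(String text) { return text + "|styled"; }   // stands in for step 5

    // Starts one chain of stages per page on a shared pool and waits for all chains.
    static List<String> runParallelSteps(List<Integer> pages, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = pages.stream()
                    .map(page -> CompletableFuture.supplyAsync(() -> extractImage(page), pool)
                            .thenApply(PipelineSketch::processImage)
                            .thenApply(PipelineSketch::ocr)
                            .thenApply(PipelineSketch::detectFontStyle))
                    .toList();
            // join() in list order keeps the results sorted by page, whatever the completion order
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Step 6 (text integration) would consume this list after all pages have finished.
        runParallelSteps(List.of(1, 2, 3), 2).forEach(System.out::println);
    }
}
```

Collecting the futures before joining them is what allows the chains to overlap; joining inside the first stream would process the pages one after another.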
## Installation
To run the OCR service, install the following dependencies and then build the project:
1. Ghostscript:
Install using apt.
```bash
sudo apt install ghostscript
```
2. Leptonica:
Install using [vcpkg](https://github.com/microsoft/vcpkg) with the command below,
and set the environment variable `VCPKG_DYNAMIC_LIB` to your vcpkg lib folder (e.g. `~/vcpkg/installed/x64-linux-dynamic/lib`).
```
vcpkg install leptonica --triplet x64-linux-dynamic
```
3. Other dependencies are handled by the Gradle build:
```bash
gradle build
```
The Azure endpoint and key and the PDFTron license must be set using environment variables (`PDFTRON_LICENSE`, `AZURE_KEY`, `AZURE_ENDPOINT`).
## Configuration
Configuration settings are available in the OcrServiceSettings class.
These settings can be overridden using environment variables, for example:
`OCR_SERVICE_CONCURRENCY=16`
Possible configurations and their defaults include:
```java
// Limits the number of concurrent calls to the azure API. In my very rudimentary testing, azure starts throwing "too many requests" errors at around 80/s. Higher numbers greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // writes the ocr layer visibly to the viewer doc pdf
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, key-value detection.
boolean tableDetection; // writes the tables to the PDF as invisible lines.
boolean processAllPages; // if this parameter is set, ocr will be performed on any page, regardless if it has images or not
boolean fontStyleDetection = true; // Enables bold detection using ghostscript and leptonica
String contentFormat; // Either markdown or text. But, for whatever reason, with markdown enabled, key-values are not written by azure....
```
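
To illustrate how `batchSize` and `concurrency` interact, here is a minimal sketch, assuming pages are split into consecutive batches and each batch becomes one API call on a bounded thread pool. The class `BatchingSketch` and its methods are hypothetical and not part of the service, and the "API call" is a placeholder that only returns the batch size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from the service itself.
final class BatchingSketch {

    // Splits page numbers 1..pageCount into consecutive batches of at most batchSize pages.
    static List<List<Integer>> toBatches(int pageCount, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += batchSize) {
            List<Integer> batch = new ArrayList<>();
            for (int page = start; page < start + batchSize && page <= pageCount; page++) {
                batch.add(page);
            }
            batches.add(batch);
        }
        return batches;
    }

    // Submits one placeholder call per batch; the pool size caps the calls in flight at `concurrency`.
    static int processAll(List<List<Integer>> batches, int concurrency) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<Integer> batch : batches) {
                Callable<Integer> call = batch::size; // stand-in for the real OCR request
                results.add(pool.submit(call));
            }
            int processedPages = 0;
            for (Future<Integer> result : results) {
                processedPages += result.get();
            }
            return processedPages;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Integer>> batches = toBatches(300, 128);
        System.out.println(batches.size() + " batches, " + processAll(batches, 8) + " pages");
    }
}
```

With the defaults above, a 300-page document would yield three batches (128, 128 and 44 pages), so at most three of the eight allowed concurrent calls would be used.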
## Integration
The OCR-service communicates via RabbitMQ and uses the queues `ocr_request_queue`, `ocr_response_queue`,
`ocr_dead_letter_queue`, and `ocr_status_update_response_queue`.
### ocr_request_queue
This queue is used to start the OCR process. A DocumentRequest must be passed as the message. The service will then download the PDF from the provided cloud storage.
### ocr_response_queue
This queue is used to signal the end of processing.
### ocr_dead_letter_queue
This queue is used to signal that an error has occurred during processing.
### ocr_status_update_response_queue
This queue is used by the OCR service to report the progress of the ongoing OCR on an image-by-image basis. The total may change when fewer images are found than
initially assumed.
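
A consumer of these status updates has to tolerate a total that shrinks during processing. The following is a minimal sketch of a progress calculation under that assumption. The record `StatusUpdate` is a hypothetical stand-in that mirrors the fields of `OCRStatusUpdateResponse`, and `ProgressSketch` is not part of the service.

```java
// Hypothetical mirror of the status message fields; the real class is OCRStatusUpdateResponse.
record StatusUpdate(String fileId, int numberOfPagesToOCR, int numberOfOCRedPages, boolean ocrStarted, boolean ocrFinished) {
}

final class ProgressSketch {

    // Returns a percentage in [0, 100], clamping when the total drops below the processed count.
    static int percentDone(StatusUpdate update) {
        if (update.ocrFinished()) {
            return 100;
        }
        if (!update.ocrStarted() || update.numberOfPagesToOCR() <= 0) {
            return 0;
        }
        int done = Math.min(update.numberOfOCRedPages(), update.numberOfPagesToOCR());
        return (int) (100L * done / update.numberOfPagesToOCR());
    }

    public static void main(String[] args) {
        System.out.println(percentDone(new StatusUpdate("file-1", 10, 5, true, false)) + "%");
    }
}
```

Clamping the processed count to the total keeps the displayed progress from exceeding 100 % if an update with a reduced total arrives late.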


@ -0,0 +1,5 @@
plugins {
`maven-publish`
id("com.knecon.fforesight.service.java-conventions")
id("io.freefair.lombok") version "8.4"
}


@ -0,0 +1,25 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import java.util.ArrayList;
import java.util.List;
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@Getter
@Builder
@AllArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class AzureAnalyzeResult {
@Builder.Default
List<KeyValuePair> keyValuePairs = new ArrayList<>();
@Builder.Default
List<TextRegion> handWrittenText = new ArrayList<>();
@Builder.Default
List<Figure> figures = new ArrayList<>();
}


@ -0,0 +1,68 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import java.util.Optional;
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;
import lombok.experimental.FieldDefaults;
@Getter
@Builder
@AllArgsConstructor
@NoArgsConstructor
@FieldDefaults(level = AccessLevel.PRIVATE)
public class DocumentRequest {
String dossierId;
String fileId;
String originDocumentId;
String viewerDocId;
String idpResultId;
boolean removeWatermarks;
public DocumentRequest(String dossierId, String fileId) {
this.dossierId = dossierId;
this.fileId = fileId;
originDocumentId = null;
viewerDocId = null;
idpResultId = null;
removeWatermarks = false;
}
// needed for backwards compatibility
public DocumentRequest(String dossierId, String fileId, boolean removeWatermarks) {
this.dossierId = dossierId;
this.fileId = fileId;
this.removeWatermarks = removeWatermarks;
originDocumentId = null;
viewerDocId = null;
idpResultId = null;
}
public Optional<String> optionalOriginDocumentId() {
return Optional.ofNullable(originDocumentId);
}
public Optional<String> optionalViewerDocumentId() {
return Optional.ofNullable(viewerDocId);
}
public Optional<String> optionalIdpResultId() {
return Optional.ofNullable(idpResultId);
}
}


@ -0,0 +1,10 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import java.util.Optional;
import lombok.Builder;
@Builder
public record Figure(Optional<TextRegion> caption, Region image) {
}


@ -0,0 +1,9 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import lombok.Builder;
@Builder
public record KeyValuePair(TextRegion key, TextRegion value) {
}


@ -0,0 +1,20 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@AllArgsConstructor
@NoArgsConstructor
public class OCRStatusUpdateResponse {
private String fileId;
private int numberOfPagesToOCR;
private int numberOfOCRedPages;
private boolean ocrFinished;
private boolean ocrStarted;
}


@ -0,0 +1,150 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import java.awt.geom.AffineTransform;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.stream.Stream;
public record QuadPoint(Point2D a, Point2D b, Point2D c, Point2D d) {
/*
B _____ C
| |
A|_____|D
*/
public static QuadPoint fromRectangle2D(Rectangle2D rectangle2D) {
return new QuadPoint(new Point2D.Double(rectangle2D.getX(), rectangle2D.getY()),
new Point2D.Double(rectangle2D.getX(), rectangle2D.getMaxY()),
new Point2D.Double(rectangle2D.getMaxX(), rectangle2D.getMaxY()),
new Point2D.Double(rectangle2D.getMaxX(), rectangle2D.getY()));
}
public static QuadPoint fromPolygons(List<Double> polygon) {
assert polygon.size() == 8;
return new QuadPoint(new Point2D.Double(polygon.get(0), polygon.get(1)),
new Point2D.Double(polygon.get(6), polygon.get(7)),
new Point2D.Double(polygon.get(4), polygon.get(5)),
new Point2D.Double(polygon.get(2), polygon.get(3)));
}
public Rectangle2D getBounds2D() {
double minX = Math.min(Math.min(Math.min(a.getX(), b.getX()), c.getX()), d.getX());
double minY = Math.min(Math.min(Math.min(a.getY(), b.getY()), c.getY()), d.getY());
double maxX = Math.max(Math.max(Math.max(a.getX(), b.getX()), c.getX()), d.getX());
double maxY = Math.max(Math.max(Math.max(a.getY(), b.getY()), c.getY()), d.getY());
return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
}
public static QuadPoint fromData(QuadPointData data) {
return new QuadPoint(new Point2D.Double(data.values()[0], data.values()[1]),
new Point2D.Double(data.values()[2], data.values()[3]),
new Point2D.Double(data.values()[4], data.values()[5]),
new Point2D.Double(data.values()[6], data.values()[7]));
}
public Stream<Line2D> asLines() {
return Stream.of(new Line2D.Double(a(), b()), new Line2D.Double(b(), c()), new Line2D.Double(c(), d()), new Line2D.Double(d(), a()));
}
public QuadPointData data() {
return new QuadPointData(new float[]{(float) a.getX(), (float) a.getY(), (float) b.getX(), (float) b.getY(), (float) c.getX(), (float) c.getY(), (float) d.getX(), (float) d.getY()});
}
public QuadPoint getTransformed(AffineTransform at) {
return new QuadPoint(at.transform(a, null), at.transform(b, null), at.transform(c, null), at.transform(d, null));
}
/**
* Determines if the given QuadPoint aligns with this QuadPoint within a given threshold.
* It does so by trying every possible combination of aligning sides. It starts with the most likely combination of ab and cd.
*
* @param other The QuadPoint to compare with.
* @param threshold The maximum distance allowed for alignment.
* @return True if the QuadPoints align within the threshold, false otherwise.
*/
public boolean aligns(QuadPoint other, double threshold) {
Line2D ab = new Line2D.Double(a, b);
Line2D bc = new Line2D.Double(b, c);
Line2D cd = new Line2D.Double(c, d);
Line2D da = new Line2D.Double(d, a);
Line2D ab2 = new Line2D.Double(other.a, other.b);
Line2D bc2 = new Line2D.Double(other.b, other.c);
Line2D cd2 = new Line2D.Double(other.c, other.d);
Line2D da2 = new Line2D.Double(other.d, other.a);
List<Line2D> lines = List.of(ab, cd, bc, da);
List<Line2D> lines2 = List.of(cd2, ab2, bc2, da2);
return lines.stream()
.anyMatch(line -> lines2.stream()
.anyMatch(line2 -> aligns(line, line2, threshold)));
}
private static boolean aligns(Line2D a, Line2D b, double threshold) {
return aligns(a.getP1(), a.getP2(), b.getP1(), b.getP2(), threshold);
}
private static boolean aligns(Point2D a, Point2D b, Point2D a2, Point2D b2, double threshold) {
if (a.distance(a2) < threshold && b.distance(b2) < threshold) {
return true;
}
return a.distance(b2) < threshold && b.distance(a2) < threshold;
}
@Override
public String toString() {
return String.format("A:(%.2f, %.2f) | B:(%.2f, %.2f) | C:(%.2f, %.2f) | D:(%.2f, %.2f)",
a().getX(),
a().getY(),
b().getX(),
b().getY(),
c().getX(),
c().getY(),
d().getX(),
d().getY());
}
public double size() {
return a().distance(b()) * a().distance(d());
}
public double angle() {
double deltaY = d.getY() - a.getY();
double deltaX = d.getX() - a.getX();
return Math.atan2(deltaY, deltaX);
}
}


@ -0,0 +1,8 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import lombok.Builder;
@Builder
public record QuadPointData(float[] values) {
}


@ -0,0 +1,8 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import lombok.Builder;
@Builder
public record Region(int pageNumber, QuadPointData bbox) {
}


@ -0,0 +1,8 @@
package com.knecon.fforesight.service.ocr.v1.api.model;
import lombok.Builder;
@Builder
public record TextRegion(Region region, String text) {
}


@ -0,0 +1,28 @@
plugins {
id("com.knecon.fforesight.service.java-conventions")
id("io.freefair.lombok") version "8.4"
}
configurations {
all {
exclude(group = "org.springframework.boot", module = "spring-boot-starter-logging")
}
}
dependencies {
api(project(":azure-ocr-service-api"))
api("com.iqser.red.service:persistence-service-internal-api-v1:2.224.0")
api("net.sourceforge.tess4j:tess4j:5.8.0")
api("com.iqser.red.commons:metric-commons:2.1.0")
api("com.iqser.red.commons:storage-commons:2.49.0")
api("com.knecon.fforesight:tenant-commons:0.26.0")
api("com.pdftron:PDFNet:10.7.0")
api("org.apache.pdfbox:pdfbox:3.0.0")
api("org.apache.commons:commons-math3:3.6.1")
api("com.amazonaws:aws-java-sdk-kms:1.12.440")
api("com.google.guava:guava:31.1-jre")
api("com.iqser.red.commons:pdftron-logic-commons:2.27.0")
api("com.knecon.fforesight:viewer-doc-processor:0.148.0")
api("com.azure:azure-ai-documentintelligence:1.0.0-beta.3")
testImplementation("org.junit.jupiter:junit-jupiter:5.8.1")
}


@ -0,0 +1,25 @@
package com.knecon.fforesight.service.ocr.processor;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.context.properties.EnableConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import com.knecon.fforesight.service.viewerdoc.service.PDFTronViewerDocumentService;
import io.micrometer.observation.ObservationRegistry;
@Configuration
@ComponentScan
@EnableConfigurationProperties(OcrServiceSettings.class)
public class OcrServiceProcessorConfiguration {
@Bean
@Autowired
public PDFTronViewerDocumentService viewerDocumentService(ObservationRegistry registry) {
return new PDFTronViewerDocumentService(registry);
}
}


@ -0,0 +1,26 @@
package com.knecon.fforesight.service.ocr.processor;
import org.springframework.boot.context.properties.ConfigurationProperties;
import lombok.AccessLevel;
import lombok.Data;
import lombok.experimental.FieldDefaults;
@Data
@ConfigurationProperties("ocr-service")
@FieldDefaults(level = AccessLevel.PRIVATE)
public class OcrServiceSettings {
// Limits the number of concurrent calls to the azure API. In my very rudimentary testing, azure starts throwing "too many requests" errors at around 80/s. Higher numbers greatly improve the speed.
int concurrency = 8;
// Limits the number of pages per call.
int batchSize = 128;
boolean debug; // writes the ocr layer visibly to the viewer doc pdf
boolean idpEnabled; // Enables table detection, paragraph classification, section detection, key-value detection.
boolean tableDetection; // writes the tables to the PDF as invisible lines.
boolean processAllPages; // if this parameter is set, ocr will be performed on any page, regardless if it has images or not
boolean fontStyleDetection = true; // Enables bold detection using ghostscript and leptonica
String contentFormat; // Either markdown or text. But, for whatever reason, with markdown enabled, key-values are not written by azure....
}


@ -0,0 +1,44 @@
package com.knecon.fforesight.service.ocr.processor.initializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;
import com.pdftron.pdf.PDFNet;
import com.sun.jna.NativeLibrary;
import jakarta.annotation.PostConstruct;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Component
@RequiredArgsConstructor
public class NativeLibrariesInitializer {
@Value("${pdftron.license:}")
private String pdftronLicense;
@SneakyThrows
@PostConstruct
// Do not change back to application runner, if it is application runner it takes messages from the queue before PDFNet is initialized, that leads to UnsatisfiedLinkError.
public void init() {
log.info("Initializing Native Libraries");
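// Do not write the license key itself to the log; only report whether one is configured.
log.info("Setting pdftron license (configured: {})", !pdftronLicense.isEmpty());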
PDFNet.setTempPath("/tmp/pdftron");
PDFNet.initialize(pdftronLicense);
log.info("Setting jna.library.path: {}", System.getenv("VCPKG_DYNAMIC_LIB"));
System.setProperty("jna.library.path", System.getenv("VCPKG_DYNAMIC_LIB"));
log.info("Asserting Native Libraries loaded");
try (NativeLibrary leptonicaLib = NativeLibrary.getInstance("leptonica")) {
assert leptonicaLib != null;
log.info("Leptonica library loaded from {}", leptonicaLib.getFile().getAbsolutePath());
}
}
}


@ -0,0 +1,13 @@
package com.knecon.fforesight.service.ocr.processor.model;
import net.sourceforge.lept4j.Leptonica1;
import net.sourceforge.lept4j.Pix;
public record ImageFile(int pageNumber, String absoluteFilePath) {
public Pix readPix() {
return Leptonica1.pixRead(absoluteFilePath);
}
}


@ -0,0 +1,87 @@
package com.knecon.fforesight.service.ocr.processor.model;
import static com.knecon.fforesight.service.ocr.processor.utils.ListSplittingUtils.formatIntervals;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import lombok.AccessLevel;
import lombok.NonNull;
import lombok.experimental.FieldDefaults;
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public final class PageBatch implements Comparable<PageBatch> {
@NonNull
List<Integer> lookup = new ArrayList<>();
@Override
public String toString() {
if (size() == 1) {
return String.format("%d", lookup.get(0));
}
List<String> intervals = formatIntervals(lookup);
if (intervals.size() > 4) {
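// Copy before appending: subList returns a view, and the source list may be unmodifiable.
intervals = new ArrayList<>(intervals.subList(0, 4));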
intervals.add("...");
}
return String.join(", ", intervals);
}
public void add(Integer pageNumber) {
lookup.add(pageNumber);
}
public void forEach(Consumer<? super Integer> consumer) {
lookup.forEach(consumer);
}
public List<Integer> getAllPageNumbers() {
return lookup;
}
public int size() {
return lookup.size();
}
public boolean isEmpty() {
return lookup.isEmpty();
}
public int getPageNumber(int pageNumber) {
return lookup.get(pageNumber - 1);
}
@Override
public int compareTo(PageBatch o) {
if (lookup.isEmpty() && o.lookup.isEmpty()) {
return 0;
} else if (lookup.isEmpty()) {
return 1;
} else if (o.lookup.isEmpty()) {
return -1;
}
return Integer.compare(lookup.get(0), o.lookup.get(0));
}
}


@ -0,0 +1,67 @@
package com.knecon.fforesight.service.ocr.processor.model;
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.knecon.fforesight.service.ocr.processor.utils.DocumentTextExtractor;
import com.pdftron.pdf.PDFDoc;
import com.pdftron.pdf.Page;
import com.pdftron.pdf.PageIterator;
import com.pdftron.pdf.Rect;
import lombok.SneakyThrows;
public record PageInformation(Rectangle2D mediabox, int number, int rotationDegrees, List<Rectangle2D> wordBBoxes) {
@SneakyThrows
public static Map<Integer, PageInformation> fromPDFDoc(PDFDoc pdfDoc) {
ConcurrentHashMap<Integer, PageInformation> pageInformationMap = new ConcurrentHashMap<>();
int pageNumber = 1;
for (PageIterator iterator = pdfDoc.getPageIterator(); iterator.hasNext(); pageNumber++) {
Page page = iterator.next();
pageInformationMap.put(pageNumber, PageInformation.fromPage(pageNumber, page));
}
return pageInformationMap;
}
@SneakyThrows
public static PageInformation fromPage(int pageNum, Page page) {
try (Rect mediaBox = page.getCropBox()) {
return new PageInformation(new Rectangle2D.Double(mediaBox.getX1(), mediaBox.getY1(), mediaBox.getWidth(), mediaBox.getHeight()),
pageNum,
page.getRotation() * 90,
DocumentTextExtractor.getTextBBoxes(page));
}
}
public double height() {
return mediabox.getHeight();
}
public double width() {
return mediabox.getWidth();
}
public double minX() {
return mediabox.getX();
}
public double minY() {
return mediabox.getY();
}
}


@ -0,0 +1,138 @@
package com.knecon.fforesight.service.ocr.processor.model;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Stream;
import com.azure.ai.documentintelligence.models.DocumentSpan;
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.experimental.FieldDefaults;
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class SpanLookup<T> {
List<SpanAndObject> spans;
Function<T, DocumentSpan> mapperFunction;
public SpanLookup(Stream<T> spans, Function<T, DocumentSpan> mappingFunction) {
this.mapperFunction = mappingFunction;
this.spans = spans.map(o -> new SpanAndObject(this.mapperFunction.apply(o), o))
.sorted()
.toList();
}
public boolean containedInAnySpan(T span) {
if (spans.isEmpty()) {
return false;
}
var tr = new SpanAndObject(this.mapperFunction.apply(span), null);
int idx = findIdxOfFirstSmallerObject(tr);
if (idx < 0) {
return false;
}
return spans.get(idx).contains(mapperFunction.apply(span));
}
public List<T> findElementsContainedInSpan(DocumentSpan span) {
if (spans.isEmpty()) {
return Collections.emptyList();
}
var range = new SpanAndObject(span, null);
int idx = findIdxOfFirstSmallerObject(range);
if (idx < 0) {
return Collections.emptyList();
}
List<T> result = new LinkedList<>();
for (int i = idx; i < spans.size(); i++) {
if (range.contains(spans.get(i))) {
result.add(spans.get(i).object);
} else {
break;
}
}
return result;
}
private int findIdxOfFirstSmallerObject(SpanAndObject range) {
int idx = Collections.binarySearch(spans, range); // Returns: the index of the search key, if it is contained in the list; otherwise, (-(insertion point) - 1)
if (idx >= 0) {
return idx;
} else {
int insertionPoint = -(idx + 1);
if (insertionPoint == 0) {
return -1;
}
var lastSmaller = spans.get(insertionPoint - 1);
// Walk back over the entries that share lastSmaller's offset and return the first of them.
for (int resultIdx = insertionPoint - 2; resultIdx >= 0; resultIdx--) {
if (spans.get(resultIdx).compareTo(lastSmaller) != 0) {
return resultIdx + 1;
}
}
return 0;
}
}
@AllArgsConstructor
private class SpanAndObject implements Comparable<SpanAndObject> {
DocumentSpan range;
T object;
@Override
public int compareTo(SpanAndObject o) {
return Integer.compare(range.getOffset(), o.range.getOffset());
}
public int start() {
return range.getOffset();
}
public int end() {
return range.getOffset() + range.getLength();
}
public boolean contains(SpanAndObject other) {
return this.start() <= other.start() && other.end() <= this.end();
}
public boolean contains(DocumentSpan span) {
return this.contains(new SpanAndObject(span, null));
}
}
}


@ -0,0 +1,226 @@
package com.knecon.fforesight.service.ocr.processor.model;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import com.knecon.fforesight.service.ocr.processor.service.BatchStats;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@FieldDefaults(level = AccessLevel.PRIVATE)
public class Statistics {
@Getter
final Map<PageBatch, BatchStats> batchStats = new ConcurrentHashMap<>();
@Getter
long imageExtractionDuration;
long drawingPDFDuration;
long timestamp;
final AtomicLong totalUploadedBytes = new AtomicLong(0);
final AtomicLong totalUploadedPages = new AtomicLong(0);
long awaitingOcrDuration;
public void imageExtractionFinished() {
imageExtractionDuration = System.currentTimeMillis() - timestamp;
setTimestamp();
}
public BatchStats getBatchStats(PageBatch batch) {
// computeIfAbsent is atomic on a ConcurrentHashMap, unlike a containsKey/put/get sequence.
return batchStats.computeIfAbsent(batch, key -> new BatchStats());
}
@Override
public String toString() {
return String.format("""
uploaded pages: %d
uploaded bytes: %s
image extraction: %s
awaiting ocr result: %s
drawing pdf: %s
%d batch(es): %s
Per batch:
upload\\queued: %s
waiting for api: %s
mapping result: %s
batch rendering: %s""",
totalUploadedPages.get(),
humanizeBytes(totalUploadedBytes.get()),
humanizeDuration(imageExtractionDuration),
humanizeDuration(awaitingOcrDuration),
humanizeDuration(drawingPDFDuration),
batchStats.keySet().size(),
buildBatchesString(),
humanizeDurationArray(imageUpload()),
humanizeDurationArray(apiWaitDuration()),
humanizeDurationArray(mappingResultDuration()),
humanizeDurationArray(batchRenderDuration()));
}
public static String humanizeBytes(long bytes) {
if (bytes < 1024) {
return bytes + "B";
} else if (bytes < 1024 * 1024) {
return String.format("%.1fKB", (float) bytes / 1024);
} else if (bytes < 1024 * 1024 * 1024) {
return String.format("%.1fMB", (float) bytes / 1024 / 1024);
} else {
return String.format("%.1fGB", (float) bytes / 1024 / 1024 / 1024);
}
}
public static String humanizeDuration(long duration) {
if (duration < 1000) {
return duration + " ms";
} else if (duration < 60 * 1000) {
double seconds = duration / 1000.0;
return String.format("%.1f s", seconds);
} else if (duration < 60 * 60 * 1000) {
long minutes = duration / (60 * 1000);
long remainingMillis = duration % (60 * 1000);
double seconds = remainingMillis / 1000.0;
return String.format("%d:%.1f m", minutes, seconds);
} else {
long hours = duration / (60 * 60 * 1000);
long remainingMillis = duration % (60 * 60 * 1000);
long minutes = remainingMillis / (60 * 1000);
remainingMillis = remainingMillis % (60 * 1000);
double seconds = remainingMillis / 1000.0;
return String.format("%d:%d:%.1f h", hours, minutes, seconds);
}
}
private String buildBatchesString() {
return "(" + batchStats.keySet()
.stream()
.sorted()
.map(PageBatch::toString)
.collect(Collectors.joining("); (")) + ")";
}
private long[] batchRenderDuration() {
return batchStats.values()
.stream()
.mapToLong(BatchStats::getBatchRenderDuration)
.toArray();
}
public void increaseTotalBytes(PageBatch pageRange, long bytes) {
totalUploadedPages.addAndGet(pageRange.size());
totalUploadedBytes.addAndGet(bytes);
}
public void ocrFinished() {
awaitingOcrDuration = System.currentTimeMillis() - timestamp;
setTimestamp();
}
private long[] imageUpload() {
return batchStats.values()
.stream()
.mapToLong(BatchStats::getImageUploadDuration)
.toArray();
}
private long[] apiWaitDuration() {
return batchStats.values()
.stream()
.mapToLong(BatchStats::getApiWaitDuration)
.toArray();
}
private long[] mappingResultDuration() {
return batchStats.values()
.stream()
.mapToLong(BatchStats::getWritingTextDuration)
.toArray();
}
public String humanizeDurationArray(long[] timeArray) {
if (timeArray.length == 0) {
// no batches were recorded, avoid formatting NaN
return "n/a";
}
double avgMs = average(timeArray);
double stdDevMs = standardDeviation(timeArray, avgMs);
if (avgMs < 1000) {
return String.format("%.0f ms ± %.0f ms", avgMs, stdDevMs);
}
double avgSeconds = avgMs / 1000;
double stdDevSeconds = stdDevMs / 1000;
return String.format("%.1fs ± %.1fs", avgSeconds, stdDevSeconds);
}
private static double average(long[] times) {
double sum = 0;
for (long time : times) {
sum += time;
}
return sum / times.length;
}
private static double standardDeviation(long[] times, double mean) {
double sumSquaredDiffs = 0;
for (long time : times) {
double diff = time - mean;
sumSquaredDiffs += diff * diff;
}
return Math.sqrt(sumSquaredDiffs / times.length);
}
public void setTimestamp() {
this.timestamp = System.currentTimeMillis();
}
public void drawingPdfFinished() {
this.drawingPDFDuration = System.currentTimeMillis() - timestamp;
setTimestamp();
}
public void setStart() {
this.timestamp = System.currentTimeMillis();
}
}
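
The formatting helpers above can be tried in isolation. The following is a minimal sketch, not the service's code: the class name `DurationFormatSketch` is hypothetical, it pins `Locale.ROOT` so the decimal separator does not depend on the JVM default locale, and returning `"n/a"` for an empty array is an assumption about the desired behaviour.

```java
import java.util.Locale;

final class DurationFormatSketch {

    // Formats a duration in milliseconds as ms, seconds, or minutes:seconds.
    static String humanizeDuration(long millis) {
        if (millis < 1000) {
            return millis + " ms";
        }
        if (millis < 60_000) {
            return String.format(Locale.ROOT, "%.1f s", millis / 1000.0);
        }
        long minutes = millis / 60_000;
        double seconds = (millis % 60_000) / 1000.0;
        return String.format(Locale.ROOT, "%d:%.1f m", minutes, seconds);
    }

    // Arithmetic mean; the caller guarantees a non-empty array.
    static double average(long[] values) {
        double sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double standardDeviation(long[] values, double mean) {
        double sumSquaredDiffs = 0;
        for (long value : values) {
            double diff = value - mean;
            sumSquaredDiffs += diff * diff;
        }
        return Math.sqrt(sumSquaredDiffs / values.length);
    }

    // Formats "mean ± standard deviation"; "n/a" for an empty array (assumption).
    static String humanizeDurationArray(long[] values) {
        if (values.length == 0) {
            return "n/a";
        }
        double avgMs = average(values);
        double stdDevMs = standardDeviation(values, avgMs);
        if (avgMs < 1000) {
            return String.format(Locale.ROOT, "%.0f ms \u00b1 %.0f ms", avgMs, stdDevMs);
        }
        return String.format(Locale.ROOT, "%.1fs \u00b1 %.1fs", avgMs / 1000, stdDevMs / 1000);
    }

    public static void main(String[] args) {
        System.out.println(humanizeDuration(500));
        System.out.println(humanizeDuration(90_500));
        System.out.println(humanizeDurationArray(new long[]{100, 300}));
    }
}
```

Without an explicit locale, `String.format` uses the default locale, so the same duration may be logged as `1.5 s` or `1,5 s` depending on where the service runs.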

@ -0,0 +1,132 @@
package com.knecon.fforesight.service.ocr.processor.model;
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import com.azure.ai.documentintelligence.models.DocumentWord;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontMetrics;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontMetricsProvider;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontStyle;
import com.knecon.fforesight.service.ocr.v1.api.model.QuadPoint;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.Setter;
import lombok.experimental.FieldDefaults;
@Getter
@FieldDefaults(level = AccessLevel.PRIVATE)
public class TextPositionInImage {
final QuadPoint position;
final String text;
final AffineTransform imageCTM;
@Setter
boolean overlapsIgnoreZone;
@Setter
FontMetricsProvider fontMetricsProvider;
@Setter
FontStyle fontStyle;
public TextPositionInImage(DocumentWord word, AffineTransform imageCTM, FontMetricsProvider fontMetricsProvider, FontStyle fontStyle) {
this.position = QuadPoint.fromPolygons(word.getPolygon());
this.text = word.getContent();
this.imageCTM = imageCTM;
this.fontMetricsProvider = fontMetricsProvider;
this.fontStyle = fontStyle;
}
public QuadPoint getTransformedTextBBox() {
return position.getTransformed(imageCTM);
}
public AffineTransform getTextMatrix() {
FontMetrics metrics = fontMetricsProvider.calculateMetrics(text, getTransformedWidth(), getTransformedHeight());
// Matrix multiplication is from right to left:
// convert to image coords -> subtract descent -> scale height -> reverse imageCTM scaling -> translate to coordinates in image -> convert to pdf coords
// width must not be set, since it is scaled with the fontsize attribute
double rotation = position.angle();
Point2D anchor = new Point2D.Double(position.b().getX(), position.b().getY());
AffineTransform ctm = new AffineTransform();
ctm.concatenate(imageCTM);
ctm.translate(anchor.getX(), anchor.getY());
ctm.scale(getWidth() / getTransformedWidth(),
getHeight() / getTransformedHeight()); // scale with transformation coefficient, such that fontsize may be set with transformed width.
ctm.rotate(rotation);
ctm.scale(1, metrics.getHeightScaling());
ctm.translate(0, metrics.getDescent());
ctm.concatenate(new AffineTransform(1, 0, 0, -1, 0, 0)); // start in image coordinates, with (0,0) being top left and negative height.
return ctm;
}
public double getFontSize() {
// The fontsize as estimated by the word width
return fontMetricsProvider.calculateFontSize(text, getTransformedWidth());
}
public double getTransformedWidth() {
return transformedA().distance(transformedD());
}
public double getTransformedHeight() {
return transformedA().distance(transformedB());
}
public double getWidth() {
return position.a().distance(position.d());
}
public double getFontSizeByHeight() {
// The fontsize as estimated by the word height, only used for font style detection
var metrics = fontMetricsProvider.calculateMetrics(text, getTransformedWidth(), getTransformedHeight());
return fontMetricsProvider.calculateFontSize(text, getTransformedWidth()) * metrics.getHeightScaling();
}
public double getHeight() {
return position.a().distance(position.b());
}
public Point2D transformedA() {
return imageCTM.transform(position.a(), null);
}
public Point2D transformedB() {
return imageCTM.transform(position.b(), null);
}
public Point2D transformedC() {
return imageCTM.transform(position.c(), null);
}
public Point2D transformedD() {
return imageCTM.transform(position.d(), null);
}
}
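
The comment in `getTextMatrix` about right-to-left multiplication can be checked with a small standalone example. This is a sketch with arbitrary numbers and a hypothetical class name (`TransformOrderSketch`); it only demonstrates how `java.awt.geom.AffineTransform` orders concatenated operations and makes no claim about the actual text matrix.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class TransformOrderSketch {

    // Each call concatenates on the right, so the transform added last
    // is the first one applied to a point.
    static Point2D apply(double x, double y) {
        AffineTransform ctm = new AffineTransform();
        ctm.translate(10, 20); // added first, applied to the point last
        ctm.scale(2, 3);       // added last, applied to the point first
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // (1,1) -> scale -> (2,3) -> translate -> (12,23)
        Point2D p = apply(1, 1);
        System.out.println(p.getX() + ", " + p.getY());
    }
}
```

Reading the calls in `getTextMatrix` from bottom to top therefore gives the order in which a glyph coordinate is actually transformed.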

@ -0,0 +1,189 @@
package com.knecon.fforesight.service.ocr.processor.model;
import static java.lang.String.format;
import java.util.Collection;
import com.azure.ai.documentintelligence.models.DocumentSpan;
import lombok.EqualsAndHashCode;
import lombok.Setter;
/**
* Represents a range of text defined by a start and end index.
* Provides functionality to check containment, intersection, and to adjust ranges based on specified conditions.
*/
@Setter
@EqualsAndHashCode
@SuppressWarnings("PMD.AvoidFieldNameMatchingMethodName")
public class TextRange implements Comparable<TextRange> {
private int start;
private int end;
public static TextRange fromDocumentSpan(DocumentSpan span) {
return new TextRange(span.getOffset(), span.getOffset() + span.getLength());
}
/**
* Constructs a TextRange with specified start and end indexes.
*
* @param start The starting index of the range.
* @param end The ending index of the range.
* @throws IllegalArgumentException If start is greater than end.
*/
public TextRange(int start, int end) {
if (start > end) {
throw new IllegalArgumentException(format("start: %d > end: %d", start, end));
}
this.start = start;
this.end = end;
}
/**
* Returns the length of the text range.
*
* @return The length of the range.
*/
public int length() {
return end - start;
}
public int start() {
return start;
}
public int end() {
return end;
}
/**
* Checks if this {@link TextRange} fully contains another TextRange.
*
* @param textRange The {@link TextRange} to check.
* @return true if this range contains the specified range, false otherwise.
*/
public boolean contains(TextRange textRange) {
return start <= textRange.start() && textRange.end() <= end;
}
/**
* Checks if this {@link TextRange} is fully contained by another TextRange.
*
* @param textRange The {@link TextRange} to check against.
* @return true if this range is contained by the specified range, false otherwise.
*/
public boolean containedBy(TextRange textRange) {
return textRange.contains(this);
}
/**
* Checks if this {@link TextRange} contains another range specified by start and end indices.
*
* @param start The starting index of the range to check.
* @param end The ending index of the range to check.
* @return true if this range fully contains the specified range, false otherwise.
* @throws IllegalArgumentException If the start index is greater than the end index.
*/
public boolean contains(int start, int end) {
if (start > end) {
throw new IllegalArgumentException(format("start: %d > end: %d", start, end));
}
return this.start <= start && end <= this.end;
}
/**
* Checks if this {@link TextRange} is fully contained within another range specified by start and end indices.
*
* @param start The starting index of the outer range.
* @param end The ending index of the outer range.
* @return true if this range is fully contained within the specified range, false otherwise.
* @throws IllegalArgumentException If the start index is greater than the end index.
*/
public boolean containedBy(int start, int end) {
if (start > end) {
throw new IllegalArgumentException(format("start: %d > end: %d", start, end));
}
return start <= this.start && this.end <= end;
}
/**
* Determines if the specified index is within this {@link TextRange}.
*
* @param index The index to check.
* @return true if the index is within the range (inclusive of the start and exclusive of the end), false otherwise.
*/
public boolean contains(int index) {
return start <= index && index < end;
}
/**
* Checks if this {@link TextRange} intersects with another {@link TextRange}.
*
* @param textRange The {@link TextRange} to check for intersection.
* @return true if the ranges intersect, false otherwise.
*/
public boolean intersects(TextRange textRange) {
return textRange.start() < this.end && this.start < textRange.end();
}
/**
* Merges a collection of TextRanges into a single TextRange encompassing all of them.
*
* @param boundaries The collection of TextRanges to merge.
* @return A new TextRange covering the entire span of the given ranges.
* @throws IllegalArgumentException If boundaries are empty.
*/
public static TextRange merge(Collection<TextRange> boundaries) {
int minStart = boundaries.stream()
.mapToInt(TextRange::start)
.min()
.orElseThrow(IllegalArgumentException::new);
int maxEnd = boundaries.stream()
.mapToInt(TextRange::end)
.max()
.orElseThrow(IllegalArgumentException::new);
return new TextRange(minStart, maxEnd);
}
@Override
public String toString() {
return format("Boundary [%d|%d)", start, end);
}
@Override
public int compareTo(TextRange textRange) {
return Integer.compare(start, textRange.start());
}
}
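
The half-open semantics of `TextRange` (start inclusive, end exclusive) are easiest to see on adjacent ranges. The sketch below uses a simplified stand-in record named `Range`, which is hypothetical and not part of the service; it mirrors only `contains(int)`, `intersects` and `merge`.

```java
import java.util.List;

final class HalfOpenRangeSketch {

    // Simplified stand-in for TextRange: start inclusive, end exclusive.
    record Range(int start, int end) {

        Range {
            if (start > end) {
                throw new IllegalArgumentException("start: " + start + " > end: " + end);
            }
        }

        boolean contains(int index) {
            return start <= index && index < end;
        }

        // Adjacent ranges such as [0,5) and [5,10) share no index and do not intersect.
        boolean intersects(Range other) {
            return other.start < end && start < other.end;
        }

        // Smallest range covering all given ranges; throws on an empty list.
        static Range merge(List<Range> ranges) {
            int minStart = ranges.stream().mapToInt(Range::start).min().orElseThrow(IllegalArgumentException::new);
            int maxEnd = ranges.stream().mapToInt(Range::end).max().orElseThrow(IllegalArgumentException::new);
            return new Range(minStart, maxEnd);
        }
    }

    public static void main(String[] args) {
        Range a = new Range(0, 5);
        Range b = new Range(5, 10);
        System.out.println(a.contains(5));
        System.out.println(a.intersects(b));
        System.out.println(Range.merge(List.of(a, b)));
    }
}
```

Note that merging covers gaps as well: merging `[0,2)` and `[8,10)` yields `[0,10)`, not the union of the two.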

@ -0,0 +1,177 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.springframework.stereotype.Service;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.azure.core.util.BinaryData;
import com.azure.core.util.polling.LongRunningOperationStatus;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.PageInformation;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingSupervisor;
import com.knecon.fforesight.service.ocr.processor.visualizations.layers.LayerFactory;
import com.knecon.fforesight.service.ocr.processor.visualizations.layers.OcrResult;
import com.pdftron.common.PDFNetException;
import com.pdftron.pdf.PDFDoc;
import com.pdftron.sdf.SDFDoc;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
import reactor.core.publisher.Mono;
@Slf4j
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class AsyncOcrService {
AzureOcrResource azureOcrResource;
OcrServiceSettings settings;
public OcrResult awaitOcr(PDFDoc pdfDoc,
OcrExecutionSupervisor supervisor,
Set<Integer> pagesWithImages,
ImageProcessingSupervisor imageSupervisor) throws InterruptedException, PDFNetException {
LayerFactory layerFactory = new LayerFactory(settings, supervisor, imageSupervisor, PageInformation.fromPDFDoc(pdfDoc));
List<PageBatch> batches = splitIntoBatches(pdfDoc, supervisor, pagesWithImages);
for (PageBatch batch : batches) {
if (batch.isEmpty()) {
continue;
}
BatchContext batchContext = new BatchContext(layerFactory, supervisor, batch);
supervisor.requireNoErrors();
batchContext.batchStats().start();
BinaryData data = renderBatch(pdfDoc, batch);
batchContext.batchStats().batchRenderFinished();
beginAnalysis(data, batchContext);
}
supervisor.awaitAllPagesProcessed();
return layerFactory.getLayers();
}
private static BinaryData renderBatch(PDFDoc pdfDoc, PageBatch batch) throws PDFNetException {
BinaryData docData;
try (var smallerDoc = extractBatchDocument(pdfDoc, batch)) {
docData = BinaryData.fromBytes(smallerDoc.save(SDFDoc.SaveMode.LINEARIZED, null));
}
return docData;
}
private List<PageBatch> splitIntoBatches(PDFDoc pdfDoc, OcrExecutionSupervisor supervisor, Set<Integer> pagesWithImages) throws PDFNetException {
List<PageBatch> batches = new ArrayList<>();
PageBatch currentBatch = new PageBatch();
batches.add(currentBatch);
for (int pageNumber = 1; pageNumber <= pdfDoc.getPageCount(); pageNumber++) {
if (!settings.isProcessAllPages() && !pagesWithImages.contains(pageNumber)) {
supervisor.logPageSkipped(pageNumber);
continue;
}
currentBatch.add(pageNumber);
if (currentBatch.size() == settings.getBatchSize()) {
currentBatch = new PageBatch();
batches.add(currentBatch);
}
}
return batches;
}
private void beginAnalysis(BinaryData data, BatchContext batchContext) throws InterruptedException {
batchContext.supervisor.enterConcurrency(batchContext.batch);
batchContext.supervisor.logUploadStart(batchContext.batch, data.getLength());
azureOcrResource.callAzureAsync(data)
.flatMap(response -> {
if (response.getStatus().equals(LongRunningOperationStatus.IN_PROGRESS)) {
batchContext.supervisor.logInProgress(batchContext.batch);
}
if (!response.getStatus().isComplete()) {
return Mono.empty();
}
if (LongRunningOperationStatus.SUCCESSFULLY_COMPLETED == response.getStatus()) {
return response.getFinalResult();
}
return Mono.error(new IllegalStateException("Polling completed unsuccessfully with status: " + response.getStatus()));
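}).subscribe(finalResult -> handleSuccessful(finalResult, batchContext),//
ex -> {
// onError and onComplete are mutually exclusive, so the concurrency slot must be released here as well
handleError(ex, batchContext);
batchContext.supervisor.leaveConcurrency(batchContext.batch);
},//
() -> handleCompleted(batchContext));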
}
private static void handleCompleted(BatchContext batchContext) {
batchContext.supervisor.leaveConcurrency(batchContext.batch);
}
private void handleError(Throwable ex, BatchContext batchContext) {
batchContext.supervisor.logPageError(batchContext.batch, ex);
}
private void handleSuccessful(AnalyzeResult finalResult, BatchContext batchContext) {
try {
batchContext.layerFactory.addAnalyzeResult(batchContext.batch, finalResult);
batchContext.supervisor.logPageSuccess(batchContext.batch);
} catch (Exception e) {
handleError(e, batchContext);
}
}
private static PDFDoc extractBatchDocument(PDFDoc pdfDoc, PageBatch pageBatch) throws PDFNetException {
if (pageBatch.isEmpty()) {
throw new IllegalArgumentException("Cannot extract a document from an empty batch");
}
PDFDoc batchDoc = new PDFDoc();
pageBatch.forEach(pageNumber -> addPageToNewDoc(pageNumber, pdfDoc, batchDoc));
return batchDoc;
}
@SneakyThrows
private static void addPageToNewDoc(Integer pageNumber, PDFDoc pdfDoc, PDFDoc singlePagePdfDoc) {
singlePagePdfDoc.pagePushBack(pdfDoc.getPage(pageNumber));
}
private record BatchContext(LayerFactory layerFactory, OcrExecutionSupervisor supervisor, PageBatch batch) {
BatchStats batchStats() {
return supervisor.getStatistics().getBatchStats(batch);
}
}
}
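
The batching logic in `splitIntoBatches` can be sketched without PDFTron. In the example below, plain `List<Integer>` stands in for `PageBatch`, and the class and parameter names are hypothetical. Unlike the original, this variant adds a batch to the result only once it contains pages, so no trailing empty batch needs to be skipped later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PageBatchingSketch {

    // Groups the 1-based page numbers that contain images into batches of at most batchSize pages.
    static List<List<Integer>> splitIntoBatches(int pageCount, Set<Integer> pagesWithImages, int batchSize) {
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            if (!pagesWithImages.contains(page)) {
                continue; // nothing to OCR on this page
            }
            current.add(page);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // last, partially filled batch
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(splitIntoBatches(7, Set.of(1, 2, 4, 5, 7), 2));
    }
}
```

Batches hold page numbers rather than contiguous ranges, so skipped pages simply do not appear in any batch.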

@ -0,0 +1,84 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.util.ArrayList;
import java.util.List;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;
import com.azure.ai.documentintelligence.DocumentIntelligenceAsyncClient;
import com.azure.ai.documentintelligence.DocumentIntelligenceClientBuilder;
import com.azure.ai.documentintelligence.models.AnalyzeDocumentRequest;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.azure.ai.documentintelligence.models.AnalyzeResultOperation;
import com.azure.ai.documentintelligence.models.ContentFormat;
import com.azure.ai.documentintelligence.models.DocumentAnalysisFeature;
import com.azure.ai.documentintelligence.models.StringIndexType;
import com.azure.core.credential.AzureKeyCredential;
import com.azure.core.util.BinaryData;
import com.azure.core.util.polling.PollerFlux;
import com.google.common.base.Objects;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import lombok.AccessLevel;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class AzureOcrResource {
final OcrServiceSettings settings;
final DocumentIntelligenceAsyncClient asyncClient;
public AzureOcrResource(@Value("${azure.endpoint}") String azureEndpoint, @Value("${azure.key}") String azureKey, OcrServiceSettings settings) {
this.settings = settings;
this.asyncClient = new DocumentIntelligenceClientBuilder().credential(new AzureKeyCredential(azureKey)).endpoint(azureEndpoint).buildAsyncClient();
}
@SneakyThrows
public PollerFlux<AnalyzeResultOperation, AnalyzeResult> callAzureAsync(BinaryData data) {
AnalyzeDocumentRequest analyzeRequest = new AnalyzeDocumentRequest().setBase64Source(data.toBytes());
return asyncClient.beginAnalyzeDocument(getModelId(), null, null, StringIndexType.UTF16CODE_UNIT, buildFeatures(), null, buildContentFormat(), analyzeRequest);
}
private ContentFormat buildContentFormat() {
if (Objects.equal(settings.getContentFormat(), "markdown")) {
return ContentFormat.MARKDOWN;
}
return ContentFormat.TEXT;
}
private String getModelId() {
if (settings.isIdpEnabled()) {
return "prebuilt-layout";
}
return "prebuilt-read";
}
private List<DocumentAnalysisFeature> buildFeatures() {
var features = new ArrayList<DocumentAnalysisFeature>();
if (settings.isIdpEnabled()) {
features.add(DocumentAnalysisFeature.KEY_VALUE_PAIRS);
}
features.add(DocumentAnalysisFeature.BARCODES);
return features;
}
}

@ -0,0 +1,64 @@
package com.knecon.fforesight.service.ocr.processor.service;
import lombok.AccessLevel;
import lombok.experimental.FieldDefaults;
@FieldDefaults(level = AccessLevel.PRIVATE)
public class BatchStats {
private long startTimestamp = -1;
private long apiWaitTimestamp = -1;
private long imageUploadTimestamp = -1;
private long writingTextTimestamp = -1;
private long batchRenderTimestamp = -1;
public void start() {
startTimestamp = System.currentTimeMillis();
}
public void batchRenderFinished() {
batchRenderTimestamp = System.currentTimeMillis();
}
public void finishUpload() {
imageUploadTimestamp = System.currentTimeMillis();
}
public void finishApiWait() {
apiWaitTimestamp = System.currentTimeMillis();
}
public void finishWritingText() {
writingTextTimestamp = System.currentTimeMillis();
}
public boolean isUploadFinished() {
return imageUploadTimestamp > 0;
}
public long getApiWaitDuration() {return this.apiWaitTimestamp - imageUploadTimestamp;}
public long getImageUploadDuration() {return this.imageUploadTimestamp - batchRenderTimestamp;}
public long getWritingTextDuration() {return this.writingTextTimestamp - apiWaitTimestamp;}
public long getBatchRenderDuration() {return this.batchRenderTimestamp - startTimestamp;}
}

@ -0,0 +1,68 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.io.File;
import java.io.FileInputStream;
import java.nio.file.Files;
import org.springframework.stereotype.Service;
import com.iqser.red.service.persistence.service.v1.api.shared.model.dossiertemplate.dossier.file.FileType;
import com.iqser.red.storage.commons.service.StorageService;
import com.knecon.fforesight.service.ocr.v1.api.model.DocumentRequest;
import com.knecon.fforesight.tenantcommons.TenantContext;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@RequiredArgsConstructor
public class FileStorageService {
private final StorageService storageService;
public static String getStorageId(String dossierId, String fileId, FileType fileType) {
return dossierId + "/" + fileId + "." + fileType.name() + fileType.getExtension();
}
@SneakyThrows
public void storeFiles(DocumentRequest request, File documentFile, File viewerDocumentFile, File analyzeResultFile) {
try (var in = new FileInputStream(viewerDocumentFile)) {
if (request.optionalViewerDocumentId().isPresent()) {
storageService.storeObject(TenantContext.getTenantId(), request.getViewerDocId(), in);
} else {
storageService.storeObject(TenantContext.getTenantId(), getStorageId(request.getDossierId(), request.getFileId(), FileType.VIEWER_DOCUMENT), in);
}
}
try (var in = new FileInputStream(documentFile)) {
if (request.optionalOriginDocumentId().isPresent()) {
storageService.storeObject(TenantContext.getTenantId(), request.getOriginDocumentId(), in);
} else {
storageService.storeObject(TenantContext.getTenantId(), getStorageId(request.getDossierId(), request.getFileId(), FileType.ORIGIN), in);
}
}
if (request.optionalIdpResultId().isPresent()) {
try (var in = new FileInputStream(analyzeResultFile)) {
storageService.storeObject(TenantContext.getTenantId(), request.getIdpResultId(), in);
}
}
}
@SneakyThrows
public void downloadFiles(DocumentRequest request, File documentFile) {
Files.createDirectories(documentFile.getParentFile().toPath());
String originDocumentId = request.optionalOriginDocumentId().orElse(getStorageId(request.getDossierId(), request.getFileId(), FileType.ORIGIN));
storageService.downloadTo(TenantContext.getTenantId(), originDocumentId, documentFile);
}
}

@ -0,0 +1,16 @@
package com.knecon.fforesight.service.ocr.processor.service;
import org.springframework.stereotype.Service;
@Service
public interface IOcrMessageSender {
void sendUpdate(String fileId, int finishedImages, int totalImages);
void sendOCRStarted(String fileId);
void sendOcrFinished(String fileId, int totalImages);
void sendOcrResponse(String dossierId, String fileId);
}

@ -0,0 +1,83 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.pdftron.common.PDFNetException;
import com.pdftron.pdf.Element;
import com.pdftron.pdf.ElementReader;
import com.pdftron.pdf.PDFDoc;
import lombok.SneakyThrows;
@Service
public class ImageDetectionService {
// images whose height and width are both at or below this threshold are ignored, see everyPointInDashedLineIsImage.pdf
private static final int PIXEL_THRESHOLD = 0;
private final OcrServiceSettings ocrServiceSettings;
public ImageDetectionService(OcrServiceSettings ocrServiceSettings) {this.ocrServiceSettings = ocrServiceSettings;}
@SneakyThrows
public Set<Integer> findPagesToProcess(PDFDoc pdfDoc) {
if (ocrServiceSettings.isProcessAllPages()) {
Set<Integer> pages = new HashSet<>();
for (int i = 1; i <= pdfDoc.getPageCount(); i++) {
pages.add(i);
}
return Collections.unmodifiableSet(pages);
}
return findPagesWithImages(pdfDoc);
}
private Set<Integer> findPagesWithImages(PDFDoc pdfDoc) throws PDFNetException {
Set<Integer> pagesWithImages = new HashSet<>();
try (ElementReader reader = new ElementReader()) {
for (int pageId = 1; pageId <= pdfDoc.getPageCount(); ++pageId) {
reader.begin(pdfDoc.getPage(pageId));
boolean imagesFound = findImagePositionsOnPage(reader);
reader.end();
if (imagesFound) {
pagesWithImages.add(pageId);
}
}
}
return Collections.unmodifiableSet(pagesWithImages);
}
private boolean findImagePositionsOnPage(ElementReader reader) throws PDFNetException {
Element element;
while ((element = reader.next()) != null) {
switch (element.getType()) {
case Element.e_image, Element.e_inline_image -> {
if (element.getImageHeight() > PIXEL_THRESHOLD || element.getImageWidth() > PIXEL_THRESHOLD) {
return true;
}
}
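case Element.e_form -> {
reader.formBegin();
// propagate the result of the nested search, otherwise images inside forms are never reported
boolean imagesFoundInForm = findImagePositionsOnPage(reader);
reader.end();
if (imagesFoundInForm) {
return true;
}
}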
}
}
return false;
}
}

@ -0,0 +1,157 @@
package com.knecon.fforesight.service.ocr.processor.service;
import static com.knecon.fforesight.service.ocr.processor.model.Statistics.humanizeDuration;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Set;
import org.springframework.stereotype.Service;
import com.iqser.red.pdftronlogic.commons.InvisibleElementRemovalService;
import com.iqser.red.pdftronlogic.commons.OCGWatermarkRemovalService;
import com.iqser.red.pdftronlogic.commons.WatermarkRemovalService;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.knecon.fforesight.service.ocr.processor.model.Statistics;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingPipeline;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingSupervisor;
import com.knecon.fforesight.service.ocr.processor.visualizations.layers.OcrResult;
import com.knecon.fforesight.service.viewerdoc.service.PDFTronViewerDocumentService;
import com.pdftron.pdf.PDFDoc;
import io.micrometer.observation.annotation.Observed;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OCRService {
IOcrMessageSender ocrMessageSender;
WatermarkRemovalService watermarkRemovalService;
InvisibleElementRemovalService invisibleElementRemovalService;
PDFTronViewerDocumentService viewerDocumentService;
ImageDetectionService imageDetectionService;
AsyncOcrService asyncOcrService;
OcrServiceSettings settings;
ImageProcessingPipeline imageProcessingPipeline;
/**
* Starts the OCR process: optionally removes watermarks, removes invisible elements,
* determines the pages to process, sends them in batches to the Azure Document Intelligence API
* and writes the recognized text into the document as invisible elements.
*
* @param dossierId Id of dossier
* @param fileId Id of file
* @param removeWatermark whether watermarks are removed before running OCR
* @param tmpDir working directory for all files
* @param documentFile the file to perform ocr on, results are written invisibly
* @param viewerDocumentFile debugging file, results are written visibly in an optional content group
* @param analyzeResultFile result file with additional information
*/
@Observed(name = "OCRService", contextualName = "run-ocr-on-document")
public void runOcrOnDocument(String dossierId, String fileId, boolean removeWatermark, Path tmpDir, File documentFile, File viewerDocumentFile, File analyzeResultFile) {
if (removeWatermark) {
removeWatermark(documentFile);
}
removeInvisibleElements(documentFile);
log.info("Starting OCR for file {}", fileId);
long ocrStart = System.currentTimeMillis();
Statistics stats = runOcr(tmpDir, documentFile, viewerDocumentFile, fileId, dossierId, analyzeResultFile).getStatistics();
long ocrEnd = System.currentTimeMillis();
log.info("OCR successful for file with dossierId {} and fileId {}, took {}", dossierId, fileId, humanizeDuration(ocrEnd - ocrStart));
if (settings.isDebug()) {
logRuntimeBreakdown(ocrEnd, ocrStart, stats);
}
}
private void logRuntimeBreakdown(long ocrEnd, long ocrStart, Statistics stats) {
log.info("Runtime breakdown: ");
log.info(" total time: {}", humanizeDuration(ocrEnd - ocrStart));
for (String s : stats.toString().split("\n")) {
log.info(" {}", s);
}
}
@SneakyThrows
private void removeInvisibleElements(File documentFile) {
Path tmpFile = Files.createTempFile("invisibleElements", ".pdf");
try (var in = new FileInputStream(documentFile); var out = new FileOutputStream(tmpFile.toFile())) {
invisibleElementRemovalService.removeInvisibleElements(in, out, false, false);
}
Files.copy(tmpFile, documentFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
// assertions are disabled by default, so the deletion must not live inside an assert
Files.deleteIfExists(tmpFile);
}
@SneakyThrows
private void removeWatermark(File originFile) {
Path tmpFile = Files.createTempFile("removeWatermarks", ".pdf");
try (var in = new FileInputStream(originFile); var out = new FileOutputStream(tmpFile.toFile())) {
watermarkRemovalService.removeWatermarks(in, out);
}
Files.copy(tmpFile, originFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
// assertions are disabled by default, so the deletion must not live inside an assert
Files.deleteIfExists(tmpFile);
}
@SneakyThrows
public OcrExecutionSupervisor runOcr(Path tmpDir, File documentFile, File viewerDocumentFile, String fileId, String dossierId, File analyzeResultFile) {
Path tmpImageDir = tmpDir.resolve("images");
Path azureOutputDir = tmpDir.resolve("azure_output");
Files.createDirectories(azureOutputDir);
Files.createDirectories(tmpImageDir);
try (var in = new FileInputStream(documentFile); PDFDoc pdfDoc = new PDFDoc(in)) {
OCGWatermarkRemovalService.removeWatermarks(pdfDoc);
OcrExecutionSupervisor supervisor = new OcrExecutionSupervisor(pdfDoc.getPageCount(), ocrMessageSender, fileId, settings);
supervisor.getStatistics().setStart();
Set<Integer> pagesWithImages = imageDetectionService.findPagesToProcess(pdfDoc);
ImageProcessingSupervisor imageSupervisor = null;
if (settings.isFontStyleDetection()) {
imageSupervisor = imageProcessingPipeline.run(pagesWithImages, tmpImageDir, documentFile);
}
supervisor.logImageExtractionFinished(pdfDoc.getPageCount(), pagesWithImages.size());
OcrResult ocrResult = asyncOcrService.awaitOcr(pdfDoc, supervisor, pagesWithImages, imageSupervisor);
viewerDocumentService.addLayerGroups(documentFile, documentFile, ocrResult.regularLayers());
viewerDocumentService.addLayerGroups(documentFile, viewerDocumentFile, ocrResult.debugLayers());
supervisor.getStatistics().drawingPdfFinished();
supervisor.sendFinished();
return supervisor;
}
}
}

@ -0,0 +1,158 @@
package com.knecon.fforesight.service.ocr.processor.service;
import static com.knecon.fforesight.service.ocr.processor.model.Statistics.humanizeBytes;
import static com.knecon.fforesight.service.ocr.processor.model.Statistics.humanizeDuration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.Statistics;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OcrExecutionSupervisor {
@Getter
int totalPageCount;
@Getter
Statistics statistics;
Set<PageBatch> errorPages;
CountDownLatch countDownPagesToProcess;
BlockingQueue<PageBatch> concurrencyCounter;
IOcrMessageSender ocrMessageSender;
String fileId;
public OcrExecutionSupervisor(int totalPageCount, IOcrMessageSender ocrMessageSender, String fileId, OcrServiceSettings settings) {
this.totalPageCount = totalPageCount;
this.ocrMessageSender = ocrMessageSender;
this.fileId = fileId;
this.errorPages = Collections.synchronizedSet(new HashSet<>());
this.countDownPagesToProcess = new CountDownLatch(totalPageCount);
this.statistics = new Statistics();
this.concurrencyCounter = new ArrayBlockingQueue<>(settings.getConcurrency());
}
public void enterConcurrency(PageBatch pageBatch) throws InterruptedException {
concurrencyCounter.put(pageBatch);
}
public void leaveConcurrency(PageBatch pageBatch) {
if (!concurrencyCounter.remove(pageBatch)) {
throw new AssertionError("Batch left concurrency it wasn't in. Should never happen!");
}
}
public void logImageExtractionFinished(int numberOfPages, int numberOfImages) {
statistics.imageExtractionFinished();
log.info("Images found on {}/{} pages in {}", numberOfImages, numberOfPages, humanizeDuration(statistics.getImageExtractionDuration()));
}
public void logUploadStart(PageBatch pageRange, long bytes) {
log.info("Start uploading pages {} with {}", pageRange, humanizeBytes(bytes));
// the batch timer is already started before rendering; restarting it here would make the render duration negative
statistics.increaseTotalBytes(pageRange, bytes);
}
public void logInProgress(PageBatch pageRange) {
if (!statistics.getBatchStats(pageRange).isUploadFinished()) {
log.info("Pages {} are in progress", pageRange);
statistics.getBatchStats(pageRange).finishUpload();
ocrMessageSender.sendUpdate(fileId, processedPages(), getTotalPageCount());
} else {
log.debug("Pages {} still in progress", pageRange);
}
}
public void finishMappingResult(PageBatch pageRange) {
pageRange.forEach(pageIndex -> countDownPagesToProcess.countDown());
statistics.getBatchStats(pageRange).finishWritingText();
ocrMessageSender.sendUpdate(fileId, this.processedPages(), getTotalPageCount());
}
public void logPageSkipped(Integer pageIndex) {
this.countDownPagesToProcess.countDown();
ocrMessageSender.sendUpdate(fileId, this.processedPages(), getTotalPageCount());
log.debug("{}/{}: No images to ocr on page {}", processedPages(), getTotalPageCount(), pageIndex);
}
public void logPageError(PageBatch batch, Throwable e) {
this.errorPages.add(batch);
batch.forEach(pageIndex -> this.countDownPagesToProcess.countDown());
ocrMessageSender.sendUpdate(fileId, this.processedPages(), getTotalPageCount());
log.error("{}/{}: Error occurred on pages {}", processedPages(), getTotalPageCount(), batch, e);
}
public void logPageSuccess(PageBatch batch) {
statistics.getBatchStats(batch).finishApiWait();
log.info("{}/{}: Finished OCR on pages {}", processedPages(), getTotalPageCount(), batch);
}
private int processedPages() {
return (int) (totalPageCount - countDownPagesToProcess.getCount());
}
public void requireNoErrors() {
if (!errorPages.isEmpty()) {
throw new IllegalStateException(String.format("Errors have occurred on pages %s", errorPages));
}
}
public void sendFinished() {
requireNoErrors();
log.info("{}/{}: Finished OCR on all pages", getTotalPageCount(), getTotalPageCount());
ocrMessageSender.sendOcrFinished(fileId, getTotalPageCount());
}
public void awaitAllPagesProcessed() throws InterruptedException {
countDownPagesToProcess.await();
statistics.ocrFinished();
requireNoErrors();
}
}
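
The supervisor above limits parallel batches with a bounded queue and tracks progress with a latch. The following is a minimal, self-contained sketch of that idea, not the service's actual class: `ConcurrencyGateSketch`, its method names, and the use of plain strings as batches are assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified stand-in for OcrExecutionSupervisor; batches are plain strings here.
class ConcurrencyGateSketch {

    private final BlockingQueue<String> slots;
    private final CountDownLatch remainingPages;
    private final int totalPages;

    ConcurrencyGateSketch(int totalPages, int concurrency) {
        this.totalPages = totalPages;
        // The queue capacity is the maximum number of batches in flight.
        this.slots = new ArrayBlockingQueue<>(concurrency);
        this.remainingPages = new CountDownLatch(totalPages);
    }

    // Blocks while `concurrency` batches are already in flight.
    void enter(String batch) {
        try {
            slots.put(batch);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a free slot", e);
        }
    }

    // Frees the slot and counts the latch down once per page of the batch.
    void leave(String batch, int pagesInBatch) {
        if (!slots.remove(batch)) {
            throw new IllegalStateException("Batch was never entered: " + batch);
        }
        for (int i = 0; i < pagesInBatch; i++) {
            remainingPages.countDown();
        }
    }

    int processedPages() {
        return (int) (totalPages - remainingPages.getCount());
    }

    int inFlight() {
        return slots.size();
    }
}
```

Using the queue as a counter rather than a `Semaphore` has the side effect that the set of batches currently in flight can be inspected, which is what makes the "left concurrency it wasn't in" check possible.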

View File

@ -0,0 +1,143 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.io.Closeable;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.Optional;
import java.util.OptionalDouble;
import org.apache.commons.math3.ml.clustering.Cluster;
import org.apache.commons.math3.ml.clustering.Clusterable;
import org.apache.commons.math3.ml.clustering.DBSCANClusterer;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontStyle;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.Type0FontMetricsProvider;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
import net.sourceforge.lept4j.Pix;
import net.sourceforge.lept4j.util.LeptUtils;
/**
* Implementation of the MOBDoB algorithm, as described in the paper
* <a href="http://mile.ee.iisc.ac.in/publications/softCopy/DocumentAnalysis/Sai_NCVPRIPG2013.pdf">Script Independent Detection of Bold Words in Multi Font-size Documents</a>
* <p>
* As a high-level overview: We cluster all text based on its font size and determine the cluster with the most words. This cluster is assumed to be regular text.
* We then estimate the average stroke width of that cluster by thinning all text to a single pixel and calculating the ratio of original to remaining pixels.
* (<a href="http://www.leptonica.org/papers/conn.pdf">Leptonica Documentation on thinning</a>)
* For each word we scale this average stroke width based on its font size compared to the most common font size.
* Using the scaled stroke width we do an opening operation.
* (<a href="https://en.wikipedia.org/wiki/Opening_(morphology)">Opening (Morphology)</a>).
* We then threshold the ratio of remaining pixels to determine whether a word is bold or not.
* <p>
* I did take some liberties, though. Firstly, the paper uses text height without ascender/descender height for the clustering, whereas I'm using the previously implemented font size estimation.
* Because that estimate is calculated from the text width, I'm also using the height scaling factor to scale the font size by the text height.
* The paper does not describe its clustering algorithm, so I've decided on DBSCAN due to its good runtime and the readily available implementation in Apache Commons Math.
* Moreover, the paper states that stroke width scales linearly with text height. I've come to the conclusion that this is not the case.
* It seems to scale with the square root of the text height, or at least this gave the best results for me.
*/
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class FontStyleDetector implements Closeable {
static double BOLD_THRESHOLD = 0.5;
StrokeWidthCalculator strokeWidthCalculator = new StrokeWidthCalculator();
List<WordImage> wordImages = new LinkedList<>();
DBSCANClusterer<WordImage> clusterer = new DBSCANClusterer<>(0.5, 1); // eps = 0.5: font sizes within 0.5 of each other are neighbors; minPts = 1
public void classifyWords() {
if (wordImages.isEmpty()) {
return;
}
List<Cluster<WordImage>> clusters = clusterer.cluster(wordImages);
if (clusters.size() <= 1) {
// does not make sense to analyze with only a single cluster
return;
}
Optional<Cluster<WordImage>> largestCluster = clusters.stream()
.max(Comparator.comparingInt(cluster -> cluster.getPoints().size()));
List<WordImage> wordsWithMostCommonFontsize = largestCluster.get().getPoints();
// both values are arithmetic means over the largest cluster
OptionalDouble averageFontSize = calculateStandardFontSize(wordsWithMostCommonFontsize);
if (averageFontSize.isEmpty()) {
return;
}
OptionalDouble averageStrokeWidth = calculateRegularStrokeWidth(wordsWithMostCommonFontsize);
if (averageStrokeWidth.isEmpty()) {
return;
}
for (WordImage wordImage : wordImages) {
double scaledStrokeWidth = scaleStrokeWidthByFontSize(wordImage, averageStrokeWidth.getAsDouble(), averageFontSize.getAsDouble());
if (strokeWidthCalculator.hasLargerStrokeWidth(wordImage.pix(), scaledStrokeWidth, BOLD_THRESHOLD)) {
wordImage.textPosition().setFontMetricsProvider(Type0FontMetricsProvider.BOLD_INSTANCE);
wordImage.textPosition().setFontStyle(FontStyle.BOLD);
} else {
wordImage.textPosition().setFontStyle(FontStyle.REGULAR);
}
}
}
public void add(TextPositionInImage textPosition, Pix wordImage, double fontsize) {
wordImages.add(new WordImage(textPosition, wordImage, fontsize));
}
private static double scaleStrokeWidthByFontSize(WordImage textPositionsAndWordImage, double standardStrokeWidth, double standardFontSize) {
double influenceOfFontSize = 1.0; // the paper states that stroke width scales exactly linearly with font size. This did not seem to be true for me. Maybe some of the preprocessing steps are affecting this.
double fontsizeScalingFactor = Math.sqrt(textPositionsAndWordImage.fontsize() / standardFontSize);
return standardStrokeWidth + (influenceOfFontSize * (fontsizeScalingFactor - 1) * standardStrokeWidth);
}
private static OptionalDouble calculateStandardFontSize(List<WordImage> wordsWithMostCommonFontsize) {
return wordsWithMostCommonFontsize.stream()
.mapToDouble(WordImage::fontsize)
.filter(Double::isFinite).average();
}
private OptionalDouble calculateRegularStrokeWidth(List<WordImage> wordsWithMostCommonFontsize) {
return wordsWithMostCommonFontsize.stream()
.mapToDouble(wordImage -> strokeWidthCalculator.calculate(wordImage.pix))
.filter(Double::isFinite).average();
}
public record WordImage(TextPositionInImage textPosition, Pix pix, double fontsize) implements Clusterable {
@Override
public double[] getPoint() {
return new double[]{fontsize};
}
}
@Override
public void close() {
strokeWidthCalculator.close();
wordImages.stream()
.map(WordImage::pix)
.forEach(LeptUtils::disposePix);
}
}
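
As a worked example of the square-root scaling used in `scaleStrokeWidthByFontSize`: with an influence of 1.0 the formula reduces to `standardStrokeWidth * sqrt(fontsize / standardFontSize)`. The sketch below restates the formula in a standalone class; the class name and the sample numbers are made up for illustration.

```java
// Hypothetical standalone restatement of the scaling formula; the numbers in main are invented.
class StrokeWidthScalingSketch {

    static double scale(double fontSize, double standardStrokeWidth, double standardFontSize, double influenceOfFontSize) {
        // square-root scaling instead of the linear scaling described in the paper
        double fontsizeScalingFactor = Math.sqrt(fontSize / standardFontSize);
        return standardStrokeWidth + (influenceOfFontSize * (fontsizeScalingFactor - 1) * standardStrokeWidth);
    }

    public static void main(String[] args) {
        // Regular text: font size 10 with an estimated stroke width of 3 px.
        // A heading at font size 40 is expected to have strokes of 3 * sqrt(4) = 6 px;
        // linear scaling would have predicted 12 px.
        System.out.println(scale(40, 3, 10, 1.0));
    }
}
```

With an influence of 0.0 the font size is ignored and every word is compared against the unscaled regular stroke width.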

View File

@ -0,0 +1,135 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@RequiredArgsConstructor
@FieldDefaults(level = AccessLevel.PRIVATE)
public class GhostScriptOutputHandler extends Thread {
static Pattern pageFinishedPattern = Pattern.compile("Page (\\d+)");
// If the stdError or stdOut buffer of a process is not being emptied, the process may block once the buffer is full, so we need to empty both streams to prevent a deadlock.
// Since both need to be read simultaneously, we implement the readers as separate threads.
final InputStream is;
final String processName;
final Type type;
final Map<Integer, ImageFile> pagesToProcess;
final Consumer<ImageFile> outputHandler;
final Consumer<String> errorHandler;
int currentPageNumber;
public static GhostScriptOutputHandler stdError(InputStream is, Consumer<String> errorHandler) {
return new GhostScriptOutputHandler(is, "GS", Type.ERROR, null, null, errorHandler);
}
public static GhostScriptOutputHandler stdOut(InputStream is, Map<Integer, ImageFile> pagesToProcess, Consumer<ImageFile> imageFileOutput, Consumer<String> errorHandler) {
return new GhostScriptOutputHandler(is, "GS", Type.STD_OUT, pagesToProcess, imageFileOutput, errorHandler);
}
@SneakyThrows
public void run() {
try (InputStreamReader isr = new InputStreamReader(is); BufferedReader br = new BufferedReader(isr)) {
String line;
while (true) {
line = br.readLine();
if (line == null) {
break;
}
if (type.equals(Type.ERROR)) {
log.error("{}_{}>{}", processName, type.name(), line);
} else {
log.debug("{}_{}>{}", processName, type.name(), line);
addProcessedImageToQueue(line);
}
}
}
is.close();
if (type.equals(Type.STD_OUT)) {
queueFinishedPage(currentPageNumber);
if (!pagesToProcess.isEmpty()) {
errorHandler.accept(String.format("Ghostscript finished for batch, but pages %s remain unprocessed.", formatPagesToProcess()));
}
}
}
private String formatPagesToProcess() {
var pages = new PageBatch();
pagesToProcess.keySet()
.forEach(pages::add);
return pages.toString();
}
private void addProcessedImageToQueue(String line) {
/*
Ghostscript prints the pageNumber it is currently working on, so we remember the current page and queue it as soon as the next comes in.
*/
Matcher pageNumberMatcher = pageFinishedPattern.matcher(line);
if (pageNumberMatcher.find()) {
int pageNumber = Integer.parseInt(pageNumberMatcher.group(1));
if (currentPageNumber == 0) {
currentPageNumber = pageNumber;
return;
}
queueFinishedPage(currentPageNumber);
currentPageNumber = pageNumber;
}
}
private void queueFinishedPage(int pageNumber) {
var imageFile = this.pagesToProcess.remove(pageNumber);
if (imageFile == null) {
errorHandler.accept(String.format("Page number %d does not exist in this thread. It only has page numbers %s", pageNumber, pagesToProcess.keySet()));
// there is no image for this page, so do not hand a null reference to the output handler
return;
}
if (!new File(imageFile.absoluteFilePath()).exists()) {
errorHandler.accept(String.format("Rendered page with number %d does not exist!", pageNumber));
}
outputHandler.accept(imageFile);
}
public enum Type {
ERROR,
STD_OUT
}
}
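
The handler above treats a page as finished once the next `Page N` line arrives, and flushes the last page when the stream ends. A minimal sketch of that bookkeeping, detached from Ghostscript and the file system, could look as follows; `PageProgressSketch` and its method names are hypothetical, and the sample lines in any usage are invented rather than real Ghostscript output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: remembers the page currently being rendered and reports it as
// finished when the next page starts or when the stream ends.
class PageProgressSketch {

    private static final Pattern PAGE_PATTERN = Pattern.compile("Page (\\d+)");

    private final List<Integer> finished = new ArrayList<>();
    private int currentPageNumber; // 0 means no page has been announced yet

    void onLine(String line) {
        Matcher matcher = PAGE_PATTERN.matcher(line);
        if (!matcher.find()) {
            return; // not a page announcement
        }
        int pageNumber = Integer.parseInt(matcher.group(1));
        if (currentPageNumber != 0) {
            finished.add(currentPageNumber);
        }
        currentPageNumber = pageNumber;
    }

    void onEndOfStream() {
        // The last announced page is only known to be done once the stream is closed.
        if (currentPageNumber != 0) {
            finished.add(currentPageNumber);
        }
    }

    List<Integer> finished() {
        return finished;
    }
}
```

Unlike the handler above, this sketch skips the final flush when no page was ever announced, which avoids reporting page 0.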

View File

@ -0,0 +1,165 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import com.knecon.fforesight.service.ocr.processor.utils.ListSplittingUtils;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
@SuppressWarnings("PMD") // can't figure out how to safely close the stdOut and stdError streams in line 142/144
public class GhostScriptService {
public static final int BATCH_SIZE = 256;
static String FORMAT = ".tiff";
static String DEVICE = "tiffgray";
static int DPI = 300;
static int PROCESS_COUNT = 1;
@SneakyThrows
public void renderPagesBatched(List<Integer> pagesToProcess,
String documentAbsolutePath,
Path tmpImageDir,
ImageProcessingSupervisor supervisor,
Consumer<ImageFile> successHandler,
Consumer<String> errorHandler) {
List<List<ProcessInfo>> processInfoBatches = buildSubListForEachProcess(pagesToProcess,
PROCESS_COUNT,
BATCH_SIZE
* PROCESS_COUNT); // GS has a limit on how many pageIndices per call are possible, so we limit it to 256 pages per process
for (int batchIdx = 0; batchIdx < processInfoBatches.size(); batchIdx++) {
supervisor.requireNoErrors();
List<ProcessInfo> processInfos = processInfoBatches.get(batchIdx);
log.info("Batch {}: Running {} gs processes with ({}) pages each",
batchIdx,
processInfos.size(),
processInfos.stream()
.map(info -> info.pageNumbers().size())
.map(String::valueOf)
.collect(Collectors.joining(", ")));
int finalBatchIdx = batchIdx;
List<Process> processes = processInfos.stream()
.parallel()
.map(info -> buildCmdArgs(info.processIdx(), finalBatchIdx, info.pageNumbers(), tmpImageDir, documentAbsolutePath))
.peek(s -> log.debug(String.join(" ", s.cmdArgs())))
.map(processInfo -> executeProcess(processInfo, successHandler, errorHandler))
.toList();
List<Integer> processExitCodes = new LinkedList<>();
for (Process process : processes) {
processExitCodes.add(process.waitFor());
}
log.info("Batch {}: Ghostscript processes finished with exit codes {}", batchIdx, processExitCodes);
}
}
private List<List<ProcessInfo>> buildSubListForEachProcess(List<Integer> stitchedPageNumbers, int processCount, int batchSize) {
// GhostScript command line can only handle so many page numbers at once, so we split it into batches
int batchCount = (int) Math.ceil((double) stitchedPageNumbers.size() / batchSize);
log.info("Splitting {} page renderings across {} process(es) in {} batch(es) with size {}", stitchedPageNumbers.size(), processCount, batchCount, batchSize);
List<List<ProcessInfo>> processInfoBatches = new ArrayList<>(batchCount);
List<List<List<Integer>>> batchedBalancedSublist = ListSplittingUtils.buildBatchedBalancedSublist(stitchedPageNumbers.stream()
.sorted()
.toList(), processCount, batchCount);
for (var batch : batchedBalancedSublist) {
List<ProcessInfo> processInfos = new ArrayList<>(processCount);
for (int threadIdx = 0; threadIdx < batch.size(); threadIdx++) {
List<Integer> balancedPageNumbersSubList = batch.get(threadIdx);
processInfos.add(new ProcessInfo(threadIdx, balancedPageNumbersSubList));
}
processInfoBatches.add(processInfos);
}
return processInfoBatches;
}
@SneakyThrows
private ProcessCmdsAndRenderedImageFiles buildCmdArgs(Integer processIdx,
Integer batchIdx,
List<Integer> stitchedImagePageIndices,
Path outputDir,
String documentAbsolutePath) {
String imagePathFormat = outputDir.resolve("output_" + processIdx + "_" + batchIdx + ".%04d" + FORMAT).toFile().toString();
Map<Integer, ImageFile> fullPageImages = new HashMap<>();
for (int i = 0; i < stitchedImagePageIndices.size(); i++) {
Integer pageNumber = stitchedImagePageIndices.get(i);
fullPageImages.put(pageNumber, new ImageFile(pageNumber, String.format(imagePathFormat, i + 1)));
}
String[] cmdArgs = buildCmdArgs(stitchedImagePageIndices, documentAbsolutePath, imagePathFormat);
return new ProcessCmdsAndRenderedImageFiles(cmdArgs, fullPageImages);
}
private String[] buildCmdArgs(List<Integer> pageNumbers, String documentAbsolutePath, String imagePathFormat) {
// comma-separated list of the page numbers to render
String pageList = pageNumbers.stream()
.map(String::valueOf)
.collect(Collectors.joining(","));
return new String[]{"gs", "-dNOPAUSE", "-sDEVICE=" + DEVICE, "-r" + DPI, "-sPageList=" + pageList, "-sOutputFile=" + imagePathFormat, documentAbsolutePath, "-c", "quit"};
}
@SneakyThrows
private Process executeProcess(ProcessCmdsAndRenderedImageFiles processInfo, Consumer<ImageFile> successHandler, Consumer<String> errorHandler) {
Process p = Runtime.getRuntime().exec(processInfo.cmdArgs());
InputStream stdOut = p.getInputStream();
GhostScriptOutputHandler stdOutLogger = GhostScriptOutputHandler.stdOut(stdOut, processInfo.renderedPageImageFiles(), successHandler, errorHandler);
InputStream stdError = p.getErrorStream();
GhostScriptOutputHandler stdErrorLogger = GhostScriptOutputHandler.stdError(stdError, errorHandler);
stdOutLogger.start();
stdErrorLogger.start();
return p;
}
private record ProcessCmdsAndRenderedImageFiles(String[] cmdArgs, Map<Integer, ImageFile> renderedPageImageFiles) {
}
private record ProcessInfo(Integer processIdx, List<Integer> pageNumbers) {
}
}
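
`executeProcess` starts one reader thread per output stream so that neither pipe can fill up and block the child process. The sketch below shows the same pattern on arbitrary `InputStream`s; `StreamDrainSketch` and the `OUT>`/`ERR>` prefixes are assumptions for illustration, and unlike the service it joins the reader threads before returning.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: drains two streams in parallel, one thread per stream.
class StreamDrainSketch {

    static List<String> drainConcurrently(InputStream stdOut, InputStream stdErr) {
        List<String> lines = Collections.synchronizedList(new ArrayList<>());
        Thread outReader = new Thread(() -> readAll(stdOut, "OUT", lines));
        Thread errReader = new Thread(() -> readAll(stdErr, "ERR", lines));
        outReader.start();
        errReader.start();
        try {
            outReader.join();
            errReader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lines;
    }

    private static void readAll(InputStream is, String prefix, List<String> sink) {
        // Closing the reader also closes the underlying stream.
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sink.add(prefix + ">" + line);
            }
        } catch (IOException e) {
            sink.add(prefix + ">read failed: " + e.getMessage());
        }
    }
}
```

With a real `Process`, the two arguments would be `process.getInputStream()` and `process.getErrorStream()`; the relative order of lines from the two streams is not deterministic.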

View File

@ -0,0 +1,51 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class ImageProcessingPipeline {
GhostScriptService ghostScriptService;
ImageProcessingService imageProcessingService;
@SneakyThrows
public ImageProcessingSupervisor run(Set<Integer> pageNumberSet, Path imageDir, File document) {
Path processedImageDir = imageDir.resolve("processed");
Path renderedImageDir = imageDir.resolve("rendered");
Files.createDirectories(renderedImageDir);
Files.createDirectories(processedImageDir);
List<Integer> pageNumbers = pageNumberSet.stream()
.sorted()
.toList();
ImageProcessingSupervisor supervisor = new ImageProcessingSupervisor(pageNumbers);
Consumer<ImageFile> renderingSuccessConsumer = imageFile -> imageProcessingService.addToProcessingQueue(imageFile, processedImageDir, supervisor);
Consumer<String> renderingErrorConsumer = supervisor::markError;
ghostScriptService.renderPagesBatched(pageNumbers, document.toString(), renderedImageDir, supervisor, renderingSuccessConsumer, renderingErrorConsumer);
return supervisor;
}
}

View File

@ -0,0 +1,108 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import lombok.AccessLevel;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
import net.sourceforge.lept4j.ILeptonica;
import net.sourceforge.lept4j.Leptonica1;
import net.sourceforge.lept4j.Pix;
import net.sourceforge.lept4j.util.LeptUtils;
@Slf4j
@Service
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class ImageProcessingService {
BlockingQueue<ProcessParams> queue = new LinkedBlockingQueue<>();
public ImageProcessingService() {
Thread queueConsumerThread = new Thread(() -> {
while (true) {
ProcessParams processParams;
try {
processParams = queue.take();
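} catch (InterruptedException e) {
// restore the interrupt flag and stop consuming instead of dying with an unchecked exception
log.warn("Image processing queue consumer was interrupted, stopping.");
Thread.currentThread().interrupt();
return;
}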
try {
process(processParams.unprocessedImage(), processParams.outputDir, processParams.supervisor());
} catch (Exception e) {
log.error(e.getMessage(), e);
}
}
});
queueConsumerThread.start();
}
@SneakyThrows
public void addToProcessingQueue(ImageFile unprocessedImage, Path outputDir, ImageProcessingSupervisor supervisor) {
queue.put(new ProcessParams(unprocessedImage, outputDir, supervisor));
}
@SneakyThrows
private void process(ImageFile unprocessedImage, Path outputDir, ImageProcessingSupervisor supervisor) {
supervisor.requireNoErrors();
synchronized (ImageProcessingSupervisor.class) {
// Leptonica is not thread safe, but is being called in WritableOcrResultFactory as well
Pix processedPix;
Pix pix = unprocessedImage.readPix();
String absoluteFilePath = outputDir.resolve(Path.of(unprocessedImage.absoluteFilePath()).getFileName()).toFile().toString();
processedPix = processPix(pix);
Leptonica1.pixWrite(absoluteFilePath, processedPix, ILeptonica.IFF_TIFF_PACKBITS);
LeptUtils.disposePix(pix);
LeptUtils.disposePix(processedPix);
ImageFile imageFile = new ImageFile(unprocessedImage.pageNumber(), absoluteFilePath);
supervisor.markPageFinished(imageFile);
}
}
@SneakyThrows
private Pix processPix(Pix pix) {
Pix binarized;
if (pix.d != 8) {
throw new UnsupportedOperationException(String.format("Unexpected pix format with bpp of %d", pix.d));
}
// Threshold to binary
if (pix.w < 100 || pix.h < 100) {
binarized = Leptonica1.pixThresholdToBinary(pix, 170);
} else {
binarized = Leptonica1.pixOtsuThreshOnBackgroundNorm(pix, null, 50, 50, 165, 10, 100, 5, 5, 0.1f, null);
if (binarized == null) { // Sometimes Otsu just fails, then we binarize directly
binarized = Leptonica1.pixThresholdToBinary(pix, 170);
}
}
return binarized;
}
private record ProcessParams(ImageFile unprocessedImage, Path outputDir, ImageProcessingSupervisor supervisor) {
}
}

View File

@ -0,0 +1,92 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@RequiredArgsConstructor
@FieldDefaults(level = AccessLevel.PRIVATE)
public class ImageProcessingSupervisor {
final Map<Integer, CountDownLatch> pageLatches;
final Map<Integer, ImageFile> images;
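final List<String> errors = Collections.synchronizedList(new ArrayList<>()); // written by the Ghostscript reader threads, read by the processing threads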
public ImageProcessingSupervisor(List<Integer> pageNumbers) {
this.pageLatches = Collections.synchronizedMap(new HashMap<>());
this.images = Collections.synchronizedMap(new HashMap<>());
for (Integer pageNumber : pageNumbers) {
pageLatches.put(pageNumber, new CountDownLatch(1));
}
}
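public void markPageFinished(ImageFile imageFile) {
log.debug("finished page: {}", imageFile.pageNumber());
// publish the image before releasing the latch, otherwise a waiting thread might wake up and read null
images.put(imageFile.pageNumber(), imageFile);
getPageLatch(imageFile.pageNumber()).countDown();
}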
private CountDownLatch getPageLatch(Integer pageNumber) {
if (pageNumber == null || !pageLatches.containsKey(pageNumber)) {
throw new IllegalArgumentException("awaiting non-existent page " + pageNumber);
}
return pageLatches.get(pageNumber);
}
public ImageFile awaitProcessedPage(Integer pageNumber) throws InterruptedException {
if (hasErrors()) {
return null;
}
getPageLatch(pageNumber).await();
return images.get(pageNumber);
}
private boolean hasErrors() {
return !errors.isEmpty();
}
public void markError(String errorMessage) {
this.errors.add(errorMessage);
}
public void awaitAll() throws InterruptedException {
for (CountDownLatch countDownLatch : pageLatches.values()) {
countDownLatch.await();
}
}
public void requireNoErrors() {
// GS will log
if (this.errors.isEmpty()) {
return;
}
throw new IllegalStateException("Error(s) occurred during image processing: " + String.join("\n", errors));
}
}
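
The supervisor hands finished pages from the processing thread to waiting consumers through one `CountDownLatch` per page. A minimal sketch of that handover is shown below; `PageLatchSketch` is a hypothetical stand-in that uses `ConcurrentHashMap` and plain path strings instead of the service's synchronized maps and `ImageFile`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical, simplified stand-in for ImageProcessingSupervisor; images are plain path strings.
class PageLatchSketch {

    private final Map<Integer, CountDownLatch> latches = new ConcurrentHashMap<>();
    private final Map<Integer, String> images = new ConcurrentHashMap<>();

    PageLatchSketch(Iterable<Integer> pageNumbers) {
        for (Integer pageNumber : pageNumbers) {
            latches.put(pageNumber, new CountDownLatch(1));
        }
    }

    void markPageFinished(int pageNumber, String imagePath) {
        images.put(pageNumber, imagePath);   // publish the result first
        latches.get(pageNumber).countDown(); // then release any waiting thread
    }

    // Blocks until the page has been marked as finished; returns null if interrupted.
    String awaitProcessedPage(int pageNumber) {
        try {
            latches.get(pageNumber).await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        return images.get(pageNumber);
    }
}
```

The order inside `markPageFinished` matters: a thread released by `countDown()` reads the map immediately, so the entry has to be present before the latch opens.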

View File

@ -0,0 +1,85 @@
package com.knecon.fforesight.service.ocr.processor.service.imageprocessing;
import static net.sourceforge.lept4j.ILeptonica.L_THIN_FG;
import java.io.Closeable;
import java.nio.IntBuffer;
import lombok.AccessLevel;
import lombok.experimental.FieldDefaults;
import net.sourceforge.lept4j.Leptonica1;
import net.sourceforge.lept4j.Pix;
import net.sourceforge.lept4j.Sela;
import net.sourceforge.lept4j.util.LeptUtils;
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class StrokeWidthCalculator implements Closeable {
Sela thinningSel = Leptonica1.selaMakeThinSets(1, 0);
/**
* Uses a series of sels (structuring elements) to thin all connected lines to a width of a single pixel. The ratio of the original pixel count to the thinned pixel count is then a good estimate of the stroke width in pixels.
* <a href="http://www.leptonica.org/papers/conn.pdf">Leptonica Documentation on thinning</a>
* Since the baseline is a stroke width of exactly one, we need to add 1 to the result.
*
* @param input binarized pix with text on it
* @return estimated stroke width in pixels
*/
public double calculate(Pix input) {
Pix thinned = Leptonica1.pixThinConnectedBySet(input, L_THIN_FG, thinningSel, 0);
IntBuffer thinnedPixelCount = IntBuffer.allocate(1);
Leptonica1.pixCountPixels(thinned, thinnedPixelCount, null);
IntBuffer pixelCount = IntBuffer.allocate(1);
Leptonica1.pixCountPixels(input, pixelCount, null);
LeptUtils.disposePix(thinned);
return (double) pixelCount.get() / thinnedPixelCount.get() + 1;
}
public boolean hasLargerStrokeWidth(Pix pix, double strokeWidth, double threshold) {
int roundedStrokeWidth = (int) Math.round(strokeWidth);
double roundingError = (roundedStrokeWidth - strokeWidth) / strokeWidth;
// add 1 to open a bit bigger than the estimated regular stroke width
Pix openedPix = Leptonica1.pixOpenBrick(null, pix, roundedStrokeWidth + 1, roundedStrokeWidth + 1);
double openedPixelDensity = calculatePixelDensity(openedPix);
double pixelDensity = calculatePixelDensity(pix);
LeptUtils.disposePix(openedPix);
return (openedPixelDensity * (1 + roundingError)) / pixelDensity > threshold;
}
private static double calculatePixelDensity(Pix pix) {
IntBuffer pixelCount = IntBuffer.allocate(1);
int result = Leptonica1.pixCountPixels(pix, pixelCount, null);
if (result == 0) {
return (double) pixelCount.get() / (pix.h * pix.w);
} else {
return -1;
}
}
@Override
public void close() {
LeptUtils.dispose(thinningSel);
}
}

View File

@ -0,0 +1,54 @@
package com.knecon.fforesight.service.ocr.processor.utils;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import com.iqser.red.pdftronlogic.commons.Converter;
import com.pdftron.pdf.Page;
import com.pdftron.pdf.TextExtractor;
import lombok.experimental.UtilityClass;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@UtilityClass
@SuppressWarnings("PMD.CloseResource") // using try-with-resources to ensure words and lines are closed bloats this code immensely
public class DocumentTextExtractor {
public List<Rectangle2D> getTextBBoxes(Page page) {
List<Rectangle2D> textBBoxes = new ArrayList<>();
try (var textExtractor = new TextExtractor()) {
textExtractor.begin(page);
try {
for (TextExtractor.Line line = textExtractor.getFirstLine(); line.isValid(); line = getNextLine(line)) {
for (TextExtractor.Word word = line.getFirstWord(); word.isValid(); word = getNextWord(word)) {
textBBoxes.add(Converter.toRectangle2D(word.getBBox()));
}
}
} catch (Exception e) {
log.warn("Could not get word dimension, {}", e.getMessage());
}
return textBBoxes;
}
}
private static TextExtractor.Word getNextWord(TextExtractor.Word word) {
TextExtractor.Word nextWord = word.getNextWord();
word.close();
return nextWord;
}
private static TextExtractor.Line getNextLine(TextExtractor.Line line) {
TextExtractor.Line newLine = line.getNextLine();
line.close();
return newLine;
}
}

View File

@ -0,0 +1,106 @@
package com.knecon.fforesight.service.ocr.processor.utils;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.IntStream;
import lombok.experimental.UtilityClass;
@UtilityClass
public class ListSplittingUtils {
public List<List<Integer>> buildBalancedContinuousSublist(Integer totalNumberOfEntries, int threadCount) {
return buildBalancedSublist(IntStream.range(0, totalNumberOfEntries)
.map(i -> i + 1).boxed()
.toList(), threadCount);
}
public <T> List<List<T>> buildBalancedSublist(List<T> entries, int threadCount) {
List<Integer> balancedEntryCounts = buildBalancedEntryCounts(entries.size(), threadCount);
List<List<T>> balancedSublist = new ArrayList<>(threadCount);
int startIdx = 0;
for (Integer numberOfEntriesPerThread : balancedEntryCounts) {
balancedSublist.add(entries.subList(startIdx, startIdx + numberOfEntriesPerThread));
startIdx += numberOfEntriesPerThread;
}
return balancedSublist;
}
public <T> List<List<List<T>>> buildBatchedBalancedSublist(List<T> entries, int threadCount, int batchCount) {
// batches -> threads -> entries
// note: the third argument is the number of batches each thread's entries are split into, not the size of a batch
List<List<List<T>>> batchedBalancedSubList = new LinkedList<>();
List<List<List<T>>> threadsWithBatches = buildBalancedSublist(entries, threadCount).stream()
.map(list -> buildBalancedSublist(list, batchCount))
.toList();
// swap first two dimensions
for (int batchIdx = 0; batchIdx < batchCount; batchIdx++) {
List<List<T>> threadEntriesPerBatch = new ArrayList<>(threadCount);
for (int threadIdx = 0; threadIdx < threadCount; threadIdx++) {
threadEntriesPerBatch.add(threadsWithBatches.get(threadIdx).get(batchIdx));
}
batchedBalancedSubList.add(threadEntriesPerBatch);
}
return batchedBalancedSubList;
}
public List<Integer> buildBalancedEntryCounts(int totalNumberOfEntries, int threadCount) {
List<Integer> numberOfPagesPerThread = new ArrayList<>(threadCount);
for (int i = 0; i < threadCount; i++) {
numberOfPagesPerThread.add(0);
}
int threadIdx;
for (int i = 0; i < totalNumberOfEntries; i++) {
threadIdx = i % threadCount;
numberOfPagesPerThread.set(threadIdx, numberOfPagesPerThread.get(threadIdx) + 1);
}
return numberOfPagesPerThread;
}
public static List<String> formatIntervals(List<Integer> sortedList) {
List<String> intervals = new ArrayList<>();
if (sortedList.isEmpty()) {
return intervals;
}
int start = sortedList.get(0);
int end = start;
for (int i = 1; i < sortedList.size(); i++) {
int current = sortedList.get(i);
if (current == end + 1) {
end = current;
} else {
intervals.add(formatInterval(start, end));
start = current;
end = start;
}
}
intervals.add(formatInterval(start, end));
return intervals;
}
private static String formatInterval(int start, int end) {
if (start == end) {
return String.valueOf(start);
} else {
return start + "-" + end;
}
}
}
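
`buildBalancedEntryCounts` distributes entries round-robin, so the first `total % threadCount` buckets receive one extra entry. The sketch below computes the same counts in closed form and uses them to cut a list into contiguous sublists; `BalancedSplitSketch` and its method names are hypothetical and not part of `ListSplittingUtils`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical closed-form variant of the balanced splitting shown above.
class BalancedSplitSketch {

    // Every bucket gets total / bucketCount entries; the first total % bucketCount buckets get one more.
    static List<Integer> balancedCounts(int totalEntries, int bucketCount) {
        int base = totalEntries / bucketCount;
        int remainder = totalEntries % bucketCount;
        List<Integer> counts = new ArrayList<>(bucketCount);
        for (int i = 0; i < bucketCount; i++) {
            counts.add(i < remainder ? base + 1 : base);
        }
        return counts;
    }

    // Cuts the list into contiguous views whose sizes follow balancedCounts.
    static <T> List<List<T>> split(List<T> entries, int bucketCount) {
        List<List<T>> result = new ArrayList<>(bucketCount);
        int start = 0;
        for (int count : balancedCounts(entries.size(), bucketCount)) {
            result.add(entries.subList(start, start + count));
            start += count;
        }
        return result;
    }
}
```

For example, ten entries over three buckets yield the counts 4, 3, 3, and splitting the pages 1 to 7 into three buckets yields [1, 2, 3], [4, 5], and [6, 7].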

View File

@ -0,0 +1,28 @@
package com.knecon.fforesight.service.ocr.processor.utils;
import org.apache.commons.lang3.StringUtils;
import lombok.experimental.UtilityClass;
@UtilityClass
public final class OsUtils {
private static final String SERVICE_NAME = "azure-ocr-service";
private static boolean isWindows() {
return StringUtils.containsIgnoreCase(System.getProperty("os.name"), "Windows");
}
public static String getTemporaryDirectory() {
String tmpdir = System.getProperty("java.io.tmpdir");
if (isWindows() && StringUtils.isNotBlank(tmpdir)) {
return tmpdir;
}
return "/tmp";
}
}

View File

@ -0,0 +1,16 @@
package com.knecon.fforesight.service.ocr.processor.visualizations;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.knecon.fforesight.service.ocr.v1.api.model.AzureAnalyzeResult;
import lombok.experimental.UtilityClass;
@UtilityClass
public class AnalyzeResultMapper {
public AzureAnalyzeResult map(AnalyzeResult analyzeResult) {
return null;
}
}

View File

@ -0,0 +1,28 @@
package com.knecon.fforesight.service.ocr.processor.visualizations;
import java.awt.geom.Line2D;
import java.util.Collections;
import java.util.List;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.v1.api.model.QuadPoint;
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@Getter
@Builder
@AllArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public final class WritableOcrResult {
int pageNumber;
@Builder.Default
List<TextPositionInImage> textPositionInImage = Collections.emptyList();
@Builder.Default
List<Line2D> tableLines = Collections.emptyList();
}

@ -0,0 +1,367 @@
package com.knecon.fforesight.service.ocr.processor.visualizations;
import java.awt.geom.AffineTransform;
import java.awt.geom.Line2D;
import java.awt.geom.Rectangle2D;
import java.nio.IntBuffer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.azure.ai.documentintelligence.models.BoundingRegion;
import com.azure.ai.documentintelligence.models.DocumentPage;
import com.azure.ai.documentintelligence.models.DocumentSpan;
import com.azure.ai.documentintelligence.models.DocumentStyle;
import com.azure.ai.documentintelligence.models.DocumentTable;
import com.azure.ai.documentintelligence.models.DocumentTableCell;
import com.azure.ai.documentintelligence.models.DocumentWord;
import com.azure.ai.documentintelligence.models.FontWeight;
import com.google.common.base.Functions;
import com.knecon.fforesight.service.ocr.processor.model.ImageFile;
import com.knecon.fforesight.service.ocr.processor.model.PageInformation;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.SpanLookup;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.FontStyleDetector;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingSupervisor;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontMetricsProvider;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontStyle;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.Type0FontMetricsProvider;
import com.knecon.fforesight.service.ocr.v1.api.model.QuadPoint;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.SneakyThrows;
import lombok.experimental.FieldDefaults;
import net.sourceforge.lept4j.Box;
import net.sourceforge.lept4j.Leptonica1;
import net.sourceforge.lept4j.Pix;
import net.sourceforge.lept4j.util.LeptUtils;
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class WritableOcrResultFactory {
FontMetricsProvider regularFont = Type0FontMetricsProvider.REGULAR_INSTANCE;
FontMetricsProvider boldFont = Type0FontMetricsProvider.BOLD_INSTANCE;
FontMetricsProvider italicFont = Type0FontMetricsProvider.ITALIC_INSTANCE;
FontMetricsProvider boldItalicFont = Type0FontMetricsProvider.BOLD_ITALIC_INSTANCE;
@Getter
Map<Integer, AffineTransform> pageCtms;
Map<Integer, PageInformation> pageInformation;
OcrServiceSettings settings;
ImageProcessingSupervisor imageSupervisor;
@SneakyThrows
public WritableOcrResultFactory(Map<Integer, PageInformation> pageInformation, OcrServiceSettings settings, ImageProcessingSupervisor imageSupervisor) {
this.pageInformation = pageInformation;
pageCtms = Collections.synchronizedMap(new HashMap<>());
this.settings = settings;
this.imageSupervisor = imageSupervisor;
}
public List<WritableOcrResult> buildOcrResultToWrite(AnalyzeResult analyzeResult, PageBatch pageOffset) throws InterruptedException {
List<WritableOcrResult> writableOcrResultList = new ArrayList<>();
Lookups lookups = getLookups(analyzeResult);
for (DocumentPage resultPage : analyzeResult.getPages()) {
PageInformation pageInformation = getPageInformation(getPageNumber(pageOffset, resultPage));
AffineTransform pageCtm = getPageCTM(pageInformation, resultPage.getWidth());
pageCtms.put(getPageNumber(pageOffset, resultPage), pageCtm);
List<TextPositionInImage> words = buildTextPositionsInImage(pageOffset, resultPage, pageCtm, lookups, pageInformation);
var builder = WritableOcrResult.builder().pageNumber(pageInformation.number()).textPositionInImage(words);
if (settings.isTableDetection()) {
builder.tableLines(getTableLines(analyzeResult, pageInformation, pageCtm));
}
writableOcrResultList.add(builder.build());
}
return writableOcrResultList;
}
private List<TextPositionInImage> buildTextPositionsInImage(PageBatch pageOffset,
DocumentPage resultPage,
AffineTransform pageCtm,
Lookups lookups,
PageInformation pageInformation) throws InterruptedException {
if (!settings.isFontStyleDetection()) {
return buildText(resultPage, pageCtm, lookups, pageInformation);
}
ImageFile imageFile = imageSupervisor.awaitProcessedPage(getPageNumber(pageOffset, resultPage));
if (imageFile == null) {
return buildText(resultPage, pageCtm, lookups, pageInformation);
}
synchronized (ImageProcessingSupervisor.class) {
return buildTextWithBoldDetection(resultPage, pageCtm, pageInformation, imageFile);
}
}
private static List<TextPositionInImage> buildTextWithBoldDetection(DocumentPage resultPage, AffineTransform pageCtm, PageInformation pageInformation, ImageFile imageFile) {
// Leptonica is not thread safe, but is being called in ImageProcessingService as well
Pix pageImage = imageFile.readPix();
List<TextPositionInImage> words = new ArrayList<>();
try (FontStyleDetector fontStyleDetector = new FontStyleDetector()) {
AffineTransform imageTransform = new AffineTransform();
double scalingFactor = pageImage.w / resultPage.getWidth();
imageTransform.scale(scalingFactor, scalingFactor);
for (DocumentWord word : resultPage.getWords()) {
TextPositionInImage textPosition = new TextPositionInImage(word, pageCtm, Type0FontMetricsProvider.REGULAR_INSTANCE, FontStyle.REGULAR);
if (intersectsIgnoreZone(pageInformation.wordBBoxes(), textPosition)) {
textPosition.setOverlapsIgnoreZone(true);
}
Pix wordImage = extractWordImage(word, imageTransform, pageImage);
IntBuffer pixelCount = IntBuffer.allocate(1);
Leptonica1.pixCountPixels(wordImage, pixelCount, null);
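if (pixelCount.get(0) > 3) {
fontStyleDetector.add(textPosition, wordImage, textPosition.getFontSizeByHeight());
} else {
// (nearly) empty clip: it is not handed to the detector, so release the native image here
LeptUtils.disposePix(wordImage);
}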
words.add(textPosition);
}
fontStyleDetector.classifyWords();
} finally {
LeptUtils.disposePix(pageImage);
}
return words;
}
private static Pix extractWordImage(DocumentWord word, AffineTransform imageTransform, Pix pageImage) {
Rectangle2D wordBBox = QuadPoint.fromPolygons(word.getPolygon()).getTransformed(imageTransform).getBounds2D();
Box box = new Box((int) wordBBox.getX(), (int) wordBBox.getY(), (int) wordBBox.getWidth(), (int) wordBBox.getHeight(), 1);
Pix wordImage = Leptonica1.pixClipRectangle(pageImage, box, null);
box.clear();
return wordImage;
}
private List<TextPositionInImage> buildText(DocumentPage resultPage, AffineTransform pageCtm, Lookups lookups, PageInformation pageInformation) {
return resultPage.getWords()
.stream()
.map(word -> buildTextPositionInImage(word, pageCtm, lookups))
.map(textPositionInImage -> markTextOverlappingIgnoreZone(textPositionInImage, pageInformation.wordBBoxes()))
.collect(Collectors.toList());
}
private static int getPageNumber(PageBatch pageOffset, DocumentPage resultPage) {
return pageOffset.getPageNumber(resultPage.getPageNumber());
}
private static Lookups getLookups(AnalyzeResult analyzeResult) {
if (analyzeResult.getStyles() == null || analyzeResult.getStyles().isEmpty()) {
return Lookups.empty();
}
SpanLookup<DocumentSpan> boldLookup = new SpanLookup<>(analyzeResult.getStyles()
.stream()
.filter(style -> Objects.equals(style.getFontWeight(), FontWeight.BOLD))
.map(DocumentStyle::getSpans)
.flatMap(Collection::stream), Function.identity());
SpanLookup<DocumentSpan> italicLookup = new SpanLookup<>(analyzeResult.getStyles()
.stream()
.filter(style -> Objects.equals(style.getFontStyle(),
com.azure.ai.documentintelligence.models.FontStyle.ITALIC))
.map(DocumentStyle::getSpans)
.flatMap(Collection::stream), Functions.identity());
SpanLookup<DocumentSpan> handWrittenLookup = new SpanLookup<>(analyzeResult.getStyles()
.stream()
.filter(documentStyle -> documentStyle.isHandwritten() != null && documentStyle.isHandwritten())
.map(DocumentStyle::getSpans)
.flatMap(Collection::stream), Functions.identity());
return new Lookups(boldLookup, italicLookup, handWrittenLookup);
}
private TextPositionInImage buildTextPositionInImage(DocumentWord dw, AffineTransform imageCTM, Lookups lookups) {
boolean bold = lookups.bold().containedInAnySpan(dw.getSpan());
boolean italic = lookups.italic().containedInAnySpan(dw.getSpan());
boolean handwritten = lookups.handwritten().containedInAnySpan(dw.getSpan());
FontStyle fontStyle;
FontMetricsProvider font;
if (handwritten) {
fontStyle = FontStyle.HANDWRITTEN;
font = regularFont;
} else if (italic && bold) {
fontStyle = FontStyle.BOLD_ITALIC;
font = boldItalicFont;
} else if (bold) {
fontStyle = FontStyle.BOLD;
font = boldFont;
} else if (italic) {
fontStyle = FontStyle.ITALIC;
font = italicFont;
} else {
fontStyle = FontStyle.REGULAR;
font = regularFont;
}
return new TextPositionInImage(dw, imageCTM, font, fontStyle);
}
private static List<Line2D> getTableLines(AnalyzeResult analyzeResult, PageInformation pageInformation, AffineTransform imageCTM) {
if (analyzeResult.getTables() == null || analyzeResult.getTables().isEmpty()) {
return Collections.emptyList();
}
return analyzeResult.getTables()
.stream()
.map(DocumentTable::getCells)
.flatMap(Collection::stream)
.map(DocumentTableCell::getBoundingRegions)
.flatMap(Collection::stream)
.filter(boundingRegion -> boundingRegion.getPageNumber() == pageInformation.number())
.map(BoundingRegion::getPolygon)
.map(QuadPoint::fromPolygons)
.map(qp -> qp.getTransformed(imageCTM))
.flatMap(QuadPoint::asLines)
.toList();
}
private static TextPositionInImage markTextOverlappingIgnoreZone(TextPositionInImage textPositionInImage, List<Rectangle2D> ignoreZones) {
if (intersectsIgnoreZone(ignoreZones, textPositionInImage)) {
textPositionInImage.setOverlapsIgnoreZone(true);
}
return textPositionInImage;
}
private static boolean intersectsIgnoreZone(List<Rectangle2D> ignoreZones, TextPositionInImage textPositionInImage) {
// the text bounding box does not depend on the ignore zone, so compute it once
Rectangle2D textBBox = textPositionInImage.getTransformedTextBBox().getBounds2D();
for (Rectangle2D ignoreZone : ignoreZones) {
if (textBBox.intersects(ignoreZone)) {
double intersectedArea = calculateIntersectedArea(textBBox, ignoreZone);
double textArea = textBBox.getWidth() * textBBox.getHeight();
if (intersectedArea / textArea > 0.5) {
return true;
}
double ignoreZoneArea = ignoreZone.getWidth() * ignoreZone.getHeight();
if (intersectedArea / ignoreZoneArea > 0.5) {
return true;
}
}
}
return false;
}
public static double calculateIntersectedArea(Rectangle2D r1, Rectangle2D r2) {
double xOverlap = Math.max(0, Math.min(r1.getMaxX(), r2.getMaxX()) - Math.max(r1.getMinX(), r2.getMinX()));
double yOverlap = Math.max(0, Math.min(r1.getMaxY(), r2.getMaxY()) - Math.max(r1.getMinY(), r2.getMinY()));
return xOverlap * yOverlap;
}
public static AffineTransform getPageCTM(PageInformation pageInformation, double imageWidth) {
double scalingFactor = calculateScalingFactor(imageWidth, pageInformation);
AffineTransform imageToCropBoxScaling = new AffineTransform(scalingFactor, 0, 0, scalingFactor, 0, 0);
AffineTransform mirrorMatrix = new AffineTransform(1, 0, 0, -1, 0, pageInformation.height());
AffineTransform rotationMatrix = switch (pageInformation.rotationDegrees()) {
case 90 -> new AffineTransform(0, 1, -1, 0, pageInformation.height(), 0);
case 180 -> new AffineTransform(-1, 0, 0, -1, pageInformation.width(), pageInformation.height());
case 270 -> new AffineTransform(0, -1, 1, 0, pageInformation.width() - pageInformation.height(), pageInformation.height());
default -> new AffineTransform();
};
// AffineTransform.concatenate() right-multiplies, so the transform concatenated last is applied to a point first.
// Effective order on an image point: scaling -> mirror -> rotation
AffineTransform resultMatrix = new AffineTransform();
resultMatrix.concatenate(rotationMatrix);
resultMatrix.concatenate(mirrorMatrix);
resultMatrix.concatenate(imageToCropBoxScaling);
return resultMatrix;
}
private static double calculateScalingFactor(double width, PageInformation pageInformation) {
// PDFBox always returns page height and width based on rotation
double pageWidth;
if (pageInformation.rotationDegrees() == 90 || pageInformation.rotationDegrees() == 270) {
pageWidth = pageInformation.height();
} else {
pageWidth = pageInformation.width();
}
return pageWidth / width;
}
@SneakyThrows
private PageInformation getPageInformation(Integer pageNumber) {
return pageInformation.get(pageNumber);
}
private record Lookups(SpanLookup<DocumentSpan> bold, SpanLookup<DocumentSpan> italic, SpanLookup<DocumentSpan> handwritten) {
public static Lookups empty() {
return new Lookups(new SpanLookup<>(Stream.empty(), Function.identity()),
new SpanLookup<>(Stream.empty(), Function.identity()),
new SpanLookup<>(Stream.empty(), Function.identity()));
}
}
}
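
The ignore-zone check in `intersectsIgnoreZone` treats a word as overlapping when the intersection covers more than half of either rectangle. A minimal sketch of that rule in isolation follows; the class name `IgnoreZoneOverlapSketch` and the method names are hypothetical, and only `java.awt.geom.Rectangle2D` is used.

```java
import java.awt.geom.Rectangle2D;

final class IgnoreZoneOverlapSketch {

    // Area of the intersection of two axis-aligned rectangles, 0 if they do not overlap.
    static double intersectedArea(Rectangle2D r1, Rectangle2D r2) {
        double xOverlap = Math.max(0, Math.min(r1.getMaxX(), r2.getMaxX()) - Math.max(r1.getMinX(), r2.getMinX()));
        double yOverlap = Math.max(0, Math.min(r1.getMaxY(), r2.getMaxY()) - Math.max(r1.getMinY(), r2.getMinY()));
        return xOverlap * yOverlap;
    }

    // True if the intersection covers more than 50% of the text box or more than 50% of the zone.
    static boolean overlapsMostly(Rectangle2D text, Rectangle2D zone) {
        if (!text.intersects(zone)) {
            return false;
        }
        double area = intersectedArea(text, zone);
        double textArea = text.getWidth() * text.getHeight();
        double zoneArea = zone.getWidth() * zone.getHeight();
        return area / textArea > 0.5 || area / zoneArea > 0.5;
    }

    public static void main(String[] args) {
        Rectangle2D word = new Rectangle2D.Double(0, 0, 10, 10);
        // the zone covers 60 of the word's 100 area units
        System.out.println(overlapsMostly(word, new Rectangle2D.Double(4, 0, 100, 100)));
        // the zone covers only 4 of the word's 100 area units
        System.out.println(overlapsMostly(word, new Rectangle2D.Double(8, 8, 100, 100)));
    }
}
```

Checking both ratios means that a small ignore zone lying entirely inside a large word box also counts as an overlap, not only a word that lies mostly inside a zone.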

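The composition order in `getPageCTM` is easy to get backwards, so here is a small sketch for the unrotated case only. It assumes image coordinates with the origin at the top left and page coordinates with the origin at the bottom left; the class name `PageCtmSketch` and its methods are hypothetical and leave out the rotation handling of the original.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class PageCtmSketch {

    // Builds an image -> page transform for an unrotated page: scale first, then flip the y axis.
    static AffineTransform unrotatedCtm(double pageWidth, double pageHeight, double imageWidth) {
        double scale = pageWidth / imageWidth;
        AffineTransform scaling = new AffineTransform(scale, 0, 0, scale, 0, 0);
        AffineTransform mirror = new AffineTransform(1, 0, 0, -1, 0, pageHeight);
        AffineTransform result = new AffineTransform();
        result.concatenate(mirror);  // concatenated first, applied to a point second
        result.concatenate(scaling); // concatenated last, applied to a point first
        return result;
    }

    static Point2D map(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // 600 x 800 page rendered 1200 pixels wide: image point (100, 200) maps to page point (50, 700)
        AffineTransform ctm = unrotatedCtm(600, 800, 1200);
        System.out.println(map(ctm, 100, 200));
    }
}
```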
@ -0,0 +1,17 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.fonts;
import lombok.AccessLevel;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@Getter
@AllArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public final class FontMetrics {
float descent; // descent is the part of the text which is below the baseline, e.g. the lower curve of a 'g'. https://en.wikipedia.org/wiki/Body_height_(typography)
float fontSize;
float heightScaling;
}

@ -0,0 +1,43 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.fonts;
import org.apache.pdfbox.pdmodel.font.PDFont;
import com.knecon.fforesight.service.viewerdoc.model.EmbeddableFont;
import lombok.SneakyThrows;
public interface FontMetricsProvider extends EmbeddableFont {
default FontMetrics calculateMetrics(String text, double textWidth, double textHeight) {
HeightAndDescent heightAndDescent = calculateHeightAndDescent(text);
float fontSize = calculateFontSize(text, textWidth);
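// NOTE: "/" binds tighter than "-", so the expression below evaluates as
// (textHeight - (maxDescent / 1000) / ((height - maxDescent - descent) * 1000)) / fontSize; confirm that this grouping is intended.
float heightScaling = (float) (textHeight - (getMaxDescent() / 1000) / ((heightAndDescent.height() - getMaxDescent() - heightAndDescent.descent()) * 1000)) / fontSize;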
return new FontMetrics((float) ((-getMaxDescent() * fontSize) + heightAndDescent.descent()) / 1000, fontSize, heightScaling);
}
@SneakyThrows
default float calculateFontSize(String text, double textWidth) {
float width;
try {
width = getFont().getStringWidth(text);
} catch (IllegalArgumentException e) {
// this means, the font has no glyph for this character
width = getFont().getAverageFontWidth() * text.length();
}
return (float) (textWidth / width) * 1000;
}
PDFont getFont();
HeightAndDescent calculateHeightAndDescent(String text);
double getMaxDescent();
}
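
`calculateFontSize` relies on PDFBox reporting string widths in 1/1000 of the font size, so the font size that makes a string exactly as wide as its OCR bounding box is the target width divided by that unit width, times 1000. A minimal arithmetic sketch without any PDFBox dependency follows; the class name `FontSizeSketch` and the parameter names are hypothetical.

```java
final class FontSizeSketch {

    // glyphWidthUnits: width of the string in 1/1000 of the font size, as a font metrics call would report it.
    // targetWidth: width of the OCR bounding box in page units.
    static float fontSizeForWidth(float glyphWidthUnits, double targetWidth) {
        return (float) (targetWidth / glyphWidthUnits) * 1000;
    }

    public static void main(String[] args) {
        // a string measuring 2000 units that must span 24 page units needs a font size of about 12
        System.out.println(fontSizeForWidth(2000f, 24.0));
    }
}
```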

@ -0,0 +1,5 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.fonts;
public enum FontStyle {
REGULAR, BOLD, ITALIC, BOLD_ITALIC, HANDWRITTEN;
}

@ -0,0 +1,5 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.fonts;
public record HeightAndDescent(float height, float descent) {
}

@ -0,0 +1,178 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.fonts;
import java.io.ByteArrayInputStream;
import java.util.Set;
import org.apache.fontbox.ttf.GlyphData;
import org.apache.fontbox.ttf.TTFParser;
import org.apache.fontbox.ttf.TrueTypeFont;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import com.pdftron.pdf.Font;
import com.pdftron.pdf.PDFDoc;
import lombok.AllArgsConstructor;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@RequiredArgsConstructor
@AllArgsConstructor
public class Type0FontMetricsProvider implements FontMetricsProvider {
public static final Type0FontMetricsProvider REGULAR_INSTANCE = regular(new PDDocument());
public static final Type0FontMetricsProvider BOLD_INSTANCE = bold(new PDDocument());
public static final Type0FontMetricsProvider BOLD_ITALIC_INSTANCE = boldItalic(new PDDocument());
public static final Type0FontMetricsProvider ITALIC_INSTANCE = italic(new PDDocument());
private final String resourcePath;
private PDType0Font type0Font;
private TrueTypeFont trueTypeFont;
private PDDocument documentThisIsEmbeddedIn;
// for this specific font back-/forward-slashes have a lot of descent screwing up the font size and therefore bold detection. So if we find such a character we ignore its descent.
private static final Set<Integer> slashGlyphIds = Set.of(18, 63);
@SneakyThrows
public static Type0FontMetricsProvider regular(PDDocument document) {
String resourcePath = "fonts/cmu-regular.ttf";
return createFromResourcePath(resourcePath, document);
}
@SneakyThrows
public static Type0FontMetricsProvider bold(PDDocument document) {
String resourcePath = "fonts/cmu-bold.ttf";
return createFromResourcePath(resourcePath, document);
}
@SneakyThrows
public static Type0FontMetricsProvider italic(PDDocument document) {
String resourcePath = "fonts/cmu-italic.ttf";
return createFromResourcePath(resourcePath, document);
}
@SneakyThrows
public static Type0FontMetricsProvider boldItalic(PDDocument document) {
String resourcePath = "fonts/cmu-bold-italic.ttf";
return createFromResourcePath(resourcePath, document);
}
@SneakyThrows
@SuppressWarnings("PMD.CloseResource")
private static TrueTypeFont readFromResourcePath(String resourcePath) {
// The ttf is closed with the document, see PDType0Font line 134
try (var in = Thread.currentThread().getContextClassLoader().getResourceAsStream(resourcePath); var buffer = new RandomAccessReadBuffer(in)) {
return new TTFParser().parse(buffer);
}
}
@SneakyThrows
@SuppressWarnings("PMD.CloseResource")
private static Type0FontMetricsProvider createFromResourcePath(String resourcePath, PDDocument document) {
TrueTypeFont trueTypeFont = readFromResourcePath(resourcePath);
// since Type0Font can be descendant from any font, we need to remember the original TrueTypeFont for the glyph information
return new Type0FontMetricsProvider(resourcePath, PDType0Font.load(document, trueTypeFont, true), trueTypeFont, document); // use Type0Font for unicode support
}
@SneakyThrows
public HeightAndDescent calculateHeightAndDescent(String text) {
byte[] bytes;
try {
bytes = type0Font.encode(text);
} catch (IllegalArgumentException e) {
log.debug("The string {} could not be parsed, using average height and descent", text);
return new HeightAndDescent(800, -50);
}
ByteArrayInputStream in = new ByteArrayInputStream(bytes);
float descent = 0;
float height = 0;
while (in.available() > 0) {
try {
int code = type0Font.readCode(in);
int glyphId = type0Font.codeToGID(code);
GlyphData glyph = trueTypeFont.getGlyph().getGlyph(glyphId);
if (glyph == null || glyph.getBoundingBox() == null) {
continue;
}
if (!slashGlyphIds.contains(glyphId)) {
descent = Math.min(descent, glyph.getYMinimum());
}
height = Math.max(height, glyph.getYMaximum());
} catch (Exception e) {
log.debug("descent and height of string {} could not be parsed, using average fallback value!", text);
}
}
// some characters like comma or minus return very small height values, while the OCR engine still returns a normal-sized bounding box, which would blow up the height scaling factor,
// so we need a minimum value. Here, 500 seems optimal for the characters "-", ",", "_"
return new HeightAndDescent(Math.max(height, 500), descent);
}
@Override
public double getMaxDescent() {
return 200; // This value is an estimate to counteract Azure's bounding box: the lower edge of the bounding box is moved up by this amount during font metrics calculation
}
@Override
public PDFont getFont() {
return type0Font;
}
@Override
@SneakyThrows
public PDFont embed(PDDocument document) {
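// guard against instances created with only a resource path, where no document has been set yet
if (documentThisIsEmbeddedIn != null && documentThisIsEmbeddedIn.equals(document)) {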
return getFont();
}
// no need to close, the font will be closed with the document it is embedded in
this.trueTypeFont = readFromResourcePath(resourcePath);
this.type0Font = PDType0Font.load(document, trueTypeFont, true);
this.documentThisIsEmbeddedIn = document;
return getFont();
}
@SneakyThrows
public Font embed(PDFDoc doc) {
try (var in = Thread.currentThread().getContextClassLoader().getResourceAsStream(resourcePath)) {
return Font.createCIDTrueTypeFont(doc, in, true, false);
}
}
@SneakyThrows
public void close() {
trueTypeFont.close();
}
}
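
The core of `calculateHeightAndDescent` is a min/max scan over glyph bounding boxes with two special cases: the descent of slash glyphs is ignored, and the height is floored at a minimum. The following sketch shows that scan on plain data, without PDFBox; `HeightAndDescentSketch`, `GlyphBounds`, and the sample glyph values are hypothetical and only illustrate the rule.

```java
import java.util.List;
import java.util.Set;

final class HeightAndDescentSketch {

    // Hypothetical stand-in for the vertical extent of a glyph's bounding box, in font units.
    record GlyphBounds(int glyphId, float yMin, float yMax) {}

    record HeightAndDescent(float height, float descent) {}

    static HeightAndDescent measure(List<GlyphBounds> glyphs, Set<Integer> ignoredDescentIds, float minHeight) {
        float descent = 0;
        float height = 0;
        for (GlyphBounds glyph : glyphs) {
            if (!ignoredDescentIds.contains(glyph.glyphId())) {
                descent = Math.min(descent, glyph.yMin()); // lowest point below the baseline
            }
            height = Math.max(height, glyph.yMax()); // highest point above the baseline
        }
        // flat glyphs such as "-" would otherwise produce a tiny height
        return new HeightAndDescent(Math.max(height, minHeight), descent);
    }

    public static void main(String[] args) {
        List<GlyphBounds> glyphs = List.of(new GlyphBounds(18, -250f, 750f), new GlyphBounds(74, -200f, 450f));
        // glyph 18 is treated as a slash, so its descent of -250 is ignored
        System.out.println(measure(glyphs, Set.of(18, 63), 500f));
    }
}
```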

@ -0,0 +1,207 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.awt.Color;
import java.awt.geom.AffineTransform;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import com.azure.ai.documentintelligence.models.BoundingRegion;
import com.azure.ai.documentintelligence.models.DocumentBarcode;
import com.azure.ai.documentintelligence.models.DocumentFigure;
import com.azure.ai.documentintelligence.models.DocumentKeyValuePair;
import com.azure.ai.documentintelligence.models.DocumentLine;
import com.azure.ai.documentintelligence.models.DocumentList;
import com.azure.ai.documentintelligence.models.DocumentListItem;
import com.azure.ai.documentintelligence.models.DocumentParagraph;
import com.azure.ai.documentintelligence.models.DocumentSection;
import com.azure.ai.documentintelligence.models.DocumentTable;
import com.azure.ai.documentintelligence.models.DocumentTableCell;
import com.azure.ai.documentintelligence.models.DocumentTableCellKind;
import com.azure.ai.documentintelligence.models.DocumentWord;
import com.azure.ai.documentintelligence.models.ParagraphRole;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.SpanLookup;
import com.knecon.fforesight.service.ocr.processor.visualizations.utils.Rectangle2DBBoxCollector;
import com.knecon.fforesight.service.ocr.processor.visualizations.utils.LineUtils;
import com.knecon.fforesight.service.ocr.v1.api.model.QuadPoint;
import com.knecon.fforesight.service.viewerdoc.layers.IdpLayerConfig;
import com.knecon.fforesight.service.viewerdoc.model.ColoredLine;
import com.knecon.fforesight.service.viewerdoc.model.ColoredRectangle;
import com.knecon.fforesight.service.viewerdoc.model.FilledRectangle;
import com.knecon.fforesight.service.viewerdoc.model.Visualizations;
import com.knecon.fforesight.service.viewerdoc.model.VisualizationsOnPage;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
@RequiredArgsConstructor
@FieldDefaults(level = AccessLevel.PRIVATE, makeFinal = true)
public class IdpLayer extends IdpLayerConfig {
public static final int LINE_WIDTH = 1;
private Map<Integer, AffineTransform> pageCtms;
public void addSection(int pageNumber, DocumentSection section, SpanLookup<DocumentWord> wordsOnPage) {
QuadPoint bbox = QuadPoint.fromRectangle2D(section.getSpans()
.stream()
.map(wordsOnPage::findElementsContainedInSpan)
.flatMap(Collection::stream)
.map(DocumentWord::getPolygon)
.map(QuadPoint::fromPolygons)
.map(QuadPoint::getBounds2D)
.collect(new Rectangle2DBBoxCollector()));
addQuadPoint(pageNumber, bbox, sections, SECTION_COLOR);
}
private void addQuadPoint(int pageNumber, QuadPoint bbox, Visualizations vis, Color color) {
var sectionsOnPage = getOrCreateVisualizationsOnPage(pageNumber, vis);
sectionsOnPage.getColoredRectangles().add(new ColoredRectangle(bbox.getTransformed(pageCtms.get(pageNumber)).getBounds2D(), color, LINE_WIDTH));
}
public void addList(DocumentList list, PageBatch pageOffset) {
for (DocumentListItem item : list.getItems()) {
addBoundingRegion(item.getBoundingRegions(), lists, PARAGRAPH_COLOR, pageOffset);
}
}
public void addBarcode(int pageNumber, DocumentBarcode barcode) {
addPolygon(pageNumber, barcode.getPolygon(), barcodes, IMAGE_COLOR);
}
public void addKeyValue(DocumentKeyValuePair keyValue, PageBatch pageOffset) {
addBoundingRegion(keyValue.getKey().getBoundingRegions(), keyValuePairs, KEY_COLOR, pageOffset);
if (keyValue.getValue() != null) {
addBoundingRegion(keyValue.getValue().getBoundingRegions(), keyValuePairs, VALUE_COLOR, pageOffset);
if (keyValue.getKey().getBoundingRegions().get(0).getPageNumber() != keyValue.getValue().getBoundingRegions().get(0).getPageNumber()) {
return;
}
int pageNumberWithOffset = pageOffset.getPageNumber(keyValue.getKey().getBoundingRegions().get(0).getPageNumber());
QuadPoint key = QuadPoint.fromPolygons(keyValue.getKey().getBoundingRegions().get(0).getPolygon());
QuadPoint value = QuadPoint.fromPolygons(keyValue.getValue().getBoundingRegions().get(0).getPolygon());
var line = LineUtils.findClosestMidpointLine(key, value);
line = LineUtils.transform(line, pageCtms.get(pageNumberWithOffset));
var arrowHead = LineUtils.createArrowHead(line, Math.min(LineUtils.length(line), 5));
var linesOnPage = getOrCreateVisualizationsOnPage(pageNumberWithOffset, keyValuePairs).getColoredLines();
linesOnPage.add(new ColoredLine(line, KEY_VALUE_BBOX_COLOR, LINE_WIDTH));
linesOnPage.add(new ColoredLine(arrowHead[0], KEY_VALUE_BBOX_COLOR, LINE_WIDTH));
linesOnPage.add(new ColoredLine(arrowHead[1], KEY_VALUE_BBOX_COLOR, LINE_WIDTH));
}
}
public void addParagraph(DocumentParagraph paragraph, PageBatch pageOffset) {
Color color;
if (paragraph.getRole() == null) {
color = PARAGRAPH_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.SECTION_HEADING)) {
color = SECTION_HEADING_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.TITLE)) {
color = TITLE_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.PAGE_HEADER)) {
color = HEADER_FOOTER_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.PAGE_FOOTER)) {
color = HEADER_FOOTER_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.PAGE_NUMBER)) {
color = HEADER_FOOTER_COLOR;
} else if (Objects.equals(paragraph.getRole(), ParagraphRole.FOOTNOTE)) {
color = FOOTNOTE_COLOR;
} else {
color = PARAGRAPH_COLOR;
}
paragraph.getBoundingRegions()
.forEach(br -> addBoundingRegion(br, paragraphs, color, pageOffset));
}
private void addBoundingRegion(BoundingRegion boundingRegion, Visualizations visualizations, Color color, PageBatch pageOffset) {
addPolygon(pageOffset.getPageNumber(boundingRegion.getPageNumber()), boundingRegion.getPolygon(), visualizations, color);
}
private void addPolygon(int pageNumber, List<Double> polygon, Visualizations visualizations, Color color) {
VisualizationsOnPage visualizationsOnPage = getOrCreateVisualizationsOnPage(pageNumber, visualizations);
visualizationsOnPage.getColoredLines().addAll(LineUtils.quadPointAsLines(QuadPoint.fromPolygons(polygon).getTransformed(pageCtms.get(pageNumber)), color));
}
public void addFigure(DocumentFigure documentFigure, PageBatch pageOffset) {
addBoundingRegion(documentFigure.getBoundingRegions(), figures, IMAGE_COLOR, pageOffset);
}
public void addTable(DocumentTable documentTable, PageBatch pageOffset) {
addBoundingRegion(documentTable.getBoundingRegions(), tables, TABLE_COLOR, pageOffset);
documentTable.getCells()
.forEach(tableCell -> addTableCell(tableCell, pageOffset));
if (documentTable.getCaption() != null) {
addBoundingRegion(documentTable.getCaption().getBoundingRegions(), tables, TITLE_COLOR, pageOffset);
}
if (documentTable.getFootnotes() != null) {
documentTable.getFootnotes()
.forEach(tc -> addBoundingRegion(tc.getBoundingRegions(), tables, KEY_COLOR, pageOffset));
}
}
private void addTableCell(DocumentTableCell tableCell, PageBatch pageOffset) {
DocumentTableCellKind kind = tableCell.getKind() == null ? DocumentTableCellKind.CONTENT : tableCell.getKind();
if (kind.equals(DocumentTableCellKind.DESCRIPTION)) {
addBoundingRegion(tableCell.getBoundingRegions(), tables, KEY_COLOR, pageOffset);
return;
}
addBoundingRegion(tableCell.getBoundingRegions(), tables, INNER_LINES_COLOR, pageOffset);
if (kind.equals(DocumentTableCellKind.COLUMN_HEADER) || kind.equals(DocumentTableCellKind.ROW_HEADER)) {
for (BoundingRegion boundingRegion : tableCell.getBoundingRegions()) {
var vis = getOrCreateVisualizationsOnPage(pageOffset.getPageNumber(boundingRegion.getPageNumber()), tables);
QuadPoint qp = QuadPoint.fromPolygons(boundingRegion.getPolygon()).getTransformed(pageCtms.get(pageOffset.getPageNumber(boundingRegion.getPageNumber())));
vis.getFilledRectangles().add(new FilledRectangle(qp.getBounds2D(), TITLE_COLOR, 0.2f));
}
}
}
private void addBoundingRegion(List<BoundingRegion> boundingRegions, Visualizations visualizations, Color color, PageBatch pageOffset) {
for (BoundingRegion boundingRegion : boundingRegions) {
addBoundingRegion(boundingRegion, visualizations, color, pageOffset);
}
}
public void addLine(int pageNumber, DocumentLine line) {
addQuadPoint(pageNumber, QuadPoint.fromPolygons(line.getPolygon()), lines, LINES_COLOR);
}
}

@ -0,0 +1,84 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.awt.geom.AffineTransform;
import java.util.Map;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.azure.ai.documentintelligence.models.DocumentBarcode;
import com.azure.ai.documentintelligence.models.DocumentLine;
import com.azure.ai.documentintelligence.models.DocumentPage;
import com.azure.ai.documentintelligence.models.DocumentSection;
import com.azure.ai.documentintelligence.models.DocumentWord;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.SpanLookup;
import lombok.Getter;
@Getter
public class IdpLayerFactory {
private final IdpLayer idpLayer;
IdpLayerFactory(Map<Integer, AffineTransform> pageCtms) {
this.idpLayer = new IdpLayer(pageCtms);
}
public synchronized void addAnalyzeResult(AnalyzeResult analyzeResult, PageBatch pageOffset) {
addAnalyzeResult(analyzeResult, idpLayer, pageOffset);
}
private static void addAnalyzeResult(AnalyzeResult analyzeResult, IdpLayer idpLayer, PageBatch pageOffset) {
for (DocumentPage page : analyzeResult.getPages()) {
SpanLookup<DocumentWord> wordsOnPage = new SpanLookup<>(page.getWords()
.stream(), DocumentWord::getSpan);
if (analyzeResult.getSections() != null) {
for (DocumentSection section : analyzeResult.getSections()) {
idpLayer.addSection(getPageNumber(pageOffset, page), section, wordsOnPage);
}
}
if (page.getLines() != null) {
for (DocumentLine line : page.getLines()) {
idpLayer.addLine(getPageNumber(pageOffset, page), line);
}
}
if (page.getBarcodes() != null) {
for (DocumentBarcode barcode : page.getBarcodes()) {
idpLayer.addBarcode(getPageNumber(pageOffset, page), barcode);
}
}
}
if (analyzeResult.getParagraphs() != null) {
analyzeResult.getParagraphs()
.forEach(paragraph -> idpLayer.addParagraph(paragraph, pageOffset));
}
if (analyzeResult.getFigures() != null) {
analyzeResult.getFigures()
.forEach(documentFigure -> idpLayer.addFigure(documentFigure, pageOffset));
}
if (analyzeResult.getTables() != null) {
analyzeResult.getTables()
.forEach(documentTable -> idpLayer.addTable(documentTable, pageOffset));
}
if (analyzeResult.getLists() != null) {
analyzeResult.getLists()
.forEach(list -> idpLayer.addList(list, pageOffset));
}
if (analyzeResult.getKeyValuePairs() != null) {
analyzeResult.getKeyValuePairs()
.forEach(keyValue -> idpLayer.addKeyValue(keyValue, pageOffset));
}
}
private static int getPageNumber(PageBatch pageOffset, DocumentPage page) {
return pageOffset.getPageNumber(page.getPageNumber());
}
}

@ -0,0 +1,73 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import com.azure.ai.documentintelligence.models.AnalyzeResult;
import com.knecon.fforesight.service.ocr.processor.model.PageBatch;
import com.knecon.fforesight.service.ocr.processor.model.PageInformation;
import com.knecon.fforesight.service.ocr.processor.service.OcrExecutionSupervisor;
import com.knecon.fforesight.service.ocr.processor.OcrServiceSettings;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingSupervisor;
import com.knecon.fforesight.service.ocr.processor.visualizations.WritableOcrResult;
import com.knecon.fforesight.service.ocr.processor.visualizations.WritableOcrResultFactory;
import com.knecon.fforesight.service.viewerdoc.layers.LayerGroup;
import lombok.AccessLevel;
import lombok.experimental.FieldDefaults;
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class LayerFactory {
OcrExecutionSupervisor supervisor;
WritableOcrResultFactory writableOcrResultFactory;
IdpLayerFactory idpLayerFactory;
OcrDebugLayerFactory ocrDebugLayerFactory;
OcrTextLayerFactory ocrTextLayerFactory;
OcrServiceSettings settings;
public LayerFactory(OcrServiceSettings settings, OcrExecutionSupervisor supervisor, ImageProcessingSupervisor imageSupervisor, Map<Integer, PageInformation> pageInformation) {
this.writableOcrResultFactory = new WritableOcrResultFactory(pageInformation, settings, imageSupervisor);
this.idpLayerFactory = new IdpLayerFactory(writableOcrResultFactory.getPageCtms());
this.ocrDebugLayerFactory = new OcrDebugLayerFactory();
this.ocrTextLayerFactory = new OcrTextLayerFactory();
this.settings = settings;
this.supervisor = supervisor;
}
public void addAnalyzeResult(PageBatch pageRange, AnalyzeResult analyzeResult) throws InterruptedException {
List<WritableOcrResult> results = writableOcrResultFactory.buildOcrResultToWrite(analyzeResult, pageRange);
ocrTextLayerFactory.addWritableOcrResult(results);
if (settings.isDebug()) {
ocrDebugLayerFactory.addAnalysisResult(results);
}
if (settings.isIdpEnabled()) {
idpLayerFactory.addAnalyzeResult(analyzeResult, pageRange);
}
this.supervisor.finishMappingResult(pageRange);
}
public OcrResult getLayers() {
OcrTextLayer ocrTextLayer = ocrTextLayerFactory.getOcrTextLayer();
List<LayerGroup> debugLayers = new LinkedList<>();
if (settings.isDebug()) {
debugLayers.add(ocrDebugLayerFactory.getOcrDebugLayer());
}
if (settings.isIdpEnabled()) {
debugLayers.add(idpLayerFactory.getIdpLayer());
}
return new OcrResult(List.of(ocrTextLayer), debugLayers);
}
}

@ -0,0 +1,60 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.awt.Color;
import java.awt.geom.Line2D;
import java.util.List;
import java.util.Optional;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.processor.visualizations.utils.LineUtils;
import com.knecon.fforesight.service.viewerdoc.layers.OcrDebugLayerConfig;
import com.knecon.fforesight.service.viewerdoc.model.ColoredLine;
import com.knecon.fforesight.service.viewerdoc.model.PlacedText;
import com.knecon.fforesight.service.viewerdoc.model.VisualizationsOnPage;
public class OcrDebugLayer extends OcrDebugLayerConfig {
public void addTextPositionInImage(int pageNumber, TextPositionInImage word) {
var textOnPage = getOrCreateVisualizationsOnPage(pageNumber, debugText);
var overlappedText = getOrCreateVisualizationsOnPage(pageNumber, this.overlappedText);
var bboxOnPage = getOrCreateVisualizationsOnPage(pageNumber, debugBBox);
VisualizationsOnPage vis = word.isOverlapsIgnoreZone() ? overlappedText : textOnPage;
vis.getPlacedTexts()
.add(new PlacedText(word.getText(),
null,
getColor(word),
(float) word.getFontSize(),
word.getFontMetricsProvider(),
Optional.of(word.getTextMatrix()),
Optional.of(RenderingMode.FILL)));
bboxOnPage.getColoredLines().addAll(LineUtils.quadPointAsLines(word.getTransformedTextBBox()));
}
public void addTableLines(int pageNumber, List<Line2D> tableLines) {
var linesOnPage = getOrCreateVisualizationsOnPage(pageNumber, this.tableLines);
for (Line2D tableLine : tableLines) {
linesOnPage.getColoredLines().add(new ColoredLine(tableLine, TABLE_LINES_COLOR, 1));
}
}
private static Color getColor(TextPositionInImage word) {
return switch (word.getFontStyle()) {
case REGULAR -> REGULAR_COLOR;
case BOLD -> BOLD_COLOR;
case ITALIC -> ITALIC_COLOR;
case BOLD_ITALIC -> BOLD_ITALIC_COLOR;
case HANDWRITTEN -> HANDWRITTEN_COLOR;
};
}
}

@ -0,0 +1,30 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.util.List;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.processor.visualizations.WritableOcrResult;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@Getter
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OcrDebugLayerFactory {
OcrDebugLayer ocrDebugLayer = new OcrDebugLayer();
public synchronized void addAnalysisResult(List<WritableOcrResult> results) {
for (WritableOcrResult ocrResult : results) {
for (TextPositionInImage textPositionInImage : ocrResult.getTextPositionInImage()) {
ocrDebugLayer.addTextPositionInImage(ocrResult.getPageNumber(), textPositionInImage);
}
ocrDebugLayer.addTableLines(ocrResult.getPageNumber(), ocrResult.getTableLines());
}
}
}

@ -0,0 +1,9 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.util.List;
import com.knecon.fforesight.service.viewerdoc.layers.LayerGroup;
public record OcrResult(List<LayerGroup> regularLayers, List<LayerGroup> debugLayers) {
}

@ -0,0 +1,61 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.awt.Color;
import java.awt.geom.Line2D;
import java.util.List;
import java.util.Optional;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.viewerdoc.layers.OcrTextLayerConfig;
import com.knecon.fforesight.service.viewerdoc.model.ColoredLine;
import com.knecon.fforesight.service.viewerdoc.model.PlacedText;
import com.knecon.fforesight.service.viewerdoc.model.Visualizations;
import com.knecon.fforesight.service.viewerdoc.model.VisualizationsOnPage;
public class OcrTextLayer extends OcrTextLayerConfig {
public static final int LINE_WIDTH = 1;
public void addTextPositionInImage(int pageNumber, TextPositionInImage word) {
if (word.isOverlapsIgnoreZone()) {
return;
}
var textOnPage = getOrCreateVisualizationsOnPage(pageNumber, ocrText);
textOnPage.getPlacedTexts()
.add(new PlacedText(word.getText(),
null,
Color.BLACK,
(float) word.getFontSize(),
word.getFontMetricsProvider(),
Optional.of(word.getTextMatrix()),
Optional.of(RenderingMode.NEITHER)));
}
public void addTableLines(int pageNumber, List<Line2D> tableLines) {
for (Line2D line : tableLines) {
VisualizationsOnPage visualizationsOnPage = getOrCreateVisualizationsOnPageWithInvisibleLines(pageNumber, this.tableLines);
visualizationsOnPage.getColoredLines().add(new ColoredLine(line, Color.BLACK, LINE_WIDTH));
}
}
protected VisualizationsOnPage getOrCreateVisualizationsOnPageWithInvisibleLines(int page, Visualizations visualizations) {
if (visualizations.getVisualizationsOnPages().containsKey(page - 1)) {
return visualizations.getVisualizationsOnPages().get(page - 1);
} else {
VisualizationsOnPage visualizationsOnPage = VisualizationsOnPage.builder().makePathsInvisible(true).build();
visualizations.getVisualizationsOnPages().put(page - 1, visualizationsOnPage);
return visualizationsOnPage;
}
}
}
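
The get-or-create lookup in `getOrCreateVisualizationsOnPageWithInvisibleLines` can be sketched in isolation as follows. This is a minimal illustration, assuming that `getVisualizationsOnPages()` returns a `java.util.Map`; `GetOrCreateSketch` and `PageContent` are hypothetical stand-ins, not project classes. It shows that `Map.computeIfAbsent` gives the same result as the `containsKey`/`get`/`put` sequence, including the shift from 1-based page numbers to 0-based keys.

```java
import java.util.HashMap;
import java.util.Map;

class GetOrCreateSketch {

    // Hypothetical stand-in for the per-page visualization container.
    static final class PageContent {
        final boolean pathsInvisible;

        PageContent(boolean pathsInvisible) {
            this.pathsInvisible = pathsInvisible;
        }
    }

    // Callers pass 1-based page numbers, the map is keyed 0-based (the "page - 1" in OcrTextLayer).
    public static PageContent getOrCreate(Map<Integer, PageContent> pages, int page) {
        return pages.computeIfAbsent(page - 1, key -> new PageContent(true));
    }

    public static void main(String[] args) {
        Map<Integer, PageContent> pages = new HashMap<>();
        PageContent first = getOrCreate(pages, 1);
        PageContent again = getOrCreate(pages, 1);
        // The second call returns the entry created by the first call.
        System.out.println(first == again);
        System.out.println(pages.keySet());
    }
}
```

Whether the production code can switch to `computeIfAbsent` depends on the actual type returned by `getVisualizationsOnPages()`, which is not visible in this diff.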

@ -0,0 +1,30 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.layers;
import java.util.List;
import com.knecon.fforesight.service.ocr.processor.model.TextPositionInImage;
import com.knecon.fforesight.service.ocr.processor.visualizations.WritableOcrResult;
import lombok.AccessLevel;
import lombok.Getter;
import lombok.experimental.FieldDefaults;
@Getter
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OcrTextLayerFactory {
OcrTextLayer ocrTextLayer = new OcrTextLayer();
public synchronized void addWritableOcrResult(List<WritableOcrResult> ocrResults) {
for (WritableOcrResult ocrResult : ocrResults) {
for (TextPositionInImage textPositionInImage : ocrResult.getTextPositionInImage()) {
ocrTextLayer.addTextPositionInImage(ocrResult.getPageNumber(), textPositionInImage);
}
ocrTextLayer.addTableLines(ocrResult.getPageNumber(), ocrResult.getTableLines());
}
}
}

@ -0,0 +1,118 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.utils;
import java.awt.Color;
import java.awt.geom.AffineTransform;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.util.List;
import com.knecon.fforesight.service.ocr.v1.api.model.QuadPoint;
import com.knecon.fforesight.service.viewerdoc.model.ColoredLine;
import lombok.experimental.UtilityClass;
@UtilityClass
public class LineUtils {
public List<ColoredLine> quadPointAsLines(QuadPoint rect) {
return List.of(new ColoredLine(new Line2D.Double(rect.a(), rect.b()), Color.ORANGE, 1),
new ColoredLine(new Line2D.Double(rect.b(), rect.c()), Color.BLUE, 1),
new ColoredLine(new Line2D.Double(rect.c(), rect.d()), Color.GREEN, 1),
new ColoredLine(new Line2D.Double(rect.d(), rect.a()), Color.MAGENTA, 1));
}
public List<ColoredLine> quadPointAsLines(QuadPoint rect, Color color) {
return List.of(new ColoredLine(new Line2D.Double(rect.a(), rect.b()), color, 1),
new ColoredLine(new Line2D.Double(rect.b(), rect.c()), color, 1),
new ColoredLine(new Line2D.Double(rect.c(), rect.d()), color, 1),
new ColoredLine(new Line2D.Double(rect.d(), rect.a()), color, 1));
}
public static Line2D transform(Line2D line2D, AffineTransform affineTransform) {
var p1 = affineTransform.transform(line2D.getP1(), null);
var p2 = affineTransform.transform(line2D.getP2(), null);
return new Line2D.Double(p1, p2);
}
public static double length(Line2D line2D) {
return line2D.getP1().distance(line2D.getP2());
}
public static Line2D findClosestMidpointLine(QuadPoint quad1, QuadPoint quad2) {
List<Line2D> lines1 = quad1.asLines()
.toList();
List<Line2D> lines2 = quad2.asLines()
.toList();
Line2D closestLine1 = null;
Line2D closestLine2 = null;
double minDistance = Double.MAX_VALUE;
for (Line2D line1 : lines1) {
for (Line2D line2 : lines2) {
double distance = lineDistance(line1, line2);
if (distance < minDistance) {
minDistance = distance;
closestLine1 = line1;
closestLine2 = line2;
}
}
}
if (closestLine1 == null || closestLine2 == null) {
throw new IllegalStateException("Could not find closest lines");
}
Point2D midpoint1 = getMidpoint(closestLine1);
Point2D midpoint2 = getMidpoint(closestLine2);
return new Line2D.Double(midpoint1, midpoint2);
}
private static double lineDistance(Line2D line1, Line2D line2) {
return getMidpoint(line1).distance(getMidpoint(line2));
}
private static Point2D getMidpoint(Line2D line) {
double x = (line.getX1() + line.getX2()) / 2;
double y = (line.getY1() + line.getY2()) / 2;
return new Point2D.Double(x, y);
}
public static Line2D[] createArrowHead(Line2D line, double arrowLength) {
Point2D start = line.getP1();
Point2D end = line.getP2();
// Calculate the angle of the line
double angle = Math.atan2(end.getY() - start.getY(), end.getX() - start.getX());
// Calculate the points for the two arrow lines
double arrowHeadAngle = Math.PI / 6;
double x1 = end.getX() - arrowLength * Math.cos(angle - arrowHeadAngle);
double y1 = end.getY() - arrowLength * Math.sin(angle - arrowHeadAngle);
double x2 = end.getX() - arrowLength * Math.cos(angle + arrowHeadAngle);
double y2 = end.getY() - arrowLength * Math.sin(angle + arrowHeadAngle);
// Create and return the two arrow lines
Line2D arrow1 = new Line2D.Double(end, new Point2D.Double(x1, y1));
Line2D arrow2 = new Line2D.Double(end, new Point2D.Double(x2, y2));
return new Line2D[]{arrow1, arrow2};
}
}
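
To make the geometry in `createArrowHead` easier to check, here is a small standalone sketch. It copies the arithmetic of that method into a hypothetical class `ArrowHeadDemo` (so it runs without the project on the classpath) and applies it to a horizontal line. Each stroke starts at the line's end point and is rotated 30 degrees (`Math.PI / 6`) away from the reversed line direction.

```java
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;

class ArrowHeadDemo {

    // Same arithmetic as LineUtils.createArrowHead, duplicated here so the example is self-contained.
    public static Line2D[] createArrowHead(Line2D line, double arrowLength) {
        Point2D start = line.getP1();
        Point2D end = line.getP2();
        // Direction of the line, measured from start to end
        double angle = Math.atan2(end.getY() - start.getY(), end.getX() - start.getX());
        double arrowHeadAngle = Math.PI / 6;
        double x1 = end.getX() - arrowLength * Math.cos(angle - arrowHeadAngle);
        double y1 = end.getY() - arrowLength * Math.sin(angle - arrowHeadAngle);
        double x2 = end.getX() - arrowLength * Math.cos(angle + arrowHeadAngle);
        double y2 = end.getY() - arrowLength * Math.sin(angle + arrowHeadAngle);
        return new Line2D[]{new Line2D.Double(end, new Point2D.Double(x1, y1)),
                new Line2D.Double(end, new Point2D.Double(x2, y2))};
    }

    public static void main(String[] args) {
        // A horizontal line from (0, 0) to (10, 0) with strokes of length 2
        Line2D shaft = new Line2D.Double(0, 0, 10, 0);
        for (Line2D stroke : createArrowHead(shaft, 2)) {
            System.out.println(stroke.getP1() + " -> " + stroke.getP2());
        }
    }
}
```

For this input the two strokes end at roughly (8.27, 1) and (8.27, -1), that is, symmetric about the line and pointing back towards its start.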

@ -0,0 +1,105 @@
package com.knecon.fforesight.service.ocr.processor.visualizations.utils;
import java.awt.geom.Rectangle2D;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collector;
import lombok.AllArgsConstructor;
import lombok.NoArgsConstructor;
public class Rectangle2DBBoxCollector implements Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> {
@Override
public Supplier<BBox> supplier() {
return BBox::new;
}
@Override
public BiConsumer<BBox, Rectangle2D> accumulator() {
return BBox::addRectangle;
}
@Override
public BinaryOperator<BBox> combiner() {
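return (b1, b2) -> {
// An empty partial result still has null fields; unboxing them in Math.min/Math.max would throw.
if (b1.lowerLeftX == null) {
return b2;
}
if (b2.lowerLeftX == null) {
return b1;
}
return new BBox(Math.min(b1.lowerLeftX, b2.lowerLeftX),
Math.min(b1.lowerLeftY, b2.lowerLeftY),
Math.max(b1.upperRightX, b2.upperRightX),
Math.max(b1.upperRightY, b2.upperRightY));
};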
}
@Override
public Function<BBox, Rectangle2D> finisher() {
return BBox::toRectangle2D;
}
@Override
public Set<Characteristics> characteristics() {
return Set.of(Characteristics.UNORDERED);
}
@AllArgsConstructor
@NoArgsConstructor
public static class BBox {
Double lowerLeftX;
Double lowerLeftY;
Double upperRightX;
Double upperRightY;
public Rectangle2D toRectangle2D() {
if (lowerLeftX == null || lowerLeftY == null || upperRightX == null || upperRightY == null) {
return new Rectangle2D.Double(0, 0, 0, 0);
}
return new Rectangle2D.Double(lowerLeftX, lowerLeftY, upperRightX - lowerLeftX, upperRightY - lowerLeftY);
}
public void addRectangle(Rectangle2D rectangle2D) {
double lowerLeftX = Math.min(rectangle2D.getMinX(), rectangle2D.getMaxX());
double lowerLeftY = Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY());
double upperRightX = Math.max(rectangle2D.getMinX(), rectangle2D.getMaxX());
double upperRightY = Math.max(rectangle2D.getMinY(), rectangle2D.getMaxY());
if (this.lowerLeftX == null) {
this.lowerLeftX = lowerLeftX;
} else if (this.lowerLeftX > lowerLeftX) {
this.lowerLeftX = lowerLeftX;
}
if (this.lowerLeftY == null) {
this.lowerLeftY = lowerLeftY;
} else if (this.lowerLeftY > lowerLeftY) {
this.lowerLeftY = lowerLeftY;
}
if (this.upperRightX == null) {
this.upperRightX = upperRightX;
} else if (this.upperRightX < upperRightX) {
this.upperRightX = upperRightX;
}
if (this.upperRightY == null) {
this.upperRightY = upperRightY;
} else if (this.upperRightY < upperRightY) {
this.upperRightY = upperRightY;
}
}
}
}
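
A usage sketch for a bounding-box collector of this kind. Because the project class is not on the classpath here, the example rebuilds an equivalent collector with `Collector.of`; `BBoxDemo` and `toBBox` are hypothetical names, and the accumulator relies on the JDK's `Rectangle2D.add` rather than on the `BBox` helper above. Like `toRectangle2D`, it falls back to an empty rectangle at the origin when the stream has no elements.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.stream.Collector;

class BBoxDemo {

    // Hypothetical stand-in for Rectangle2DBBoxCollector: reduces rectangles to their common bounding box.
    public static Collector<Rectangle2D, Rectangle2D.Double[], Rectangle2D> toBBox() {
        return Collector.of(
                // One-element array as a mutable holder; null means "nothing seen yet"
                () -> new Rectangle2D.Double[1],
                (acc, r) -> {
                    if (acc[0] == null) {
                        acc[0] = new Rectangle2D.Double(r.getMinX(), r.getMinY(), r.getWidth(), r.getHeight());
                    } else {
                        acc[0].add(r);
                    }
                },
                // Combining must tolerate empty partial results (relevant for parallel streams)
                (a, b) -> {
                    if (a[0] == null) {
                        return b;
                    }
                    if (b[0] != null) {
                        a[0].add(b[0]);
                    }
                    return a;
                },
                acc -> acc[0] == null ? new Rectangle2D.Double(0, 0, 0, 0) : acc[0],
                Collector.Characteristics.UNORDERED);
    }

    public static void main(String[] args) {
        List<Rectangle2D> words = List.of(new Rectangle2D.Double(10, 10, 20, 5), new Rectangle2D.Double(40, 12, 15, 5));
        Rectangle2D bbox = words.stream().collect(toBBox());
        System.out.println(bbox);
    }
}
```

With the two sample rectangles the result spans x from 10 to 55 and y from 10 to 17.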

@ -0,0 +1,83 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;
import org.apache.pdfbox.Loader;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.springframework.core.io.ClassPathResource;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.GhostScriptService;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingPipeline;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingService;
import com.knecon.fforesight.service.ocr.processor.service.imageprocessing.ImageProcessingSupervisor;
import com.knecon.fforesight.service.ocr.processor.utils.OsUtils;
import com.sun.jna.NativeLibrary;
import lombok.SneakyThrows;
@Disabled // Leptonica is not available during testing on server
class ImageProcessingPipelineTest {
ImageProcessingPipeline imageProcessingPipeline;
@BeforeEach
public void setup() {
System.setProperty("jna.library.path", System.getenv("VCPKG_DYNAMIC_LIB"));
try (NativeLibrary leptonicaLib = NativeLibrary.getInstance("leptonica")) {
assert leptonicaLib != null;
}
ImageProcessingService imageProcessingService = new ImageProcessingService();
GhostScriptService ghostScriptService = new GhostScriptService();
imageProcessingPipeline = new ImageProcessingPipeline(ghostScriptService, imageProcessingService);
}
@Test
@SneakyThrows
public void testImageProcessingPipeline() {
String fileName = "/home/kschuettler/Dokumente/TestFiles/OCR/VV-331340.pdf";
File file;
if (fileName.startsWith("files")) {
file = new ClassPathResource(fileName).getFile();
} else {
file = new File(fileName);
}
Path tmpDir = Path.of(OsUtils.getTemporaryDirectory()).resolve("IMAGE_PROCESSING_TEST").resolve(file.toPath().getFileName());
assert tmpDir.toFile().exists() || tmpDir.toFile().mkdirs();
var documentFile = tmpDir.resolve(Path.of("document.pdf"));
Files.copy(file.toPath(), documentFile, StandardCopyOption.REPLACE_EXISTING);
int numberOfPages;
try (var doc = Loader.loadPDF(file)) {
numberOfPages = doc.getNumberOfPages();
}
Set<Integer> pageNumbers = new HashSet<>();
for (int i = 1; i <= numberOfPages; i++) {
if (i % 2 == 0) {
continue;
}
pageNumbers.add(i);
}
ImageProcessingSupervisor supervisor = imageProcessingPipeline.run(pageNumbers, tmpDir.resolve("images"), documentFile.toFile());
supervisor.awaitAll();
}
}

@ -0,0 +1,28 @@
package com.knecon.fforesight.service.ocr.processor.service;
import java.io.File;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.FontMetrics;
import com.knecon.fforesight.service.ocr.processor.visualizations.fonts.Type0FontMetricsProvider;
import lombok.SneakyThrows;
@SuppressWarnings("PMD")
class Type0FontMetricsProviderTest {
@Test
@SneakyThrows
public void testStringWidth() {
try (PDDocument document = Loader.loadPDF(new File(Type0FontMetricsProviderTest.class.getClassLoader().getResource("InvisibleText.pdf").getPath()))) {
Type0FontMetricsProvider metricsFactory = Type0FontMetricsProvider.regular(document);
FontMetrics fontMetrics = metricsFactory.calculateMetrics("deine mutter", 100, 50);
}
}
}

@ -0,0 +1,21 @@
package com.knecon.fforesight.service.ocr.processor.utils;
import static org.junit.jupiter.api.Assertions.assertEquals;
import java.util.Collection;
import org.junit.jupiter.api.Test;
class ListSplittingUtilsTest {
@Test
public void testBalancedListSplitting() {
int threadCount = 18;
int numberOfPages = 48;
var balancedList = ListSplittingUtils.buildBalancedContinuousSublist(numberOfPages, threadCount);
assertEquals(threadCount, balancedList.size());
assertEquals(numberOfPages, balancedList.stream().mapToLong(Collection::size).sum());
}
}
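
The implementation of `ListSplittingUtils.buildBalancedContinuousSublist` is not part of this section, so the following is only a sketch of one way to meet the two properties the test asserts: exactly `threadCount` sublists, and all pages covered. `BalancedSplitSketch.split` is a hypothetical name, and the 1-based, consecutive page numbering is an assumption. The first `numberOfPages % parts` sublists receive one extra page, so sizes differ by at most one.

```java
import java.util.ArrayList;
import java.util.List;

class BalancedSplitSketch {

    // Splits pages 1..numberOfPages into `parts` consecutive runs whose sizes differ by at most one.
    public static List<List<Integer>> split(int numberOfPages, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        int base = numberOfPages / parts;
        int remainder = numberOfPages % parts;
        List<List<Integer>> result = new ArrayList<>(parts);
        int nextPage = 1;
        for (int i = 0; i < parts; i++) {
            // The first `remainder` runs take one page more than the rest
            int size = base + (i < remainder ? 1 : 0);
            List<Integer> run = new ArrayList<>(size);
            for (int j = 0; j < size; j++) {
                run.add(nextPage++);
            }
            result.add(run);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same numbers as in the test: 48 pages across 18 threads
        List<List<Integer>> runs = split(48, 18);
        System.out.println(runs.size() + " runs, first " + runs.get(0) + ", last " + runs.get(runs.size() - 1));
    }
}
```

For 48 pages and 18 parts this yields twelve runs of three pages followed by six runs of two pages.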

@ -0,0 +1,16 @@
<Configuration>
<Appenders>
<Console name="CONSOLE" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
</Appenders>
<Loggers>
<Root level="warn">
<AppenderRef ref="CONSOLE"/>
</Root>
<Logger name="com.knecon" level="info"/>
</Loggers>
</Configuration>

@ -0,0 +1,78 @@
import org.springframework.boot.gradle.tasks.bundling.BootBuildImage
plugins {
application
id("com.knecon.fforesight.service.java-conventions")
id("org.springframework.boot") version "3.2.3"
id("io.spring.dependency-management") version "1.1.3"
id("org.sonarqube") version "4.3.0.3225"
id("io.freefair.lombok") version "8.4"
}
configurations {
all {
exclude(group = "commons-logging", module = "commons-logging")
exclude(group = "org.springframework.boot", module = "spring-boot-starter-log4j2")
exclude(group = "com.iqser.red.commons", module = "logging-commons")
}
}
val springBootStarterVersion = "3.2.3"
dependencies {
implementation(project(":azure-ocr-service-processor"))
implementation(project(":azure-ocr-service-api"))
implementation("com.knecon.fforesight:tracing-commons:0.5.0")
implementation("org.springframework.cloud:spring-cloud-starter-openfeign:4.1.1")
implementation("org.springframework.boot:spring-boot-starter-amqp:${springBootStarterVersion}")
implementation("net.logstash.logback:logstash-logback-encoder:7.4")
implementation("ch.qos.logback:logback-classic")
testImplementation("org.springframework.boot:spring-boot-starter-test:${springBootStarterVersion}")
testImplementation("com.iqser.red.commons:test-commons:2.1.0")
testImplementation("org.springframework.amqp:spring-rabbit-test:3.0.2")
}
tasks.named<BootBuildImage>("bootBuildImage") {
environment.put("BPE_DELIM_JAVA_TOOL_OPTIONS", " ")
environment.put("BPE_APPEND_JAVA_TOOL_OPTIONS", "-Dfile.encoding=UTF-8")
environment.put("BPE_GS_LIB", "/layers/fagiani_apt/apt/usr/share/ghostscript/9.55.0/Resource/Init/") // set ghostscript lib path, version in path must match version in Aptfile
environment.put("BPE_FONTCONFIG_PATH", "/layers/fagiani_apt/apt/etc/fonts/") // set ghostscript fontconfig path
val aptFile = layout.projectDirectory.file("src/main/resources/Aptfile").toString()
bindings.add("${aptFile}:/workspace/Aptfile:ro")
val vcpkgFile = layout.projectDirectory.file("src/main/resources/vcpkg.json").toString()
bindings.add("${vcpkgFile}:/workspace/vcpkg.json:ro")
buildpacks.set(
listOf(
"ghcr.io/knsita/buildpacks/fagiani_apt@sha256:9771d4d27d8050aee62769490b8882fffc794745c129fb98e1f33196e2c93504",
"ghcr.io/kschuettler/knecon-vcpkg@sha256:ba5e967b124de4865ff7e8f565684f752dd6e97b302e2dcf651283f6a19b98b9",
"urn:cnb:builder:paketo-buildpacks/java"
)
)
imageName.set("nexus.knecon.com:5001/ff/${project.name}") // must build image with same name always, otherwise the builder will not know which image to use as cache. DO NOT CHANGE!
if (project.hasProperty("buildbootDockerHostNetwork")) {
network.set("host")
}
docker {
if (project.hasProperty("buildbootDockerHostNetwork")) {
bindHostToBuilder.set(true)
}
verboseLogging.set(true)
publishRegistry {
username.set(providers.gradleProperty("mavenUser").getOrNull())
password.set(providers.gradleProperty("mavenPassword").getOrNull())
email.set(providers.gradleProperty("mavenEmail").getOrNull())
url.set("https://nexus.knecon.com:5001/")
}
val dockerTag = "nexus.knecon.com:5001/ff/${project.name}:${project.version}"
tags.set(listOf(dockerTag))
}
}

@ -0,0 +1,59 @@
package com.knecon.fforesight.service.ocr.v1.server;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.actuate.autoconfigure.security.servlet.ManagementWebSecurityAutoConfiguration;
import org.springframework.boot.autoconfigure.ImportAutoConfiguration;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.autoconfigure.security.servlet.SecurityAutoConfiguration;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Import;
import org.springframework.scheduling.annotation.EnableAsync;
import com.iqser.red.pdftronlogic.commons.InvisibleElementRemovalService;
import com.iqser.red.pdftronlogic.commons.WatermarkRemovalService;
import com.knecon.fforesight.service.ocr.processor.OcrServiceProcessorConfiguration;
import com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration;
import com.iqser.red.storage.commons.StorageAutoConfiguration;
import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
@EnableAsync
@ImportAutoConfiguration({MultiTenancyAutoConfiguration.class})
@SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class})
@Import({MessagingConfiguration.class, StorageAutoConfiguration.class, OcrServiceProcessorConfiguration.class})
public class Application {
/**
* Entry point to the service application.
*
* @param args Any command line parameter given upon startup.
*/
public static void main(String[] args) {
System.setProperty("org.apache.pdfbox.rendering.UsePureJavaCMYKConversion", "true");
SpringApplication.run(Application.class, args);
}
@Bean
public TimedAspect timedAspect(MeterRegistry registry) {
return new TimedAspect(registry);
}
@Bean
public InvisibleElementRemovalService invisibleElementRemovalService() {
return new InvisibleElementRemovalService();
}
@Bean
public WatermarkRemovalService watermarkRemovalService() {
return new WatermarkRemovalService();
}
}

@ -0,0 +1,21 @@
package com.knecon.fforesight.service.ocr.v1.server.configuration;
import org.springframework.context.annotation.Configuration;
import lombok.RequiredArgsConstructor;
@Configuration
@RequiredArgsConstructor
public class MessagingConfiguration {
public static final String OCR_REQUEST_QUEUE_PREFIX = "ocr_request_queue";
public static final String OCR_REQUEST_EXCHANGE = "ocr_request_exchange";
public static final String OCR_DLQ = "ocr_dlq";
public static final String OCR_RESPONSE_EXCHANGE = "ocr_response_exchange";
public static final String OCR_STATUS_UPDATE_EXCHANGE = "ocr_status_update_exchange";
public static final String OCR_STATUS_UPDATE_DLQ = "ocr_status_update_dlq";
public static final String X_ERROR_INFO_HEADER = "x-error-message";
public static final String X_ERROR_INFO_TIMESTAMP_HEADER = "x-error-message-timestamp";
}

@ -0,0 +1,11 @@
package com.knecon.fforesight.service.ocr.v1.server.configuration;
import org.springframework.context.annotation.Configuration;
import com.knecon.fforesight.tenantcommons.queue.TenantMessagingConfiguration;
@Configuration
public class TenantMessagingConfigurationImpl extends TenantMessagingConfiguration {
}

@ -0,0 +1,45 @@
package com.knecon.fforesight.service.ocr.v1.server.queue;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.service.IOcrMessageSender;
import com.knecon.fforesight.service.ocr.v1.api.model.DocumentRequest;
import com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration;
import com.knecon.fforesight.tenantcommons.TenantContext;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
@ConditionalOnProperty(value = "ocrService.sendStatusUpdates", havingValue = "false", matchIfMissing = true)
public class NoStatusUpdateOcrMessageSender implements IOcrMessageSender {
RabbitTemplate rabbitTemplate;
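public void sendOcrFinished(String fileId, int totalImages) {
// Intentionally empty: status updates are disabled for this sender.
}
public void sendOCRStarted(String fileId) {
// Intentionally empty: status updates are disabled for this sender.
}
public void sendUpdate(String fileId, int finishedImages, int totalImages) {
// Intentionally empty: status updates are disabled for this sender.
}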
public void sendOcrResponse(String dossierId, String fileId) {
rabbitTemplate.convertAndSend(MessagingConfiguration.OCR_RESPONSE_EXCHANGE, TenantContext.getTenantId(), new DocumentRequest(dossierId, fileId));
}
}

@ -0,0 +1,83 @@
package com.knecon.fforesight.service.ocr.v1.server.queue;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.OffsetDateTime;
import java.time.temporal.ChronoUnit;
import org.springframework.amqp.AmqpRejectAndDontRequeueException;
import org.springframework.amqp.core.Message;
import org.springframework.amqp.rabbit.annotation.RabbitHandler;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Service;
import org.springframework.util.FileSystemUtils;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.knecon.fforesight.service.ocr.processor.service.FileStorageService;
import com.knecon.fforesight.service.ocr.processor.service.IOcrMessageSender;
import com.knecon.fforesight.service.ocr.processor.service.OCRService;
import com.knecon.fforesight.service.ocr.v1.api.model.DocumentRequest;
import com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@RequiredArgsConstructor
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OcrMessageReceiver {
public static final String OCR_REQUEST_LISTENER_ID = "ocr-request-listener";
FileStorageService fileStorageService;
ObjectMapper objectMapper;
OCRService ocrService;
IOcrMessageSender ocrMessageSender;
@RabbitHandler
@RabbitListener(id = OCR_REQUEST_LISTENER_ID, concurrency = "1")
public void receiveOcr(Message in) throws IOException {
if (in.getMessageProperties().isRedelivered()) {
throw new AmqpRejectAndDontRequeueException("Redelivered OCR Request, aborting...");
}
DocumentRequest request = objectMapper.readValue(in.getBody(), DocumentRequest.class);
String dossierId = request.getDossierId();
String fileId = request.getFileId();
Path tmpDir = Files.createTempDirectory(null);
try {
log.info("--------------------------------------------------------------------------");
log.info("Start ocr for file with dossierId {} and fileId {}", dossierId, fileId);
ocrMessageSender.sendOCRStarted(fileId);
File documentFile = tmpDir.resolve("document.pdf").toFile();
File viewerDocumentFile = tmpDir.resolve("viewerDocument.pdf").toFile();
File analyzeResultFile = tmpDir.resolve("azureAnalysisResult.json").toFile();
fileStorageService.downloadFiles(request, documentFile);
ocrService.runOcrOnDocument(dossierId, fileId, request.isRemoveWatermarks(), tmpDir, documentFile, viewerDocumentFile, analyzeResultFile);
fileStorageService.storeFiles(request, documentFile, viewerDocumentFile, analyzeResultFile);
ocrMessageSender.sendOcrResponse(dossierId, fileId);
} catch (Exception e) {
log.warn("An exception occurred in ocr file stage: {}", e.getMessage(), e);
in.getMessageProperties().getHeaders().put(MessagingConfiguration.X_ERROR_INFO_HEADER, e.getMessage());
in.getMessageProperties().getHeaders().put(MessagingConfiguration.X_ERROR_INFO_TIMESTAMP_HEADER, OffsetDateTime.now().truncatedTo(ChronoUnit.MILLIS));
throw new RuntimeException(e);
} finally {
FileSystemUtils.deleteRecursively(tmpDir);
}
}
}
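
The working-directory handling in `receiveOcr` (create a temporary directory, do the work, always clean up) can be sketched with the JDK alone. The receiver itself uses Spring's `FileSystemUtils.deleteRecursively`; the sketch swaps in a `Files.walk` based deletion so that it runs without Spring. `TempDirSketch`, `Work` and `runInTempDir` are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TempDirSketch {

    // Hypothetical callback type for the work done inside the temporary directory.
    interface Work {
        void run(Path dir) throws Exception;
    }

    // Runs the work in a fresh temporary directory and removes the directory afterwards, even on failure.
    public static Path runInTempDir(Work work) {
        Path tmpDir;
        try {
            tmpDir = Files.createTempDirectory("ocr-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            work.run(tmpDir);
        } catch (Exception e) {
            // Mirrors the receiver: wrap and rethrow so the caller sees the failure
            throw new RuntimeException(e);
        } finally {
            deleteRecursively(tmpDir);
        }
        return tmpDir;
    }

    // JDK-only replacement for FileSystemUtils.deleteRecursively: delete children before parents.
    public static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            List<Path> deepestFirst = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            for (Path path : deepestFirst) {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path used = runInTempDir(dir -> Files.writeString(dir.resolve("document.pdf"), "placeholder"));
        System.out.println("still exists after run: " + Files.exists(used));
    }
}
```

The `finally` block is what keeps failed OCR runs from leaving downloaded documents behind on disk.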

@ -0,0 +1,59 @@
package com.knecon.fforesight.service.ocr.v1.server.queue;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.boot.autoconfigure.condition.ConditionalOnProperty;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.ocr.processor.service.IOcrMessageSender;
import com.knecon.fforesight.service.ocr.v1.api.model.DocumentRequest;
import com.knecon.fforesight.service.ocr.v1.api.model.OCRStatusUpdateResponse;
import com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration;
import com.knecon.fforesight.tenantcommons.TenantContext;
import lombok.AccessLevel;
import lombok.RequiredArgsConstructor;
import lombok.experimental.FieldDefaults;
import lombok.extern.slf4j.Slf4j;
@Slf4j
@Service
@RequiredArgsConstructor
@ConditionalOnProperty(value = "ocrService.sendStatusUpdates", havingValue = "true")
@FieldDefaults(makeFinal = true, level = AccessLevel.PRIVATE)
public class OcrMessageSender implements IOcrMessageSender {
RabbitTemplate rabbitTemplate;
public void sendOcrFinished(String fileId, int totalImages) {
rabbitTemplate.convertAndSend(MessagingConfiguration.OCR_STATUS_UPDATE_EXCHANGE,
TenantContext.getTenantId(),
OCRStatusUpdateResponse.builder().fileId(fileId).numberOfPagesToOCR(totalImages).numberOfOCRedPages(totalImages).ocrFinished(true).build());
}
public void sendOCRStarted(String fileId) {
rabbitTemplate.convertAndSend(MessagingConfiguration.OCR_STATUS_UPDATE_EXCHANGE,
TenantContext.getTenantId(),
OCRStatusUpdateResponse.builder().fileId(fileId).ocrStarted(true).build());
}
public void sendUpdate(String fileId, int finishedImages, int totalImages) {
rabbitTemplate.convertAndSend(MessagingConfiguration.OCR_STATUS_UPDATE_EXCHANGE,
TenantContext.getTenantId(),
OCRStatusUpdateResponse.builder().fileId(fileId).numberOfPagesToOCR(totalImages).numberOfOCRedPages(finishedImages).build());
}
public void sendOcrResponse(String dossierId, String fileId) {
rabbitTemplate.convertAndSend(MessagingConfiguration.OCR_RESPONSE_EXCHANGE, TenantContext.getTenantId(), new DocumentRequest(dossierId, fileId));
}
}

@ -0,0 +1,70 @@
package com.knecon.fforesight.service.ocr.v1.server.queue;
import static com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration.OCR_DLQ;
import static com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration.OCR_REQUEST_EXCHANGE;
import static com.knecon.fforesight.service.ocr.v1.server.configuration.MessagingConfiguration.OCR_REQUEST_QUEUE_PREFIX;
import java.util.Map;
import java.util.Set;
import org.springframework.amqp.rabbit.annotation.RabbitHandler;
import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.boot.context.event.ApplicationReadyEvent;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.tenantcommons.TenantProvider;
import com.knecon.fforesight.tenantcommons.model.TenantCreatedEvent;
import com.knecon.fforesight.tenantcommons.model.TenantQueueConfiguration;
import com.knecon.fforesight.tenantcommons.model.TenantResponse;
import com.knecon.fforesight.tenantcommons.queue.RabbitQueueFromExchangeService;
import com.knecon.fforesight.tenantcommons.queue.TenantExchangeMessageReceiver;
@Service
public class TenantExchangeMessageReceiverImpl extends TenantExchangeMessageReceiver {
public TenantExchangeMessageReceiverImpl(RabbitQueueFromExchangeService rabbitQueueService, TenantProvider tenantProvider) {
super(rabbitQueueService, tenantProvider);
}
@Override
protected Set<TenantQueueConfiguration> getTenantQueueConfigs() {
return Set.of(TenantQueueConfiguration.builder()
.listenerId(OcrMessageReceiver.OCR_REQUEST_LISTENER_ID)
.exchangeName(OCR_REQUEST_EXCHANGE)
.queuePrefix(OCR_REQUEST_QUEUE_PREFIX)
.dlqName(OCR_DLQ)
.arguments(Map.of("x-max-priority", 2))
.build());
}
@EventListener(ApplicationReadyEvent.class)
public void onApplicationReady() {
System.out.println("application ready invoked");
super.initializeQueues();
}
@RabbitHandler
@RabbitListener(queues = "#{tenantMessagingConfigurationImpl.getTenantCreatedQueueName()}")
public void reactToTenantCreation(TenantCreatedEvent tenantCreatedEvent) {
super.reactToTenantCreation(tenantCreatedEvent);
}
@RabbitHandler
@RabbitListener(queues = "#{tenantMessagingConfigurationImpl.getTenantDeletedQueueName()}")
public void reactToTenantDeletion(TenantResponse tenantResponse) {
super.reactToTenantDeletion(tenantResponse);
}
}
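
The `x-max-priority` argument of 2 declares the tenant request queues as RabbitMQ priority queues. Messages published without a priority are treated as priority 0, and a priority above the declared maximum is treated as the maximum. The following is a simplified in-memory model of that behaviour, not code from this service: the class `PriorityQueueSketch` and its methods are hypothetical, and a real broker also takes consumer prefetch into account.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class PriorityQueueSketch {

    // Mirrors the queue argument Map.of("x-max-priority", 2) from the configuration above.
    static final int MAX_PRIORITY = 2;

    record Message(String body, int effectivePriority, long sequence) {}

    // Higher priority first; messages of equal priority keep their publish order.
    private final PriorityQueue<Message> queue = new PriorityQueue<>(
            Comparator.<Message>comparingInt(Message::effectivePriority).reversed()
                    .thenComparingLong(Message::sequence));
    private long nextSequence;

    // A missing priority counts as 0; anything above the maximum is capped at the maximum.
    static int effectivePriority(Integer requested) {
        if (requested == null) {
            return 0;
        }
        return Math.min(requested, MAX_PRIORITY);
    }

    void publish(String body, Integer requestedPriority) {
        queue.add(new Message(body, effectivePriority(requestedPriority), nextSequence++));
    }

    List<String> drain() {
        List<String> bodies = new ArrayList<>();
        while (!queue.isEmpty()) {
            bodies.add(queue.poll().body());
        }
        return bodies;
    }

    public static void main(String[] args) {
        PriorityQueueSketch sketch = new PriorityQueueSketch();
        sketch.publish("bulk request", null);
        sketch.publish("urgent request", 2);
        sketch.publish("over the limit", 9);
        sketch.publish("normal request", 1);
        System.out.println(sketch.drain());
    }
}
```

With a maximum of 2 there are effectively three levels (0, 1, 2), which is enough to separate, for example, bulk from interactive requests.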

@ -0,0 +1,20 @@
# you can list packages
ghostscript=9.55.0~dfsg1-0ubuntu5.9
pkg-config
zip
unzip
curl
# Ghostscript dependencies which are Ubuntu defaults and therefore not normally installed via apt
libgssapi-krb5-2
libk5crypto3
libkrb5support0
libkeyutils1
libkrb5-3
libbrotli1
# or include links to specific .deb files
# http://ftp.debian.org/debian/pool/contrib/m/msttcorefonts/ttf-mscorefonts-installer_3.8_all.deb
# or add custom apt repos (only required if using packages outside of the standard Ubuntu APT repositories)
# :repo:deb http://cz.archive.ubuntu.com/ubuntu artful main universe

@ -0,0 +1,11 @@
server:
  port: 8097
persistence-service.url: "http://localhost:8085"
tenant-user-management-service.url: "http://localhost:8091/internal"
pdftron.license: demo:1650351709282:7bd235e003000000004ec28a6743e1163a085e2115de2536ab6e2cfe5a
azure:
  endpoint: https://ff-ocr-test.cognitiveservices.azure.com/
  key: # find key in Bitwarden under: Azure IDP Test Key

@ -0,0 +1,65 @@
info:
  description: OCR Service V1 Server
persistence-service.url: "http://persistence-service-v1:8080"
tenant-user-management-service.url: "http://tenant-user-management-service:8080/internal"
fforesight.tenants.remote: true
logging.type: ${LOGGING_TYPE:CONSOLE}
kubernetes.namespace: ${NAMESPACE:default}
project.version: 1.0-SNAPSHOT
server:
  port: 8080
spring:
  application:
    name: ocr-service
  profiles:
    active: kubernetes
  rabbitmq:
    host: ${RABBITMQ_HOST:localhost}
    port: ${RABBITMQ_PORT:5672}
    username: ${RABBITMQ_USERNAME:user}
    password: ${RABBITMQ_PASSWORD:rabbitmq}
    listener:
      simple:
        acknowledge-mode: AUTO
        concurrency: 2
        retry:
          enabled: true
          max-attempts: 3
          max-interval: 15000
        prefetch: 1
fforesight:
  keycloak:
    ignored-endpoints: [ '/actuator/health', '/actuator/health/**' ]
    enabled: true
logging.pattern.level: "%5p [${spring.application.name},%X{traceId:-},%X{spanId:-}]"
management:
  endpoint:
    metrics.enabled: ${monitoring.enabled:false}
    prometheus.enabled: ${monitoring.enabled:false}
    health.enabled: true
  endpoints.web.exposure.include: prometheus, health, metrics
  metrics.export.prometheus.enabled: ${monitoring.enabled:false}
  tracing:
    enabled: ${TRACING_ENABLED:false}
    sampling:
      probability: ${TRACING_PROBABILITY:1.0}
  otlp:
    tracing:
      endpoint: ${OTLP_ENDPOINT:http://otel-collector-opentelemetry-collector.otel-collector:4318/v1/traces}
pdftron.license: ${PDFTRON_LICENSE}
azure:
  endpoint: ${AZURE_ENDPOINT}
  key: ${AZURE_KEY}
ocrService:
  sendStatusUpdates: true
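
The listener retry settings above allow up to three delivery attempts and cap the wait between attempts at 15000 ms. The file does not set `initial-interval` or `multiplier`; the sketch below assumes Spring Boot's documented defaults of 1000 ms and 1.0 for them, and models the usual exponential backoff rule (wait, then multiply, capped at the maximum). `RetryBackoffSketch` and `delaysMillis` are hypothetical names; this is an illustration, not the framework's implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class RetryBackoffSketch {

    // Returns the waits between attempts: maxAttempts attempts imply maxAttempts - 1 waits.
    static List<Long> delaysMillis(int maxAttempts, long initialInterval, double multiplier, long maxInterval) {
        List<Long> delays = new ArrayList<>();
        double next = initialInterval;
        for (int attempt = 1; attempt < maxAttempts; attempt++) {
            // Each wait is capped at maxInterval before the interval grows.
            delays.add((long) Math.min(next, maxInterval));
            next *= multiplier;
        }
        return delays;
    }

    public static void main(String[] args) {
        // Values from the configuration above plus the assumed defaults: prints [1000, 1000]
        System.out.println(delaysMillis(3, 1000, 1.0, 15000));
        // A hypothetical steeper policy, to show the 15000 ms cap taking effect: prints [1000, 4000, 15000, 15000]
        System.out.println(delaysMillis(5, 1000, 4.0, 15000));
    }
}
```

Under these assumptions the cap never applies with a multiplier of 1.0; it only matters once a multiplier greater than 1 is configured.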

@ -0,0 +1,22 @@
------------------------------------------------------------------
| |
| OCR Service V1 Server |
| |
________________________________________________________________
| |
| ___ ________ ________ _________ _________ |
| | | / \ / || || \ |
| | | / ____ \ / _____|| _____||______ \ |
| | | / / \ \| / | | \ | |
| | || | | || \____ | |____ ______/ | |
| | || | | || \ | | | | |
| | || | ___| | \_____ || ____| | _ _/ |
| | || | \ \ | \ || | | | \ \ |
| | | \ \_ \ / ______/ || |_____ | | \ \ |
| | | \ \ \ \ | /| || | \ \ |
| |___| \____\ \___\|_________/ |_________||___| \___\ |
| |
| |
| F r o m d a t a t o i n f o r m a t i o n |
| |
|________________________________________________________________|

@ -0,0 +1,7 @@
spring:
  application:
    name: ocr-service-v1
management.endpoints:
  web.base-path: /
  enabled-by-default: false

@ -0,0 +1,17 @@
<configuration>
<springProperty scope="configuration" name="logType" source="logging.type"/>
<springProperty scope="context" name="application.name" source="spring.application.name"/>
<springProperty scope="context" name="version" source="project.version"/>
<include resource="org/springframework/boot/logging/logback/defaults.xml"/>
<include resource="org/springframework/boot/logging/logback/console-appender.xml"/>
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
<root level="INFO">
<appender-ref ref="${logType}"/>
</root>
</configuration>

@ -0,0 +1,12 @@
{
"dependencies": [
"leptonica"
],
"overrides": [
{
"name": "leptonica",
"version": "1.83.1"
}
],
"builtin-baseline": "3715d743ac08146d9b7714085c1babdba9f262d5"
}

@ -0,0 +1,112 @@
package com.knecon.fforesight.service.ocr.v1.server;
import static org.assertj.core.api.Assertions.assertThat;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.MockitoAnnotations;
import org.mockito.junit.jupiter.MockitoExtension;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
import org.springframework.boot.test.autoconfigure.actuate.observability.AutoConfigureObservability;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.FilterType;
import org.springframework.context.annotation.Import;
import org.springframework.context.annotation.Primary;
import org.springframework.test.context.junit.jupiter.SpringExtension;
import com.iqser.red.commons.jackson.ObjectMapperFactory;
import com.iqser.red.storage.commons.StorageAutoConfiguration;
import com.iqser.red.storage.commons.service.StorageService;
import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService;
import com.knecon.fforesight.service.ocr.processor.initializer.NativeLibrariesInitializer;
import com.knecon.fforesight.tenantcommons.TenantsClient;
import com.pdftron.pdf.PDFNet;
import jakarta.annotation.PostConstruct;
import lombok.SneakyThrows;
@ExtendWith({SpringExtension.class, MockitoExtension.class})
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@Import({AbstractTest.TestConfiguration.class, NativeLibrariesInitializer.class})
@AutoConfigureObservability
public class AbstractTest {
protected static final String TEST_DOSSIER_TEMPLATE_ID = "1337";
protected static final String TEST_DOSSIER_ID = "42";
@MockBean
private TenantsClient tenantsClient;
@Autowired
protected StorageService storageService;
@MockBean
protected RabbitTemplate rabbitTemplate;
private static String pdftronLicense;
@BeforeEach
public void openMocks() {
MockitoAnnotations.openMocks(this);
}
@PostConstruct
@SneakyThrows
public void init() {
PDFNet.initialize(pdftronLicense);
}
@AfterAll
public static void terminatePDFNet() {
PDFNet.terminate();
System.out.println("PDFNet Terminated");
}
@SneakyThrows
public void dummyTest() {
// Build needs one test to not fail.
assertThat(1).isEqualTo(1);
}
@AfterEach
public void cleanupStorage() {
if (this.storageService instanceof FileSystemBackedStorageService) {
((FileSystemBackedStorageService) this.storageService).clearStorage();
}
}
@Configuration
@EnableAutoConfiguration(exclude = {RabbitAutoConfiguration.class})
@ComponentScan(excludeFilters = {@ComponentScan.Filter(type = FilterType.ASSIGNABLE_TYPE, value = StorageAutoConfiguration.class)})
public static class TestConfiguration {
@Bean
@Primary
public StorageService inMemoryStorage() {
return new FileSystemBackedStorageService(ObjectMapperFactory.create());
}
}
}

@ -0,0 +1,119 @@
package com.knecon.fforesight.service.ocr.v1.server;
import static com.iqser.red.pdftronlogic.commons.PdfTextExtraction.extractAllTextFromDocument;
import java.io.File;
import java.io.FileInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.core.io.ClassPathResource;
import com.knecon.fforesight.service.ocr.processor.service.OCRService;
import com.knecon.fforesight.service.ocr.processor.utils.OsUtils;
import lombok.SneakyThrows;
@Disabled // in order to run, the azure.key must be set first in the application.yml
@SpringBootTest()
public class OcrServiceIntegrationTest extends AbstractTest {
@Autowired
private OCRService ocrService;
@Test
@SneakyThrows
public void testOcrWith2000PageFile() {
testOCR("/home/kschuettler/Dokumente/TestFiles/OCR/VV-331340-first100.pdf");
}
@Test
@SneakyThrows
public void testOcrMeto1() {
testOCR("/home/kschuettler/Dokumente/TestFiles/RM syngenta standard/101 S-Metolachlor_RAR_01_Volume_1_2018-09-06.pdf");
}
@Test
@SneakyThrows
public void testOcrWithFile() {
testOCR("/home/kschuettler/Dokumente/TestFiles/syn-dm-testfiles/1.A16148F - Toxicidade oral aguda.pdf");
}
@Test
@SneakyThrows
public void testOcrWithFolder() {
String dir = "/home/kschuettler/Dokumente/TestFiles/BASF/Documine_Test_docs/2013-1110704.pdf";
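List<File> foundFiles;
// Files.walk keeps directory handles open, so close the stream once the files are collected.
try (var paths = Files.walk(Path.of(dir))) {
foundFiles = paths.sorted(Comparator.comparingLong(this::getFileSize))
.map(Path::toFile)
.filter(file -> file.getName().endsWith(".pdf"))
.peek(System.out::println)
.toList();
}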
int fileCount = foundFiles.size();
AtomicInteger processedCount = new AtomicInteger();
System.out.printf("Found %s files, starting OCR for each.%n%n", fileCount);
foundFiles.stream()
.peek(file -> System.out.printf("%s/%s: %s%n", processedCount.incrementAndGet(), fileCount, file)) // 1-based progress counter
.forEach(this::testOCR);
}
@SneakyThrows
public long getFileSize(Path path) {
return Files.size(path);
}
@SneakyThrows
private String testOCR(String fileName) {
if (fileName.startsWith("files")) {
ClassPathResource pdfFileResource = new ClassPathResource(fileName);
return testOCR(pdfFileResource.getFile());
} else {
return testOCR(new File(fileName));
}
}
@SneakyThrows
private String testOCR(File file) {
Path tmpDir = Path.of(OsUtils.getTemporaryDirectory()).resolve("OCR_TEST").resolve(file.toPath().getFileName());
// Create the directory unconditionally; a side effect inside `assert` is skipped when assertions are disabled.
Files.createDirectories(tmpDir);
var documentFile = tmpDir.resolve(Path.of("document.pdf"));
var viewerDocumentFile = tmpDir.resolve(Path.of("viewerDocument.pdf"));
var analyzeResultFile = tmpDir.resolve(Path.of("azureAnalysisResult.json"));
Files.copy(file.toPath(), documentFile, StandardCopyOption.REPLACE_EXISTING);
Files.copy(file.toPath(), viewerDocumentFile, StandardCopyOption.REPLACE_EXISTING);
ocrService.runOcrOnDocument(TEST_DOSSIER_ID, "file", false, tmpDir, documentFile.toFile(), viewerDocumentFile.toFile(), analyzeResultFile.toFile());
System.out.println("File:" + documentFile);
System.out.println("\n\n");
try (var fileStream = new FileInputStream(documentFile.toFile())) {
return extractAllTextFromDocument(fileStream);
}
}
}
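
The folder test above collects PDFs with `Files.walk` and processes the smallest files first. A self-contained sketch of that collection step, assuming nothing from the service itself: `PdfFolderWalkSketch` and `findPdfsSmallestFirst` are hypothetical names, and the demo writes throwaway files into a temporary directory instead of using real test documents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

final class PdfFolderWalkSketch {

    // Collects all regular files ending in ".pdf" below root, ordered by ascending size.
    static List<Path> findPdfsSmallestFirst(Path root) throws IOException {
        // try-with-resources releases the directory handles held by Files.walk.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(path -> path.getFileName().toString().endsWith(".pdf"))
                    .sorted(Comparator.comparingLong(PdfFolderWalkSketch::size))
                    .toList();
        }
    }

    // Files.size throws a checked exception, so wrap it for use inside the comparator.
    private static long size(Path path) {
        try {
            return Files.size(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("ocr-walk-sketch");
        Files.write(root.resolve("big.pdf"), new byte[2048]);
        Files.write(root.resolve("small.pdf"), new byte[16]);
        Files.writeString(root.resolve("notes.txt"), "ignored");
        findPdfsSmallestFirst(root).forEach(System.out::println);
    }
}
```

Filtering before sorting avoids computing sizes for directories and non-PDF files; processing small files first surfaces configuration errors quickly before the long-running documents start.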

@ -0,0 +1,20 @@
persistence-service.url: "http://persistence-service-v1:8080"
pdftron.license: demo:1650351709282:7bd235e003000000004ec28a6743e1163a085e2115de2536ab6e2cfe5a
azure:
  endpoint: https://ff-ocr-test.cognitiveservices.azure.com/
  key: # find key in Bitwarden under: Azure IDP Test Key
logging.type: ${LOGGING_TYPE:CONSOLE}
ocrService.sendStatusUpdates: false
management:
  endpoint:
    metrics.enabled: true
    prometheus.enabled: true
    health.enabled: true
  endpoints.web.exposure.include: prometheus, health, metrics
  metrics.export.prometheus.enabled: true
POD_NAME: azure-ocr-service
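
Values such as `${LOGGING_TYPE:CONSOLE}` in the configuration files are placeholders: the part before the first colon names a property or environment variable, and the rest is the fallback used when that name is not set. A rough sketch of that resolution rule, assuming a plain map as the only property source: `PlaceholderSketch` and `resolve` are hypothetical, and Spring's actual resolver additionally supports nesting and multiple property sources.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderSketch {

    // Group 1: name up to the first ':'; group 2 (optional): default value up to the closing brace.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^:}]+)(?::([^}]*))?\\}");

    static String resolve(String value, Map<String, String> environment) {
        Matcher matcher = PLACEHOLDER.matcher(value);
        StringBuilder resolved = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String fallback = matcher.group(2); // null when no default was given
            String replacement = environment.getOrDefault(name, fallback);
            if (replacement == null) {
                // This sketch treats a missing value without a default as an error.
                throw new IllegalArgumentException("Could not resolve placeholder '" + name + "'");
            }
            matcher.appendReplacement(resolved, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("${LOGGING_TYPE:CONSOLE}", Map.of()));
        System.out.println(resolve("${LOGGING_TYPE:CONSOLE}", Map.of("LOGGING_TYPE", "JSON")));
    }
}
```

Because only the first colon separates name and default, a default that itself contains colons, such as a URL, is kept intact.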

@ -0,0 +1,16 @@
<Configuration>
<Appenders>
<Console name="CONSOLE" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
</Appenders>
<Loggers>
<Root level="warn">
<AppenderRef ref="CONSOLE"/>
</Root>
<Logger name="com.knecon" level="info"/>
</Loggers>
</Configuration>

15
buildSrc/build.gradle.kts Normal file

@ -0,0 +1,15 @@
/*
* This file was generated by the Gradle 'init' task.
*
* This project uses @Incubating APIs which are subject to change.
*/
plugins {
// Support convention plugins written in Kotlin. Convention plugins are build scripts in 'src/main' that automatically become available as plugins in the main build.
`kotlin-dsl`
}
repositories {
// Use the plugin portal to apply community plugins in convention plugins.
gradlePluginPortal()
}

@ -0,0 +1,86 @@
plugins {
`java-library`
`maven-publish`
pmd
checkstyle
jacoco
}
group = "com.knecon.fforesight"
java.sourceCompatibility = JavaVersion.VERSION_17
java.targetCompatibility = JavaVersion.VERSION_17
tasks.pmdMain {
pmd.ruleSetFiles = files("${rootDir}/config/pmd/pmd.xml")
}
tasks.pmdTest {
pmd.ruleSetFiles = files("${rootDir}/config/pmd/test_pmd.xml")
}
tasks.named<Test>("test") {
useJUnitPlatform()
reports {
junitXml.outputLocation.set(layout.buildDirectory.dir("reports/junit"))
}
minHeapSize = "512m"
maxHeapSize = "2048m"
}
tasks.test {
finalizedBy(tasks.jacocoTestReport) // report is always generated after tests run
}
tasks.jacocoTestReport {
dependsOn(tasks.test) // tests are required to run before generating the report
reports {
xml.required.set(true)
csv.required.set(false)
html.outputLocation.set(layout.buildDirectory.dir("jacocoHtml"))
}
}
allprojects {
tasks.withType<Javadoc> {
options {
this as StandardJavadocDocletOptions
addBooleanOption("Xdoclint:none", true)
addStringOption("Xmaxwarns", "1")
}
}
publishing {
publications {
create<MavenPublication>(name) {
from(components["java"])
}
}
repositories {
maven {
url = uri("https://nexus.knecon.com/repository/red-platform-releases/")
credentials {
username = providers.gradleProperty("mavenUser").getOrNull()
password = providers.gradleProperty("mavenPassword").getOrNull()
}
}
}
}
}
java {
withJavadocJar()
}
repositories {
mavenLocal()
mavenCentral()
maven {
url = uri("https://nexus.knecon.com/repository/gindev/")
credentials {
username = providers.gradleProperty("mavenUser").getOrNull()
password = providers.gradleProperty("mavenPassword").getOrNull()
}
}
}

@ -0,0 +1,39 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE module PUBLIC "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
"http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<module name="Checker">
<property
name="severity"
value="error"/>
<module name="TreeWalker">
<module name="SuppressWarningsHolder"/>
<module name="MissingDeprecated"/>
<module name="MissingOverride"/>
<module name="AnnotationLocation"/>
<module name="JavadocStyle"/>
<module name="NonEmptyAtclauseDescription"/>
<module name="IllegalImport"/>
<module name="RedundantImport"/>
<module name="RedundantModifier"/>
<module name="EmptyBlock"/>
<module name="DefaultComesLast"/>
<module name="EmptyStatement"/>
<module name="EqualsHashCode"/>
<module name="ExplicitInitialization"/>
<module name="IllegalInstantiation"/>
<module name="ModifiedControlVariable"/>
<module name="MultipleVariableDeclarations"/>
<module name="PackageDeclaration"/>
<module name="ParameterAssignment"/>
<module name="SimplifyBooleanExpression"/>
<module name="SimplifyBooleanReturn"/>
<module name="StringLiteralEquality"/>
<module name="OneStatementPerLine"/>
<module name="FinalClass"/>
<module name="ArrayTypeStyle"/>
<module name="UpperEll"/>
<module name="OuterTypeFilename"/>
</module>
<module name="FileTabCharacter"/>
<module name="SuppressWarningsFilter"/>
</module>

20
config/pmd/pmd.xml Normal file

@ -0,0 +1,20 @@
<?xml version="1.0"?>
<ruleset name="Custom ruleset"
xmlns="http://pmd.sourceforge.net/ruleset/2.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 http://pmd.sourceforge.net/ruleset_2_0_0.xsd">
<description>
Knecon ruleset checks the code for bad stuff
</description>
<rule ref="category/java/errorprone.xml">
<exclude name="MissingSerialVersionUID"/>
<exclude name="AvoidLiteralsInIfCondition"/>
<exclude name="AvoidDuplicateLiterals"/>
<exclude name="NullAssignment"/>
<exclude name="AssignmentInOperand"/>
<exclude name="BeanMembersShouldSerialize"/>
</rule>
</ruleset>

22
config/pmd/test_pmd.xml Normal file

@ -0,0 +1,22 @@
<?xml version="1.0"?>
<ruleset name="Custom ruleset"
xmlns="http://pmd.sourceforge.net/ruleset/2.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://pmd.sourceforge.net/ruleset/2.0.0 http://pmd.sourceforge.net/ruleset_2_0_0.xsd">
<description>
Knecon test ruleset checks the code for bad stuff
</description>
<rule ref="category/java/errorprone.xml">
<exclude name="MissingSerialVersionUID"/>
<exclude name="AvoidLiteralsInIfCondition"/>
<exclude name="AvoidDuplicateLiterals"/>
<exclude name="NullAssignment"/>
<exclude name="AssignmentInOperand"/>
<exclude name="TestClassWithoutTestCases"/>
<exclude name="BeanMembersShouldSerialize"/>
</rule>
</ruleset>

1
gradle.properties.kts Normal file

@ -0,0 +1 @@
version = 0.1-SNAPSHOT

46
publish-custom-image.sh Executable file

@ -0,0 +1,46 @@
#!/bin/bash
set -e
dir=${PWD##*/}
gradle assemble
# Get the current Git branch
branch=$(git rev-parse --abbrev-ref HEAD)
# Get the short commit hash (first 5 characters)
commit_hash=$(git rev-parse --short=5 HEAD)
# Combine branch and commit hash
buildName="${USER}-${branch}-${commit_hash}"
gradle bootBuildImage --publishImage -PbuildbootDockerHostNetwork=true -Pversion=${buildName}
newImageName="nexus.knecon.com:5001/ff/azure-ocr-service-server:$buildName"
echo "full image name:"
echo ${newImageName}
echo ""
if [ -z "$1" ]; then
exit 0
fi
namespace=${1}
deployment_name="ocr-service-v1"
echo "deploying to ${namespace}"
oldImageName=$(rancher kubectl -n ${namespace} get deployment ${deployment_name} -o=jsonpath='{.spec.template.spec.containers[*].image}')
if [ "${newImageName}" = "${oldImageName}" ]; then
echo "Image tag did not change, redeploying..."
rancher kubectl rollout restart deployment ${deployment_name} -n ${namespace}
else
echo "upgrading the image tag..."
rancher kubectl set image deployment/${deployment_name} ${deployment_name}=${newImageName} -n ${namespace}
fi
rancher kubectl rollout status deployment ${deployment_name} -n ${namespace}
echo "Built ${deployment_name}:${buildName} and deployed to ${namespace}"

7
settings.gradle.kts Normal file

@ -0,0 +1,7 @@
rootProject.name = "azure-ocr-service"
include(":azure-ocr-service-api")
include(":azure-ocr-service-server")
include(":azure-ocr-service-processor")
project(":azure-ocr-service-api").projectDir = file("azure-ocr-service/azure-ocr-service-api")
project(":azure-ocr-service-server").projectDir = file("azure-ocr-service/azure-ocr-service-server")
project(":azure-ocr-service-processor").projectDir = file("azure-ocr-service/azure-ocr-service-processor")