Compare commits

2 Commits

Author SHA1 Message Date
yhampe
0f11d901f6 RED-7964: Prototype Visual Layout Parsing:
added layer for visual layout parsing results (true by default)
added label to drawn boxes
2023-12-21 08:54:03 +01:00
yhampe
be2350b289 RED-7964: Prototype Visual Layout Parsing:
drawing result from visual layout parsing into viewer document
2023-12-12 15:28:17 +01:00
67 changed files with 2526 additions and 238 deletions

View File

@ -1,39 +1,44 @@
# PDF Layout Parser Micro-Service: layout-parser # PDF Layout Parser Micro-Service: layout-parser
## Introduction ## Introduction
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache
PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser
micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
### Key Steps in the PDF Layout Parsing Process: ### Key Steps in the PDF Layout Parsing Process:
* **Text Position Extraction:** * **Text Position Extraction:**
The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document. The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
* **Word Segmentation and Text Block Formation:** * **Word Segmentation and Text Block Formation:**
Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks. Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
* **Text Block Classification:** * **Text Block Classification:**
The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties, distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers. The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties,
distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
* **Layout Coherence Establishment:** * **Layout Coherence Establishment:**
The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other elements in a logical and structured manner. The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other
elements in a logical and structured manner.
* **Output Generation in Various Formats:** * **Output Generation in Various Formats:**
Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption. Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream
micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
### Optional Enhancements: ### Optional Enhancements:
* **ML-Based Table Extraction:** * **ML-Based Table Extraction:**
For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON file, which are then integrated seamlessly into the layout structure. For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON
file, which are then integrated seamlessly into the layout structure.
* **Image Classification using ML:** * **Image Classification using ML:**
Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification. Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction
option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It
proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream
micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
## Installation ## Installation
@ -49,41 +54,55 @@ To build and test the micro-service, follow these steps:
### Clone the Repository: ### Clone the Repository:
bash bash
``` ```
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
cd layout-parser cd layout-parser
``` ```
### Build the Project: ### Build the Project:
Use the following command to build the project using Gradle: Use the following command to build the project using Gradle:
``` ```
gradle clean build gradle clean build
``` ```
### Run Tests: ### Run Tests:
Run the test suite using the following command: Run the test suite using the following command:
``` ```
gradle test gradle test
``` ```
## Building a Custom Docker Image ## Building a Custom Docker Image
To create a custom Docker image for the layout-parser micro-service, execute the provided script: To create a custom Docker image for the layout-parser micro-service, execute the provided script:
### Ensure Docker is Installed: ### Ensure Docker is Installed:
Ensure that Docker is installed and running on your system. Ensure that Docker is installed and running on your system.
### Run the Image Building Script: ### Run the Image Building Script:
Execute the publish-custom-image script in the project directory: Execute the publish-custom-image script in the project directory:
``` ```
./publish-custom-image ./publish-custom-image
``` ```
## Publishing to Internal Maven Repository ## Publishing to Internal Maven Repository
To publish the layout-parser micro-service to your internal Maven repository, execute the following command: To publish the layout-parser micro-service to your internal Maven repository, execute the following command:
``` ```
gradle -Pversion=buildVersion publish gradle -Pversion=buildVersion publish
``` ```
Replace buildVersion with the desired version number. Replace buildVersion with the desired version number.
## Additional Notes ## Additional Notes
Make sure to configure any necessary application properties before deploying the micro-service. Make sure to configure any necessary application properties before deploying the micro-service.
For advanced usage and configurations, refer to Kilian or Dom or preferably the source code. For advanced usage and configurations, refer to Kilian or Dom or preferably the source code.
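The word segmentation step that the README describes can be pictured roughly as follows. This is a minimal sketch and not the service's implementation: `Glyph`, `segmentWords`, and the gap threshold are hypothetical names and values chosen for illustration, whereas the real service works on the text positions extracted by Apache PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of layout-parser.
class WordSegmentationSketch {

    // A single character with its horizontal extent on one text line.
    record Glyph(char c, double x, double width) {}

    // Groups glyphs (assumed to be sorted left to right on one line) into words.
    // A new word starts when the gap to the previous glyph exceeds maxGap.
    static List<String> segmentWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousEnd = Double.NaN;
        for (Glyph g : glyphs) {
            boolean gapTooLarge = !Double.isNaN(previousEnd) && g.x() - previousEnd > maxGap;
            if (gapTooLarge && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.c());
            previousEnd = g.x() + g.width();
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> line = List.of(
                new Glyph('P', 0, 5), new Glyph('D', 5, 5), new Glyph('F', 10, 5),
                new Glyph('o', 20, 5), new Glyph('k', 25, 5));
        System.out.println(segmentWords(line, 2.0)); // prints [PDF, ok]
    }
}
```

A fixed gap threshold is the simplest possible choice; a production parser would more plausibly scale it with font size or character width.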

View File

@ -1 +1 @@
version = 0.1-SNAPSHOT version = 0.1 - SNAPSHOT

View File

@ -25,7 +25,6 @@ public class DocumentStructure {
@Schema(description = "The root EntryData represents the Document.") @Schema(description = "The root EntryData represents the Document.")
EntryData root; EntryData root;
@Schema(description = "Object containing the extra field names, a table has in its properties field.") @Schema(description = "Object containing the extra field names, a table has in its properties field.")
public static class TableProperties { public static class TableProperties {
@ -56,6 +55,7 @@ public class DocumentStructure {
public static final String RECTANGLE_DELIMITER = ";"; public static final String RECTANGLE_DELIMITER = ";";
public static Rectangle2D parseRectangle2D(String bBox) { public static Rectangle2D parseRectangle2D(String bBox) {
List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList(); List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList();
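The hunk cuts off after the first line of `parseRectangle2D`, so the rest of the method is not visible here. The sketch below shows one way such a parser could be completed. It assumes the string holds exactly four values in the order x, y, w, h, which is the order the bounding-box description in `RowData` gives; the class name and the length check are illustrative additions.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the real method body is not shown in the diff.
class RectangleParsingSketch {

    static final String RECTANGLE_DELIMITER = ";";

    // Parses "x;y;w;h" into a rectangle. The value order is an assumption.
    static Rectangle2D parseRectangle2D(String bBox) {
        List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList();
        if (floats.size() != 4) {
            throw new IllegalArgumentException("Expected 4 values but got " + floats.size());
        }
        return new Rectangle2D.Float(floats.get(0), floats.get(1), floats.get(2), floats.get(3));
    }

    public static void main(String[] args) {
        Rectangle2D r = parseRectangle2D("1.5;2;10;20");
        System.out.println(r.getX() + " " + r.getY() + " " + r.getWidth() + " " + r.getHeight());
    }
}
```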

View File

@ -17,4 +17,5 @@ public class RowData {
List<ParagraphData> cellText; List<ParagraphData> cellText;
@Schema(description = "The bounding box of this StructureObject. Is always exactly 4 values representing x, y, w, h, where x, y specify the lower left corner.") @Schema(description = "The bounding box of this StructureObject. Is always exactly 4 values representing x, y, w, h, where x, y specify the lower left corner.")
float[] bBox; float[] bBox;
} }

View File

@ -8,13 +8,9 @@ import lombok.Builder;
@Builder @Builder
@Schema(description = "Object containing information about the layout parsing.") @Schema(description = "Object containing information about the layout parsing.")
public record LayoutParsingFinishedEvent( public record LayoutParsingFinishedEvent(
@Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.") @Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.") Map<String, String> identifier,//
Map<String, String> identifier,// @Schema(description = "The duration of a single layout parsing in ms.") long duration,//
@Schema(description = "The duration of a single layout parsing in ms.") @Schema(description = "The number of pages of the parsed document.") int numberOfPages,//
long duration,// @Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.") String message) {
@Schema(description = "The number of pages of the parsed document.")
int numberOfPages,//
@Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.")
String message) {
} }

View File

@ -5,4 +5,5 @@ public class LayoutParsingQueueNames {
public static final String LAYOUT_PARSING_REQUEST_QUEUE = "layout_parsing_request_queue"; public static final String LAYOUT_PARSING_REQUEST_QUEUE = "layout_parsing_request_queue";
public static final String LAYOUT_PARSING_DLQ = "layout_parsing_dead_letter_queue"; public static final String LAYOUT_PARSING_DLQ = "layout_parsing_dead_letter_queue";
public static final String LAYOUT_PARSING_FINISHED_EVENT_QUEUE = "layout_parsing_response_queue"; public static final String LAYOUT_PARSING_FINISHED_EVENT_QUEUE = "layout_parsing_response_queue";
} }

View File

@ -17,24 +17,35 @@ public record LayoutParsingRequest(
Map<String, String> identifier, Map<String, String> identifier,
@Schema(description = "Path to the original PDF file.")// @Schema(description = "Path to the original PDF file.")//
@NonNull String originFileStorageId,// @NonNull String originFileStorageId,
//
@Schema(description = "Optional Path to the visual layout parsing service file") Optional<String> visualLayoutParsingFileId,
@Schema(description = "Optional Path to the table extraction file.")// @Schema(description = "Optional Path to the table extraction file.")//
Optional<String> tablesFileStorageId,// Optional<String> tablesFileStorageId,
//
@Schema(description = "Optional Path to the image classification file.")// @Schema(description = "Optional Path to the image classification file.")//
Optional<String> imagesFileStorageId,// Optional<String> imagesFileStorageId,
//
@Schema(description = "Path where the Document Structure File will be stored.")// @Schema(description = "Path where the Document Structure File will be stored.")//
@NonNull String structureFileStorageId,// @NonNull String structureFileStorageId,
//
@Schema(description = "Path where the Research Data File will be stored.")// @Schema(description = "Path where the Research Data File will be stored.")//
String researchDocumentStorageId,// String researchDocumentStorageId,
//
@Schema(description = "Path where the Document Text File will be stored.")// @Schema(description = "Path where the Document Text File will be stored.")//
@NonNull String textBlockFileStorageId,// @NonNull String textBlockFileStorageId,
//
@Schema(description = "Path where the Document Positions File will be stored.")// @Schema(description = "Path where the Document Positions File will be stored.")//
@NonNull String positionBlockFileStorageId,// @NonNull String positionBlockFileStorageId,
//
@Schema(description = "Path where the Document Pages File will be stored.")// @Schema(description = "Path where the Document Pages File will be stored.")//
@NonNull String pageFileStorageId,// @NonNull String pageFileStorageId,
//
@Schema(description = "Path where the Simplified Text File will be stored.")// @Schema(description = "Path where the Simplified Text File will be stored.")//
@NonNull String simplifiedTextStorageId,// @NonNull String simplifiedTextStorageId,
//
@Schema(description = "Path where the Viewer Document PDF will be stored.")// @Schema(description = "Path where the Viewer Document PDF will be stored.")//
@NonNull String viewerDocumentStorageId) { @NonNull String viewerDocumentStorageId) {

View File

@ -30,9 +30,12 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageB
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.CvTableParsingAdapter; import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.CvTableParsingAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.ImageServiceResponseAdapter; import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.ImageServiceResponseAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.VisualLayoutParsingAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService; import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
import com.knecon.fforesight.service.layoutparser.processor.services.RulingCleaningService; import com.knecon.fforesight.service.layoutparser.processor.services.RulingCleaningService;
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService; import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
@ -71,6 +74,7 @@ public class LayoutParsingPipeline {
private final BodyTextFrameService bodyTextFrameService; private final BodyTextFrameService bodyTextFrameService;
private final RulingCleaningService rulingCleaningService; private final RulingCleaningService rulingCleaningService;
private final TableExtractionService tableExtractionService; private final TableExtractionService tableExtractionService;
private final VisualLayoutParsingAdapter visualLayoutParsingAdapter;
private final TaasBlockificationService taasBlockificationService; private final TaasBlockificationService taasBlockificationService;
private final DocuMineBlockificationService docuMineBlockificationService; private final DocuMineBlockificationService docuMineBlockificationService;
private final RedactManagerBlockificationService redactManagerBlockificationService; private final RedactManagerBlockificationService redactManagerBlockificationService;
@ -92,6 +96,11 @@ public class LayoutParsingPipeline {
tableServiceResponse = layoutParsingStorageService.getTablesFile(layoutParsingRequest.tablesFileStorageId().get()); tableServiceResponse = layoutParsingStorageService.getTablesFile(layoutParsingRequest.tablesFileStorageId().get());
} }
VisualLayoutParsingResponse visualLayoutParsingResponse = new VisualLayoutParsingResponse();
if (layoutParsingRequest.visualLayoutParsingFileId().isPresent()) {
visualLayoutParsingResponse = layoutParsingStorageService.getExtractedTablesFile(layoutParsingRequest.visualLayoutParsingFileId().get());
}
ClassificationDocument classificationDocument = parseLayout(layoutParsingRequest.layoutParsingType(), originDocument, imageServiceResponse, tableServiceResponse); ClassificationDocument classificationDocument = parseLayout(layoutParsingRequest.layoutParsingType(), originDocument, imageServiceResponse, tableServiceResponse);
Document documentGraph = DocumentGraphFactory.buildDocumentGraph(classificationDocument); Document documentGraph = DocumentGraphFactory.buildDocumentGraph(classificationDocument);
@ -100,8 +109,9 @@ public class LayoutParsingPipeline {
layoutParsingStorageService.storeDocumentData(layoutParsingRequest, DocumentDataMapper.toDocumentData(documentGraph)); layoutParsingStorageService.storeDocumentData(layoutParsingRequest, DocumentDataMapper.toDocumentData(documentGraph));
layoutParsingStorageService.storeSimplifiedText(layoutParsingRequest, simplifiedSectionTextService.toSimplifiedText(documentGraph)); layoutParsingStorageService.storeSimplifiedText(layoutParsingRequest, simplifiedSectionTextService.toSimplifiedText(documentGraph));
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells = visualLayoutParsingAdapter.buildExtractedTablesPerPage(visualLayoutParsingResponse);
try (var out = new ByteArrayOutputStream()) { try (var out = new ByteArrayOutputStream()) {
viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, false); viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, extractedTableCells, false);
layoutParsingStorageService.storeViewerDocument(layoutParsingRequest, out); layoutParsingStorageService.storeViewerDocument(layoutParsingRequest, out);
} }
@ -244,9 +254,9 @@ public class LayoutParsingPipeline {
private void increaseDocumentStatistics(ClassificationPage classificationPage, ClassificationDocument document) { private void increaseDocumentStatistics(ClassificationPage classificationPage, ClassificationDocument document) {
if (!classificationPage.isLandscape()) { if (!classificationPage.isLandscape()) {
document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue()); document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue());
} }
document.getFontCounter().addAll(classificationPage.getFontCounter().getCountPerValue()); document.getFontCounter().addAll(classificationPage.getFontCounter().getCountPerValue());
document.getTextHeightCounter().addAll(classificationPage.getTextHeightCounter().getCountPerValue()); document.getTextHeightCounter().addAll(classificationPage.getTextHeightCounter().getCountPerValue());
document.getFontStyleCounter().addAll(classificationPage.getFontStyleCounter().getCountPerValue()); document.getFontStyleCounter().addAll(classificationPage.getFontStyleCounter().getCountPerValue());
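The pipeline change in this file treats the visual layout parsing file as optional: it starts from an empty response and only loads the file when an id is present. The same pattern can be sketched in isolation as below. `Response`, `loadOrEmpty`, and the loader function are hypothetical stand-ins; the real code uses `VisualLayoutParsingResponse` and `LayoutParsingStorageService`, whose loader can also throw `IOException`.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration of the "optional input with empty fallback" pattern.
class OptionalInputSketch {

    // Stand-in for the response type; holds whatever the file contained.
    record Response(List<String> data) {

        static Response empty() {
            return new Response(List.of());
        }
    }

    // Loads the response only if an id was supplied, otherwise returns an empty one.
    static Response loadOrEmpty(Optional<String> fileId, Function<String, Response> loader) {
        return fileId.map(loader).orElseGet(Response::empty);
    }

    public static void main(String[] args) {
        Function<String, Response> loader = id -> new Response(List.of("loaded:" + id));
        System.out.println(loadOrEmpty(Optional.of("file-1"), loader).data());
        System.out.println(loadOrEmpty(Optional.empty(), loader).data());
    }
}
```

Returning an empty response instead of `null` keeps downstream code, such as the per-page grouping, free of null checks, provided the empty response exposes an empty list rather than a null one.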

View File

@ -24,6 +24,7 @@ import com.knecon.fforesight.service.layoutparser.internal.api.data.taas.Researc
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingRequest; import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingRequest;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.tenantcommons.TenantContext; import com.knecon.fforesight.tenantcommons.TenantContext;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
@ -74,6 +75,16 @@ public class LayoutParsingStorageService {
} }
public VisualLayoutParsingResponse getExtractedTablesFile(String storageId) throws IOException {
// try-with-resources closes the stream on exit, so no explicit close() call is needed
try (InputStream inputStream = getObject(storageId)) {
return objectMapper.readValue(inputStream, VisualLayoutParsingResponse.class);
}
}
public void storeDocumentData(LayoutParsingRequest layoutParsingRequest, DocumentData documentData) { public void storeDocumentData(LayoutParsingRequest layoutParsingRequest, DocumentData documentData) {
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.structureFileStorageId(), documentData.getDocumentStructure()); storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.structureFileStorageId(), documentData.getDocumentStructure());
@ -83,7 +94,6 @@ public class LayoutParsingStorageService {
} }
public void storeResearchDocumentData(LayoutParsingRequest layoutParsingRequest, ResearchDocumentData researchDocumentData) { public void storeResearchDocumentData(LayoutParsingRequest layoutParsingRequest, ResearchDocumentData researchDocumentData) {
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.researchDocumentStorageId(), researchDocumentData); storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.researchDocumentStorageId(), researchDocumentData);
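The storage service reads the visual layout parsing file inside a try-with-resources block. As a reminder of what that construct guarantees, the sketch below uses a hypothetical `TrackingStream` to show that the stream is closed automatically when the block exits, without an explicit `close()` call. It reads plain bytes rather than JSON so that it needs no third-party library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of layout-parser.
class TryWithResourcesSketch {

    // Stream that records whether close() was called.
    static class TrackingStream extends ByteArrayInputStream {

        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Reads the whole stream; try-with-resources closes it on exit.
    static String readAll(InputStream in) {
        try (in) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackingStream stream = new TrackingStream("{}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(stream) + " closed=" + stream.closed);
    }
}
```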

View File

@ -14,7 +14,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.StringFre
import lombok.Data; import lombok.Data;
import lombok.NonNull; import lombok.NonNull;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
@Data @Data
@RequiredArgsConstructor @RequiredArgsConstructor

View File

@ -19,4 +19,5 @@ public class PageContents {
Rectangle2D cropBox; Rectangle2D cropBox;
Rectangle2D mediaBox; Rectangle2D mediaBox;
List<Ruling> rulings; List<Ruling> rulings;
} }

View File

@ -108,11 +108,13 @@ public class Boundary implements Comparable<Boundary> {
return splitBoundaries; return splitBoundaries;
} }
public IntStream intStream() { public IntStream intStream() {
return IntStream.range(start, end); return IntStream.range(start, end);
} }
public static Boundary merge(Collection<Boundary> boundaries) { public static Boundary merge(Collection<Boundary> boundaries) {
int minStart = boundaries.stream().mapToInt(Boundary::start).min().orElseThrow(IllegalArgumentException::new); int minStart = boundaries.stream().mapToInt(Boundary::start).min().orElseThrow(IllegalArgumentException::new);
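Only the first line of `merge` is visible in this hunk. A plausible reading is that it spans the given boundaries from the smallest start to the largest end. The sketch below implements that reading on a hypothetical `BoundarySketch` record; the use of the maximum end is an assumption, not something the diff confirms.

```java
import java.util.Collection;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical illustration of a half-open interval [start, end).
record BoundarySketch(int start, int end) {

    // Streams every index covered by this boundary, end exclusive.
    IntStream intStream() {
        return IntStream.range(start, end);
    }

    // Assumed semantics: the smallest boundary covering all inputs.
    static BoundarySketch merge(Collection<BoundarySketch> boundaries) {
        int minStart = boundaries.stream().mapToInt(BoundarySketch::start).min().orElseThrow(IllegalArgumentException::new);
        int maxEnd = boundaries.stream().mapToInt(BoundarySketch::end).max().orElseThrow(IllegalArgumentException::new);
        return new BoundarySketch(minStart, maxEnd);
    }

    public static void main(String[] args) {
        BoundarySketch merged = merge(List.of(new BoundarySketch(3, 5), new BoundarySketch(1, 2)));
        System.out.println(merged + " covers " + merged.intStream().count() + " indices");
    }
}
```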

View File

@ -105,6 +105,7 @@ public class Document implements GenericSemanticNode {
return streamAllSubNodes().collect(Collectors.groupingBy(SemanticNode::getType, Collectors.counting())); return streamAllSubNodes().collect(Collectors.groupingBy(SemanticNode::getType, Collectors.counting()));
} }
@Override @Override
public String toString() { public String toString() {

View File

@ -207,6 +207,7 @@ public class Table implements SemanticNode {
return IntStream.range(0, numberOfCols).boxed().map(col -> getCell(row, col)); return IntStream.range(0, numberOfCols).boxed().map(col -> getCell(row, col));
} }
/** /**
* Streams all TableCells row-wise and filters them with header == true. * Streams all TableCells row-wise and filters them with header == true.
* *

View File

@ -109,10 +109,7 @@ public class AtomicTextBlock implements TextBlock {
} }
public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData, public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData, DocumentPositionData documentPositionData, SemanticNode parent, Page page) {
DocumentPositionData documentPositionData,
SemanticNode parent,
Page page) {
return AtomicTextBlock.builder() return AtomicTextBlock.builder()
.id(documentTextData.getId()) .id(documentTextData.getId())

View File

@ -1,14 +1,12 @@
package com.knecon.fforesight.service.layoutparser.processor.model.table; package com.knecon.fforesight.service.layoutparser.processor.model.table;
import java.awt.geom.Point2D; import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList; import java.util.ArrayList;
import java.util.Collections; import java.util.Collections;
import java.util.HashSet; import java.util.HashSet;
import java.util.List; import java.util.List;
import java.util.Set; import java.util.Set;
import java.util.TreeMap; import java.util.TreeMap;
import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType; import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
@ -50,6 +48,7 @@ public class TablePageBlock extends AbstractPageBlock {
return getColCount() == 0 || getRowCount() == 0; return getColCount() == 0 || getRowCount() == 0;
} }
public List<List<Cell>> getRows() { public List<List<Cell>> getRows() {
if (rows == null) { if (rows == null) {
@ -276,21 +275,17 @@ public class TablePageBlock extends AbstractPageBlock {
} }
public boolean intersects(Cell cell1, Cell cell2) { public boolean intersects(Cell cell1, Cell cell2) {
if (cell1.getHeight() <= 0 || cell2.getHeight() <= 0) { if (cell1.getHeight() <= 0 || cell2.getHeight() <= 0) {
return false; return false;
} }
double x0 = cell1.getX() + 2; double x0 = cell1.getX() + 2;
double y0 = cell1.getY() + 2; double y0 = cell1.getY() + 2;
return (cell2.x + cell2.width > x0 && return (cell2.x + cell2.width > x0 && cell2.y + cell2.height > y0 && cell2.x < x0 + cell1.getWidth() - 2 && cell2.y < y0 + cell1.getHeight() - 2);
cell2.y + cell2.height > y0 &&
cell2.x < x0 + cell1.getWidth() -2 &&
cell2.y < y0 + cell1.getHeight() -2);
} }
@Override @Override
public String getText() { public String getText() {
@ -328,8 +323,6 @@ public class TablePageBlock extends AbstractPageBlock {
} }
public String getTextAsHtml() { public String getTextAsHtml() {
StringBuilder sb = new StringBuilder(); StringBuilder sb = new StringBuilder();
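The `intersects` check in this file can be tried out in isolation. The sketch below restates the same comparison on `java.awt.geom.Rectangle2D` instead of the project's `Cell` type, which is a substitution made only so the example is self-contained; the class name is hypothetical.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical illustration; mirrors the comparison shown in the diff.
class CellIntersectionSketch {

    // Returns true if b overlaps a, ignoring overlaps of up to 2 units
    // at a's lower x and y bounds.
    static boolean intersects(Rectangle2D a, Rectangle2D b) {
        if (a.getHeight() <= 0 || b.getHeight() <= 0) {
            return false;
        }
        double x0 = a.getX() + 2;
        double y0 = a.getY() + 2;
        return b.getX() + b.getWidth() > x0
                && b.getY() + b.getHeight() > y0
                && b.getX() < x0 + a.getWidth() - 2
                && b.getY() < y0 + a.getHeight() - 2;
    }

    public static void main(String[] args) {
        Rectangle2D a = new Rectangle2D.Double(0, 0, 10, 10);
        System.out.println(intersects(a, new Rectangle2D.Double(-5, 0, 6, 10))); // false
        System.out.println(intersects(a, new Rectangle2D.Double(-5, 0, 8, 10))); // true
        System.out.println(intersects(a, new Rectangle2D.Double(9, 0, 5, 10)));  // true
    }
}
```

As written, the 2-unit tolerance shifts only the lower x and y bounds of the first cell: the `+ 2` and `- 2` cancel on the upper bounds, so an overlap of 1 unit on the right side still counts as an intersection. Whether that asymmetry is intended cannot be told from the diff.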

View File

@ -2,6 +2,7 @@ package com.knecon.fforesight.service.layoutparser.processor.model.text;
import java.util.ArrayList; import java.util.ArrayList;
import java.util.List; import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextNormalizationUtilities; import com.knecon.fforesight.service.layoutparser.processor.utils.TextNormalizationUtilities;
import lombok.Getter; import lombok.Getter;

View File

@ -82,6 +82,7 @@ public class TextPageBlock extends AbstractPageBlock {
return fromTextPositionSequences(sequences); return fromTextPositionSequences(sequences);
} }
public static TextPageBlock fromTextPositionSequences(List<TextPositionSequence> wordBlockList) { public static TextPageBlock fromTextPositionSequences(List<TextPositionSequence> wordBlockList) {
TextPageBlock textBlock = null; TextPageBlock textBlock = null;
@ -133,7 +134,6 @@ public class TextPageBlock extends AbstractPageBlock {
} }
/** /**
* Returns the minX value in pdf coordinate system. * Returns the minX value in pdf coordinate system.
* Note: This needs to use Pdf Coordinate System where {0,0} rotated with the page rotation. * Note: This needs to use Pdf Coordinate System where {0,0} rotated with the page rotation.

View File

@ -234,6 +234,7 @@ public class TextPositionSequence implements CharSequence {
@JsonIgnore @JsonIgnore
@JsonAttribute(ignore = true) @JsonAttribute(ignore = true)
public String getFontStyle() { public String getFontStyle() {
if (textPositions.get(0).getFontName() == null) { if (textPositions.get(0).getFontName() == null) {
return "standard"; return "standard";
} }

View File

@ -9,10 +9,10 @@ import java.util.Map;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
@ -20,8 +20,7 @@ import lombok.RequiredArgsConstructor;
@RequiredArgsConstructor @RequiredArgsConstructor
public class ImageServiceResponseAdapter { public class ImageServiceResponseAdapter {
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse) {
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse ) {
Map<Integer, List<ClassifiedImage>> images = new HashMap<>(); Map<Integer, List<ClassifiedImage>> images = new HashMap<>();
imageServiceResponse.getData().forEach(imageMetadata -> { imageServiceResponse.getData().forEach(imageMetadata -> {


@ -0,0 +1,52 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.adapter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingBox;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import lombok.RequiredArgsConstructor;
@Service
@RequiredArgsConstructor
public class VisualLayoutParsingAdapter {
public Map<Integer, List<VisualLayoutParsingResult>> buildExtractedTablesPerPage(VisualLayoutParsingResponse visualLayoutParsingResponse) {
Map<Integer, List<VisualLayoutParsingResult>> tableCells = new HashMap<>();
visualLayoutParsingResponse.getData()
.forEach(tableData -> tableCells.computeIfAbsent(tableData.getPage_idx(), tableCell -> new ArrayList<>()).addAll(convertTableCells(tableData.getBoxes())));
return tableCells;
}
public List<VisualLayoutParsingResult> convertTableCells(List<VisualLayoutParsingBox> tableObjects) {
List<VisualLayoutParsingResult> parsedTableCells = new ArrayList<>();
tableObjects.stream().forEach(t -> {
VisualLayoutParsingResult result = new VisualLayoutParsingResult();
result.setX0(t.getBox().getX1());
result.setX1(t.getBox().getX2());
result.setY0(t.getBox().getY1());
result.setY1(t.getBox().getY2());
result.setWidth(result.getX1() - result.getX0());
result.setHeight(result.getY1() - result.getY0());
result.setLabel(t.getLabel());
parsedTableCells.add(result);
});
return parsedTableCells;
}
}
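
The conversion done by `convertTableCells` and the per-page grouping in `buildExtractedTablesPerPage` can be illustrated in isolation. The following is a minimal sketch, not the service's code: the records `Box`, `PageData` and `Result` are hypothetical stand-ins for the Lombok classes `VisualLayoutParsingBox`/`VisualLayoutParsingBoxValue`, `VisualLayoutParsingData` and `VisualLayoutParsingResult`, and it assumes that (x1, y1) and (x2, y2) are opposite corners of each box with x2 >= x1 and y2 >= y1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VisualLayoutGroupingSketch {

    // Hypothetical stand-ins for the response model classes added in this change.
    record Box(float x1, float y1, float x2, float y2, String label) {}

    record PageData(int pageIdx, List<Box> boxes) {}

    record Result(float x0, float y0, float x1, float y1, float width, float height, String label) {}

    // Maps the corner pair (x1, y1)-(x2, y2) to (x0, y0)-(x1, y1) and derives width and height.
    static Result convert(Box b) {
        return new Result(b.x1(), b.y1(), b.x2(), b.y2(), b.x2() - b.x1(), b.y2() - b.y1(), b.label());
    }

    // Collects all converted boxes under their page index; repeated page entries are appended.
    static Map<Integer, List<Result>> groupPerPage(List<PageData> data) {
        Map<Integer, List<Result>> perPage = new HashMap<>();
        for (PageData page : data) {
            List<Result> results = perPage.computeIfAbsent(page.pageIdx(), idx -> new ArrayList<>());
            for (Box box : page.boxes()) {
                results.add(convert(box));
            }
        }
        return perPage;
    }

    public static void main(String[] args) {
        List<PageData> data = List.of(
                new PageData(0, List.of(new Box(10f, 20f, 110f, 70f, "table"))),
                new PageData(0, List.of(new Box(0f, 0f, 50f, 25f, "figure"))),
                new PageData(1, List.of(new Box(5f, 5f, 15f, 10f, "text"))));
        System.out.println(groupPerPage(data));
    }
}
```

Using `computeIfAbsent` keeps the grouping correct even if the response contains more than one entry for the same page index, which is also why the adapter uses `addAll` rather than `put`.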


@ -0,0 +1,5 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
public class ExtractedTable {
}


@ -0,0 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBox {
private VisualLayoutParsingBoxValue box;
private String label;
private float probability;
}


@ -0,0 +1,19 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBoxValue {
private float x1;
private float y1;
private float x2;
private float y2;
}


@ -0,0 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingData {
private int page_idx;
private List<VisualLayoutParsingBox> boxes;
}


@ -0,0 +1,23 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResponse {
private String dossierId;
private String fileId;
private String targetFileExtension;
private String responseFileExtension;
private String X_TENANT_ID;
private List<VisualLayoutParsingData> data;
}


@ -0,0 +1,22 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResult {
private float x0;
private float y0;
private float x1;
private float y1;
private float width;
private float height;
private String label;
}


@ -25,6 +25,7 @@ public class BodyTextFrameService {
private static final float RULING_HEIGHT_THRESHOLD = 0.15f; // multiplied with page height. Header/Footer Rulings must be within that border of the page. private static final float RULING_HEIGHT_THRESHOLD = 0.15f; // multiplied with page height. Header/Footer Rulings must be within that border of the page.
private static final float RULING_WIDTH_THRESHOLD = 0.75f; // multiplied with page width. Header/Footer Rulings must be at least that wide. private static final float RULING_WIDTH_THRESHOLD = 0.75f; // multiplied with page width. Header/Footer Rulings must be at least that wide.
public void setBodyTextFrames(ClassificationDocument classificationDocument, LayoutParsingType layoutParsingType) { public void setBodyTextFrames(ClassificationDocument classificationDocument, LayoutParsingType layoutParsingType) {
Rectangle bodyTextFrame = calculateBodyTextFrame(classificationDocument.getPages(), classificationDocument.getFontSizeCounter(), false, layoutParsingType); Rectangle bodyTextFrame = calculateBodyTextFrame(classificationDocument.getPages(), classificationDocument.getFontSizeCounter(), false, layoutParsingType);
@ -155,8 +156,9 @@ public class BodyTextFrameService {
continue; continue;
} }
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || MarkedContentUtils.intersects(textBlock,
|| MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)) { page.getMarkedContentBboxPerType(),
MarkedContentUtils.FOOTER)) {
continue; continue;
} }


@ -22,7 +22,6 @@ public class DividingColumnDetectionService {
public List<Rectangle2D> detectColumns(PageContents pageContents) { public List<Rectangle2D> detectColumns(PageContents pageContents) {
if (pageContents.getSortedTextPositionSequences().size() < 2) { if (pageContents.getSortedTextPositionSequences().size() < 2) {
return List.of(pageContents.getCropBox()); return List.of(pageContents.getCropBox());
} }


@ -72,11 +72,13 @@ public class GapDetectionService {
return mirrorY(RectangleTransformations.toRectangle2D(textPosition.getRectangle())); return mirrorY(RectangleTransformations.toRectangle2D(textPosition.getRectangle()));
} }
private static Rectangle2D mirrorY(Rectangle2D rectangle2D) { private static Rectangle2D mirrorY(Rectangle2D rectangle2D) {
return new Rectangle2D.Double(rectangle2D.getX(), Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY()), rectangle2D.getWidth(), Math.abs(rectangle2D.getHeight())); return new Rectangle2D.Double(rectangle2D.getX(), Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY()), rectangle2D.getWidth(), Math.abs(rectangle2D.getHeight()));
} }
private static void addGapToLine(Rectangle2D currentTextPosition, Rectangle2D previousTextPosition, XGapsContext context) { private static void addGapToLine(Rectangle2D currentTextPosition, Rectangle2D previousTextPosition, XGapsContext context) {
context.gapsInCurrentLine.add(new Rectangle2D.Double(previousTextPosition.getMaxX(), context.gapsInCurrentLine.add(new Rectangle2D.Double(previousTextPosition.getMaxX(),


@ -6,7 +6,6 @@ import java.util.LinkedList;
import java.util.List; import java.util.List;
import java.util.Queue; import java.util.Queue;
import java.util.stream.Stream; import java.util.stream.Stream;
import com.iqser.red.commons.jackson.ObjectMapperFactory;
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation; import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
@ -51,7 +50,9 @@ public class GapsAcrossLinesService {
} }
return columnFactory.outputGaps.stream() return columnFactory.outputGaps.stream()
.filter(gapAcrossLines -> columnFactory.outputGaps.stream().filter(gapAcrossLines::intersectsX).noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount)) .filter(gapAcrossLines -> columnFactory.outputGaps.stream()
.filter(gapAcrossLines::intersectsX)
.noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount))
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMinX() - mainBodyTextFrame.getMinX()) > DISTANCE_TO_BORDER_THRESHOLD) .filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMinX() - mainBodyTextFrame.getMinX()) > DISTANCE_TO_BORDER_THRESHOLD)
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMaxX() - mainBodyTextFrame.getMaxX()) > DISTANCE_TO_BORDER_THRESHOLD) .filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMaxX() - mainBodyTextFrame.getMaxX()) > DISTANCE_TO_BORDER_THRESHOLD)
.map(GapAcrossLines::getRectangle2D) .map(GapAcrossLines::getRectangle2D)
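
The first filter in the stream above keeps a gap only if no overlapping gap spans more lines. A minimal sketch of that idea follows. It is not the service's implementation: the record `Gap` is a hypothetical stand-in for `GapAcrossLines`, and `intersectsX` is assumed to mean that the two horizontal intervals overlap, which the diff does not show.

```java
import java.util.List;

class GapFilterSketch {

    // Hypothetical stand-in for GapAcrossLines: a horizontal interval plus the number of lines it spans.
    record Gap(double minX, double maxX, int lineCount) {

        // Assumption: two gaps intersect in x when their open intervals overlap.
        boolean intersectsX(Gap other) {
            return minX < other.maxX && other.minX < maxX;
        }
    }

    // Keeps every gap for which no x-overlapping gap has a strictly greater line count.
    static List<Gap> keepDominant(List<Gap> gaps) {
        return gaps.stream()
                .filter(gap -> gaps.stream()
                        .filter(gap::intersectsX)
                        .noneMatch(other -> other.lineCount() > gap.lineCount()))
                .toList();
    }

    public static void main(String[] args) {
        List<Gap> gaps = List.of(new Gap(100, 120, 5), new Gap(110, 130, 8), new Gap(300, 320, 3));
        System.out.println(keepDominant(gaps));
    }
}
```

Each gap overlaps itself, but its own line count is never strictly greater than itself, so a gap is only dropped when a different overlapping gap dominates it. The comparison is quadratic in the number of gaps, which is likely acceptable for the small number of candidate gaps on a page.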


@ -6,8 +6,8 @@ import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation; import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
import com.knecon.fforesight.service.layoutparser.processor.model.LineInformation; import com.knecon.fforesight.service.layoutparser.processor.model.LineInformation;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import lombok.AllArgsConstructor; import lombok.AllArgsConstructor;
import lombok.Getter; import lombok.Getter;


@ -16,8 +16,7 @@ public class MainBodyTextFrameExtractionService {
public Rectangle2D calculateMainBodyTextFrame(LineInformation lineInformation) { public Rectangle2D calculateMainBodyTextFrame(LineInformation lineInformation) {
Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream() Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream().collect(RectangleTransformations.collectBBox());
.collect(RectangleTransformations.collectBBox());
return RectangleTransformations.pad(mainBodyTextFrame, mainBodyTextFrame.getWidth() * TEXT_FRAME_PAD_WIDTH, mainBodyTextFrame.getHeight() * TEXT_FRAME_PAD_HEIGHT); return RectangleTransformations.pad(mainBodyTextFrame, mainBodyTextFrame.getWidth() * TEXT_FRAME_PAD_WIDTH, mainBodyTextFrame.getHeight() * TEXT_FRAME_PAD_HEIGHT);
} }
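
`calculateMainBodyTextFrame` first collects the bounding box of all line boxes and then pads it in proportion to its own size. The sketch below shows that idea with plain `java.awt.geom` types. It rests on assumptions: `boundingBox` and `pad` are hypothetical replacements for `RectangleTransformations.collectBBox()` and `RectangleTransformations.pad(...)`, whose exact behavior is not visible in this diff, and `pad` is assumed to grow the rectangle by the given amount on every side.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

class BodyFrameSketch {

    // Smallest rectangle that contains all given rectangles; returns null for an empty list.
    static Rectangle2D boundingBox(List<Rectangle2D> lines) {
        Rectangle2D bbox = null;
        for (Rectangle2D line : lines) {
            bbox = (bbox == null) ? (Rectangle2D) line.clone() : bbox.createUnion(line);
        }
        return bbox;
    }

    // Assumption: padding extends the rectangle by dx to the left and right and by dy at top and bottom.
    static Rectangle2D pad(Rectangle2D r, double dx, double dy) {
        return new Rectangle2D.Double(r.getX() - dx, r.getY() - dy, r.getWidth() + 2 * dx, r.getHeight() + 2 * dy);
    }

    public static void main(String[] args) {
        List<Rectangle2D> lines = List.of(
                new Rectangle2D.Double(0, 0, 100, 10),
                new Rectangle2D.Double(50, 20, 100, 10));
        Rectangle2D frame = boundingBox(lines);
        // The padding factor 0.1 is illustrative only, not the service's TEXT_FRAME_PAD_* values.
        System.out.println(pad(frame, frame.getWidth() * 0.1, frame.getHeight() * 0.1));
    }
}
```

Scaling the padding by the frame's own width and height keeps the margin proportional across page sizes instead of tying it to a fixed number of points.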


@ -5,9 +5,9 @@ import java.util.List;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedSectionText; import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedSectionText;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
@Service @Service
public class SimplifiedSectionTextService { public class SimplifiedSectionTextService {
@ -23,4 +23,5 @@ public class SimplifiedSectionTextService {
return SimplifiedSectionText.builder().sectionNumber(section.getTreeId().get(0)).text(section.getTextBlock().getSearchText()).build(); return SimplifiedSectionText.builder().sectionNumber(section.getTreeId().get(0)).text(section.getTextBlock().getSearchText()).build();
} }
} }


@ -1,9 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.services.blockification; package com.knecon.fforesight.service.layoutparser.processor.services.blockification;
// TODO: figure out, why this fails the build // TODO: figure out, why this fails the build
// import static com.knecon.fforesight.service.layoutparser.processor.services.factory.SearchTextWithTextPositionFactory.HEIGHT_PADDING; // import static com.knecon.fforesight.service.layoutparser.processor.services.factory.SearchTextWithTextPositionFactory.HEIGHT_PADDING;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.Orientation; import com.knecon.fforesight.service.layoutparser.processor.model.Orientation;
@ -11,12 +22,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.table.Ruling;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.RulingTextDirAdjustUtil; import com.knecon.fforesight.service.layoutparser.processor.utils.RulingTextDirAdjustUtil;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
@Service @Service
@SuppressWarnings("all") @SuppressWarnings("all")
@ -83,13 +88,13 @@ public class TaasBlockificationService {
continue; continue;
} }
Matcher listIdentifierPattern = listIdentifier.matcher(currentTextBlock.getText()); Matcher listIdentifierPattern = listIdentifier.matcher(currentTextBlock.getText());
boolean isListIdentifier = listIdentifierPattern.find(); boolean isListIdentifier = listIdentifierPattern.find();
boolean yGap = Math.abs(currentTextBlock.getPdfMaxY() - previousTextBlock.getPdfMinY()) < previousTextBlock.getMostPopularWordHeight() * Y_GAP_SPLIT_HEIGHT_MODIFIER; boolean yGap = Math.abs(currentTextBlock.getPdfMaxY() - previousTextBlock.getPdfMinY()) < previousTextBlock.getMostPopularWordHeight() * Y_GAP_SPLIT_HEIGHT_MODIFIER;
boolean sameFont = previousTextBlock.getMostPopularWordFont().equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize(); boolean sameFont = previousTextBlock.getMostPopularWordFont()
.equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize();
// boolean yGap = previousTextBlock != null && currentTextBlock.getMinYDirAdj() - maxY > Math.min(word.getHeight(), prev.getHeight()) * Y_GAP_SPLIT_HEIGHT_MODIFIER; // boolean yGap = previousTextBlock != null && currentTextBlock.getMinYDirAdj() - maxY > Math.min(word.getHeight(), prev.getHeight()) * Y_GAP_SPLIT_HEIGHT_MODIFIER;
boolean alignsXRight = Math.abs(currentTextBlock.getPdfMaxX() - previousTextBlock.getPdfMaxX()) < X_ALIGNMENT_THRESHOLD; boolean alignsXRight = Math.abs(currentTextBlock.getPdfMaxX() - previousTextBlock.getPdfMaxX()) < X_ALIGNMENT_THRESHOLD;
@ -119,8 +124,9 @@ public class TaasBlockificationService {
} }
alreadyMerged.add(textPageBlock); alreadyMerged.add(textPageBlock);
textBlocksToMerge.add(Stream.concat(Stream.of(textPageBlock), textBlocksToMerge.add(Stream.concat(Stream.of(textPageBlock),
textPageBlocks.stream().filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2)).peek(alreadyMerged::add)) textPageBlocks.stream()
.toList()); .filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2))
.peek(alreadyMerged::add)).toList());
} }
return textBlocksToMerge.stream().map(TextPageBlock::merge).toList(); return textBlocksToMerge.stream().map(TextPageBlock::merge).toList();
} }
@ -163,8 +169,7 @@ public class TaasBlockificationService {
while (itty.hasNext()) { while (itty.hasNext()) {
TextPageBlock block = (TextPageBlock) itty.next(); TextPageBlock block = (TextPageBlock) itty.next();
if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold( if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold(block.getMaxY(),
block.getMaxY(),
previous.getMaxY()) || previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation() previous.getMaxY()) || previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation()
.equals(Orientation.RIGHT) && equalsWithThreshold(block.getMaxY(), previous.getMaxY())) { .equals(Orientation.RIGHT) && equalsWithThreshold(block.getMaxY(), previous.getMaxY())) {
previous.add(block); previous.add(block);
@ -189,7 +194,6 @@ public class TaasBlockificationService {
TextPositionSequence prev = null; TextPositionSequence prev = null;
// TODO: make static final constant // TODO: make static final constant
boolean wasSplitted = false; boolean wasSplitted = false;
Float splitX1 = null; Float splitX1 = null;
for (TextPositionSequence word : textPositions) { for (TextPositionSequence word : textPositions) {


@ -5,7 +5,6 @@ import java.util.Locale;
import java.util.regex.Matcher; import java.util.regex.Matcher;
import java.util.regex.Pattern; import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -13,6 +12,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType; import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils; import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
@ -63,16 +63,16 @@ public class DocuMineClassificationService {
textBlock.setClassification(PageBlockType.OTHER); textBlock.setClassification(PageBlockType.OTHER);
return; return;
} }
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter() textBlock,
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular()) page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
) { .getMostPopular())) {
textBlock.setClassification(PageBlockType.HEADER); textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) } else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter() textBlock,
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular()) page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
) { .getMostPopular())) {
textBlock.setClassification(PageBlockType.FOOTER); textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock, } else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks() document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()


@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
import java.util.List; import java.util.List;
import java.util.regex.Pattern; import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -11,6 +10,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType; import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils; import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
@ -21,7 +21,6 @@ import lombok.extern.slf4j.Slf4j;
@RequiredArgsConstructor @RequiredArgsConstructor
public class RedactManagerClassificationService { public class RedactManagerClassificationService {
public void classifyDocument(ClassificationDocument document) { public void classifyDocument(ClassificationDocument document) {
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular(); List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
@ -52,14 +51,16 @@ public class RedactManagerClassificationService {
textBlock.setClassification(PageBlockType.OTHER); textBlock.setClassification(PageBlockType.OTHER);
return; return;
} }
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter() textBlock,
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) { page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.HEADER); textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) } else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter() textBlock,
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) { page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.FOOTER); textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock, } else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks() document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()


@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
import java.util.List; import java.util.List;
import java.util.regex.Pattern; import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service; import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -12,6 +11,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType; import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService; import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils; import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
@ -27,7 +27,6 @@ public class TaasClassificationService {
public void classifyDocument(ClassificationDocument document) { public void classifyDocument(ClassificationDocument document) {
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular(); List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
log.debug("Document FontSize counters are: {}", document.getFontSizeCounter().getCountPerValue()); log.debug("Document FontSize counters are: {}", document.getFontSizeCounter().getCountPerValue());
@ -57,11 +56,13 @@ public class TaasClassificationService {
textBlock.setClassification(PageBlockType.OTHER); textBlock.setClassification(PageBlockType.OTHER);
return; return;
} }
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) { textBlock,
page.getRotation())) {
textBlock.setClassification(PageBlockType.HEADER); textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) } else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) { textBlock,
page.getRotation())) {
textBlock.setClassification(PageBlockType.FOOTER); textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock, } else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks() document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()


@ -18,8 +18,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationFooter; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationFooter;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationHeader; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationHeader;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree; import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
@ -31,6 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Pa
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Paragraph; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Paragraph;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock; import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.IdBuilder; import com.knecon.fforesight.service.layoutparser.processor.utils.IdBuilder;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations; import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;


@ -8,10 +8,10 @@ import java.util.List;
import java.util.Locale; import java.util.Locale;
import java.util.Objects; import java.util.Objects;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
import com.knecon.fforesight.service.layoutparser.processor.model.text.RedTextPosition; import com.knecon.fforesight.service.layoutparser.processor.model.text.RedTextPosition;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
import lombok.experimental.UtilityClass; import lombok.experimental.UtilityClass;
@ -110,6 +110,7 @@ public class SearchTextWithTextPositionFactory {
return context.stringIdx - context.lastHyphenIdx < MAX_HYPHEN_LINEBREAK_DISTANCE; return context.stringIdx - context.lastHyphenIdx < MAX_HYPHEN_LINEBREAK_DISTANCE;
} }
private static List<Boundary> mergeToBoundaries(List<Integer> integers) { private static List<Boundary> mergeToBoundaries(List<Integer> integers) {
if (integers.isEmpty()) { if (integers.isEmpty()) {
@ -125,8 +126,9 @@ public class SearchTextWithTextPositionFactory {
} }
end = current + 1; end = current + 1;
} }
if (boundaries.isEmpty()) if (boundaries.isEmpty()) {
boundaries.add(new Boundary(start, end)); boundaries.add(new Boundary(start, end));
}
return boundaries; return boundaries;
} }
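The hunk above only shows fragments of `mergeToBoundaries`, so the following is a minimal sketch of the general idea (collapsing strictly increasing indices into half-open ranges), not the project's actual implementation. `Range` is a hypothetical stand-in for the project's `Boundary` class, and unlike the visible code, this sketch always closes the last running range.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryMergeSketch {

    // Hypothetical stand-in for Boundary: a half-open range [start, end).
    record Range(int start, int end) {}

    // Assumes the input is strictly increasing.
    static List<Range> mergeToRanges(List<Integer> sortedIndices) {
        List<Range> ranges = new ArrayList<>();
        if (sortedIndices.isEmpty()) {
            return ranges;
        }
        int start = sortedIndices.get(0);
        int end = start + 1;
        for (int i = 1; i < sortedIndices.size(); i++) {
            int current = sortedIndices.get(i);
            if (current != end) {
                // Gap found: close the running range and start a new one.
                ranges.add(new Range(start, end));
                start = current;
            }
            end = current + 1;
        }
        // Close the last running range.
        ranges.add(new Range(start, end));
        return ranges;
    }

    public static void main(String[] args) {
        // Indices 3,4,5 and 9,10 are consecutive runs; 14 stands alone.
        System.out.println(mergeToRanges(List.of(3, 4, 5, 9, 10, 14)));
    }
}
```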
@ -138,6 +140,7 @@ public class SearchTextWithTextPositionFactory {
} }
} }
private boolean isLineBreak(RedTextPosition currentTextPosition, RedTextPosition previousTextPosition) { private boolean isLineBreak(RedTextPosition currentTextPosition, RedTextPosition previousTextPosition) {
return Objects.equals(currentTextPosition.getUnicode(), "\n") || isDeltaYLargerThanTextHeight(currentTextPosition, previousTextPosition); return Objects.equals(currentTextPosition.getUnicode(), "\n") || isDeltaYLargerThanTextHeight(currentTextPosition, previousTextPosition);


@ -11,12 +11,12 @@ import java.util.Map;
import java.util.Set; import java.util.Set;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.TableMergingUtility; import com.knecon.fforesight.service.layoutparser.processor.utils.TableMergingUtility;
import lombok.experimental.UtilityClass; import lombok.experimental.UtilityClass;


@ -8,15 +8,15 @@ import java.util.Set;
import java.util.stream.Collectors; import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.TextBlock; import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.TextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations; import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
import lombok.experimental.UtilityClass; import lombok.experimental.UtilityClass;


@ -2,10 +2,10 @@ package com.knecon.fforesight.service.layoutparser.processor.services.factory;
import java.util.List; import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock; import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.AccessLevel; import lombok.AccessLevel;
import lombok.experimental.FieldDefaults; import lombok.experimental.FieldDefaults;


@ -7,11 +7,11 @@ import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.NoSuchElementException; import java.util.NoSuchElementException;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData; import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPage; import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPage;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree; import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;


@ -1,7 +1,6 @@
package com.knecon.fforesight.service.layoutparser.processor.services.mapper; package com.knecon.fforesight.service.layoutparser.processor.services.mapper;
import java.awt.geom.Rectangle2D; import java.awt.geom.Rectangle2D;
import java.util.Collections;
import java.util.HashMap; import java.util.HashMap;
import java.util.Locale; import java.util.Locale;
import java.util.Map; import java.util.Map;
@ -9,7 +8,6 @@ import java.util.Map;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure; import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Image; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Image;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;


@ -329,6 +329,7 @@ public class PDFLinesTextStripper extends PDFTextStripper {
.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj()) < maximumGapSize; .getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj()) < maximumGapSize;
} }
@Override @Override
public String getText(PDDocument doc) throws IOException { public String getText(PDDocument doc) throws IOException {


@ -25,10 +25,23 @@ import java.io.StringWriter;
import java.io.Writer; import java.io.Writer;
import java.text.Bidi; import java.text.Bidi;
import java.text.Normalizer; import java.text.Normalizer;
import java.util.*; import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern; import java.util.regex.Pattern;
import lombok.Getter;
import org.apache.commons.logging.Log; import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory; import org.apache.commons.logging.LogFactory;
import org.apache.pdfbox.cos.COSDictionary; import org.apache.pdfbox.cos.COSDictionary;
@ -46,6 +59,8 @@ import org.apache.pdfbox.text.TextPositionComparator;
import com.knecon.fforesight.service.layoutparser.processor.utils.QuickSort; import com.knecon.fforesight.service.layoutparser.processor.utils.QuickSort;
import lombok.Getter;
/** /**
* This is a copy of the PDFBox class; only lines 594-607 were adjusted to work around a bug in PDFBox. * This is a copy of the PDFBox class; only lines 594-607 were adjusted to work around a bug in PDFBox.
* see S416.pdf * see S416.pdf
@ -194,40 +209,33 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
} }
public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
{
PDMarkedContent markedContent = PDMarkedContent.create(tag, properties); PDMarkedContent markedContent = PDMarkedContent.create(tag, properties);
if (this.currentMarkedContents.isEmpty()) if (this.currentMarkedContents.isEmpty()) {
{
this.markedContents.add(markedContent); this.markedContents.add(markedContent);
} } else {
else PDMarkedContent currentMarkedContent = this.currentMarkedContents.peek();
{ if (currentMarkedContent != null) {
PDMarkedContent currentMarkedContent =
this.currentMarkedContents.peek();
if (currentMarkedContent != null)
{
currentMarkedContent.addMarkedContent(markedContent); currentMarkedContent.addMarkedContent(markedContent);
} }
} }
this.currentMarkedContents.push(markedContent); this.currentMarkedContents.push(markedContent);
} }
@Override @Override
public void endMarkedContentSequence() public void endMarkedContentSequence() {
{
if (!this.currentMarkedContents.isEmpty()) if (!this.currentMarkedContents.isEmpty()) {
{
this.currentMarkedContents.pop(); this.currentMarkedContents.pop();
} }
} }
public void xobject(PDXObject xobject) public void xobject(PDXObject xobject) {
{
if (!this.currentMarkedContents.isEmpty()) if (!this.currentMarkedContents.isEmpty()) {
{
this.currentMarkedContents.peek().addXObject(xobject); this.currentMarkedContents.peek().addXObject(xobject);
} }
} }
@ -635,7 +643,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
var normalized = normalize(line); var normalized = normalize(line);
// normalized.stream().filter(l -> System.out.println(l.getText().contains("Plenarprotokoll 20/24")).findFirst().isPresent() // normalized.stream().filter(l -> System.out.println(l.getText().contains("Plenarprotokoll 20/24")).findFirst().isPresent()
lastLineStartPosition = handleLineSeparation(current, lastPosition, lastLineStartPosition, maxHeightForLine); lastLineStartPosition = handleLineSeparation(current, lastPosition, lastLineStartPosition, maxHeightForLine);
writeLine(normalized, current.isParagraphStart); writeLine(normalized, current.isParagraphStart);
line.clear(); line.clear();
@ -914,8 +921,7 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
textList.add(text); textList.add(text);
} }
} }
if (!this.currentMarkedContents.isEmpty()) if (!this.currentMarkedContents.isEmpty()) {
{
this.currentMarkedContents.peek().addText(text); this.currentMarkedContents.peek().addText(text);
} }
} }
@ -2102,7 +2108,9 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
return endParagraphWritten; return endParagraphWritten;
} }
public void setEndParagraphWritten(){
public void setEndParagraphWritten() {
endParagraphWritten = true; endParagraphWritten = true;
} }
@ -2145,7 +2153,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
this.isHangingIndent = true; this.isHangingIndent = true;
} }
} }
} }


@ -1,10 +1,13 @@
package com.knecon.fforesight.service.layoutparser.processor.services.visualization; package com.knecon.fforesight.service.layoutparser.processor.services.visualization;
import java.awt.Color;
import java.awt.geom.AffineTransform; import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D; import java.awt.geom.Rectangle2D;
import java.io.IOException; import java.io.IOException;
import java.io.OutputStream; import java.io.OutputStream;
import java.util.HashSet; import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set; import java.util.Set;
import org.apache.pdfbox.cos.COSDictionary; import org.apache.pdfbox.cos.COSDictionary;
@ -30,6 +33,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.visualization.
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.LayoutGrid; import com.knecon.fforesight.service.layoutparser.processor.model.visualization.LayoutGrid;
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.PlacedText; import com.knecon.fforesight.service.layoutparser.processor.model.visualization.PlacedText;
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.VisualizationsOnPage; import com.knecon.fforesight.service.layoutparser.processor.model.visualization.VisualizationsOnPage;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import lombok.RequiredArgsConstructor; import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows; import lombok.SneakyThrows;
@ -40,7 +45,6 @@ import lombok.extern.slf4j.Slf4j;
@RequiredArgsConstructor @RequiredArgsConstructor
public class ViewerDocumentService { public class ViewerDocumentService {
private static final String LAYER_NAME = "Layout grid"; private static final String LAYER_NAME = "Layout grid";
private static final int FONT_SIZE = 10; private static final int FONT_SIZE = 10;
public static final float LINE_WIDTH = 1f; public static final float LINE_WIDTH = 1f;
@ -49,13 +53,18 @@ public class ViewerDocumentService {
@SneakyThrows @SneakyThrows
public void createViewerDocument(PDDocument pdDocument, Document document, OutputStream outputStream, boolean layerVisibilityDefaultValue) { public void createViewerDocument(PDDocument pdDocument,
Document document,
OutputStream outputStream,
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells,
boolean layerVisibilityDefaultValue) {
LayoutGrid layoutGrid = layoutGridService.createLayoutGrid(document); LayoutGrid layoutGrid = layoutGridService.createLayoutGrid(document);
// PDDocument.save() is very slow, since it actually traverses the entire pdf and writes a new one. // PDDocument.save() is very slow, since it actually traverses the entire pdf and writes a new one.
// If we collect all COSDictionaries we changed and tell it explicitly to only add the changed ones by using saveIncremental it's very fast. // If we collect all COSDictionaries we changed and tell it explicitly to only add the changed ones by using saveIncremental it's very fast.
Set<COSDictionary> dictionariesToUpdate = new HashSet<>(); Set<COSDictionary> dictionariesToUpdate = new HashSet<>();
PDOptionalContentGroup layer = addLayerToDocument(pdDocument, dictionariesToUpdate, layerVisibilityDefaultValue); PDOptionalContentGroup layer = addLayerToDocument(pdDocument, dictionariesToUpdate, layerVisibilityDefaultValue);
PDOptionalContentGroup visualLayoutParsingLayer = addLayerToDocument(pdDocument, dictionariesToUpdate, true);
PDFont font = new PDType1Font(Standard14Fonts.FontName.HELVETICA); PDFont font = new PDType1Font(Standard14Fonts.FontName.HELVETICA);
for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) { for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) {
@ -114,6 +123,30 @@ public class ViewerDocumentService {
} }
contentStream.restoreGraphicsState(); contentStream.restoreGraphicsState();
contentStream.endMarkedContent(); contentStream.endMarkedContent();
contentStream.beginMarkedContent(COSName.OC, visualLayoutParsingLayer);
contentStream.saveGraphicsState();
contentStream.setLineWidth(LINE_WIDTH);
for (VisualLayoutParsingResult tableCells : extractedTableCells.getOrDefault(pageNumber, List.of())) {
contentStream.setStrokingColor(new Color(0xFF0000));
contentStream.addRect((float) tableCells.getX0(), (float) tableCells.getY0(), (float) tableCells.getWidth(), (float) tableCells.getHeight());
contentStream.stroke();
contentStream.setFont(font, FONT_SIZE);
contentStream.beginText();
Matrix textMatrix = new Matrix((float) textDeRotationMatrix.getScaleX(),
(float) textDeRotationMatrix.getShearX(),
(float) textDeRotationMatrix.getShearY(),
(float) textDeRotationMatrix.getScaleY(),
tableCells.getX0() ,
tableCells.getY0());
textMatrix.translate(-((font.getStringWidth(tableCells.getLabel()) / 1000) * FONT_SIZE + (2 * LINE_WIDTH) + 4), -FONT_SIZE);
contentStream.setTextMatrix(textMatrix);
contentStream.showText(tableCells.getLabel());
contentStream.endText();
}
contentStream.restoreGraphicsState();
contentStream.endMarkedContent();
} }
dictionariesToUpdate.add(pdPage.getCOSObject()); dictionariesToUpdate.add(pdPage.getCOSObject());
dictionariesToUpdate.add(pdPage.getResources().getCOSObject()); dictionariesToUpdate.add(pdPage.getResources().getCOSObject());
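The label placement in the added block relies on `PDFont.getStringWidth` reporting widths in 1/1000 text-space units, which is why the value is divided by 1000 and scaled by the font size. The arithmetic can be sketched in isolation as follows; `labelShiftX` is a hypothetical helper name, and the constants mirror `FONT_SIZE` and `LINE_WIDTH` from the class above.

```java
class LabelOffsetSketch {

    static final float FONT_SIZE = 10f;
    static final float LINE_WIDTH = 1f;

    // glyphUnits: string width in 1/1000 text-space units, as PDFont.getStringWidth reports it.
    // Returns the horizontal shift applied to the text matrix, so the label ends left of the box.
    static float labelShiftX(float glyphUnits) {
        float textWidth = glyphUnits / 1000f * FONT_SIZE;
        // Same padding as in the diff: two line widths plus a fixed gap of 4 units.
        return -(textWidth + 2 * LINE_WIDTH + 4);
    }

    public static void main(String[] args) {
        // A label 2500 glyph units wide is 25 units wide at font size 10.
        System.out.println(labelShiftX(2500f));
    }
}
```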


@ -1,12 +1,5 @@
package com.knecon.fforesight.service.layoutparser.processor.utils; package com.knecon.fforesight.service.layoutparser.processor.utils;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.experimental.UtilityClass;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.TextPosition;
import java.awt.geom.Rectangle2D; import java.awt.geom.Rectangle2D;
import java.util.Collection; import java.util.Collection;
import java.util.Collections; import java.util.Collections;
@ -14,12 +7,22 @@ import java.util.List;
import java.util.Map; import java.util.Map;
import java.util.stream.Collectors; import java.util.stream.Collectors;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.TextPosition;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.experimental.UtilityClass;
@UtilityClass @UtilityClass
public class MarkedContentUtils { public class MarkedContentUtils {
public static final String HEADER = "Header"; public static final String HEADER = "Header";
public static final String FOOTER = "Footer"; public static final String FOOTER = "Footer";
public List<Rectangle2D> getMarkedContentBboxPerLine(List<PDMarkedContent> markedContents, String subtype) { public List<Rectangle2D> getMarkedContentBboxPerLine(List<PDMarkedContent> markedContents, String subtype) {
if (markedContents == null) { if (markedContents == null) {
@ -31,7 +34,8 @@ public class MarkedContentUtils {
.filter(m -> m.getProperties() != null) .filter(m -> m.getProperties() != null)
.filter(m -> m.getProperties().getItem("Subtype") != null) .filter(m -> m.getProperties().getItem("Subtype") != null)
.filter(m -> ((COSName) m.getProperties().getItem("Subtype")).getName().equals(subtype)) .filter(m -> ((COSName) m.getProperties().getItem("Subtype")).getName().equals(subtype))
.map(PDMarkedContent::getContents).flatMap(Collection::stream) .map(PDMarkedContent::getContents)
.flatMap(Collection::stream)
.filter(t -> t instanceof TextPosition) .filter(t -> t instanceof TextPosition)
.map(t -> (TextPosition) t) .map(t -> (TextPosition) t)
.filter(t -> !t.getUnicode().equals(" ")) .filter(t -> !t.getUnicode().equals(" "))
@ -41,16 +45,19 @@ public class MarkedContentUtils {
return Collections.emptyList(); return Collections.emptyList();
} }
return markedContentByYPosition.values().stream() return markedContentByYPosition.values()
.map(textPositions -> new TextPositionSequence(textPositions.stream() .stream()
.toList(), 0, true) .map(textPositions -> new TextPositionSequence(textPositions.stream().toList(), 0, true).getRectangle())
.getRectangle()) .map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight())))
.map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight()))).collect(Collectors.toList()); .collect(Collectors.toList());
} }
public boolean intersects(TextPageBlock textBlock, Map<String, List<Rectangle2D>> markedContentBboxPerType, String type) { public boolean intersects(TextPageBlock textBlock, Map<String, List<Rectangle2D>> markedContentBboxPerType, String type) {
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type).stream().anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type)
.stream()
.anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
} }
} }


@ -19,10 +19,9 @@ public final class PositionUtils {
double threshold = textBlock.getMostPopularWordHeight() * 3; double threshold = textBlock.getMostPopularWordHeight() * 3;
if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX() if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX() && textBlock.getPdfMaxX() - threshold < btf.getTopLeft()
&& textBlock.getPdfMaxX() - threshold < btf.getTopLeft().getX() + btf.getWidth() .getX() + btf.getWidth() && textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY() && textBlock.getPdfMaxY() - threshold < btf.getTopLeft()
&& textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY() .getY() + btf.getHeight()) {
&& textBlock.getPdfMaxY() - threshold < btf.getTopLeft().getY() + btf.getHeight()) {
return true; return true;
} else { } else {
return false; return false;


@ -41,11 +41,14 @@ public class RectangleTransformations {
return atomicTextBlocks.stream().flatMap(atomicTextBlock -> atomicTextBlock.getPositions().stream()).collect(new Rectangle2DBBoxCollector()); return atomicTextBlocks.stream().flatMap(atomicTextBlock -> atomicTextBlock.getPositions().stream()).collect(new Rectangle2DBBoxCollector());
} }
public static Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> collectBBox() { public static Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> collectBBox() {
return new Rectangle2DBBoxCollector(); return new Rectangle2DBBoxCollector();
} }
public static PDRectangle toPDRectangleBBox(List<Rectangle> rectangles) { public static PDRectangle toPDRectangleBBox(List<Rectangle> rectangles) {
Rectangle2D rectangle2D = RectangleTransformations.rectangleBBox(rectangles); Rectangle2D rectangle2D = RectangleTransformations.rectangleBBox(rectangles);
@ -70,6 +73,7 @@ public class RectangleTransformations {
return format("%f,%f,%f,%f", rectangle2D.getX(), rectangle2D.getY(), rectangle2D.getWidth(), rectangle2D.getHeight()); return format("%f,%f,%f,%f", rectangle2D.getX(), rectangle2D.getY(), rectangle2D.getWidth(), rectangle2D.getHeight());
} }
public static Rectangle2D rectangleBBox(List<Rectangle> rectangles) { public static Rectangle2D rectangleBBox(List<Rectangle> rectangles) {
return rectangles.stream().map(RectangleTransformations::toRectangle2D).collect(new Rectangle2DBBoxCollector()); return rectangles.stream().map(RectangleTransformations::toRectangle2D).collect(new Rectangle2DBBoxCollector());
@ -84,6 +88,7 @@ public class RectangleTransformations {
-redactionLogRectangle.getHeight()); -redactionLogRectangle.getHeight());
} }
public static Rectangle2D toRectangle2D(PDRectangle rectangle) { public static Rectangle2D toRectangle2D(PDRectangle rectangle) {
return new Rectangle2D.Double(rectangle.getLowerLeftX(), rectangle.getLowerLeftY(), rectangle.getWidth(), rectangle.getHeight()); return new Rectangle2D.Double(rectangle.getLowerLeftX(), rectangle.getLowerLeftY(), rectangle.getWidth(), rectangle.getHeight());


@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.utils;
import java.util.List; import java.util.List;
import java.util.stream.Collectors; import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;


@ -28,15 +28,13 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPosit
* *
* @author Ben Litchfield * @author Ben Litchfield
*/ */
public class TextPositionSequenceComparator implements Comparator<TextPositionSequence> public class TextPositionSequenceComparator implements Comparator<TextPositionSequence> {
{
@Override @Override
public int compare(TextPositionSequence pos1, TextPositionSequence pos2) public int compare(TextPositionSequence pos1, TextPositionSequence pos2) {
{
// only compare text that is in the same direction // only compare text that is in the same direction
int cmp1 = Float.compare(pos1.getDir().getDegrees(), pos2.getDir().getDegrees()); int cmp1 = Float.compare(pos1.getDir().getDegrees(), pos2.getDir().getDegrees());
if (cmp1 != 0) if (cmp1 != 0) {
{
return cmp1; return cmp1;
} }
@ -54,19 +52,13 @@ public class TextPositionSequenceComparator implements Comparator<TextPositionSe
float yDifference = Math.abs(pos1YBottom - pos2YBottom); float yDifference = Math.abs(pos1YBottom - pos2YBottom);
// we will do a simple tolerance comparison // we will do a simple tolerance comparison
if (yDifference < .1 || if (yDifference < .1 || pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom || pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom) {
pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
return Float.compare(x1, x2); return Float.compare(x1, x2);
} } else if (pos1YBottom < pos2YBottom) {
else if (pos1YBottom < pos2YBottom)
{
return -1; return -1;
} } else {
else
{
return 1; return 1;
} }
} }
} }
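The tolerance comparison in `TextPositionSequenceComparator` treats two sequences as being on the same line when their bottoms nearly coincide or when one bottom falls inside the other's vertical extent, and only then orders them by x. A minimal sketch of that rule is shown below; `Word` is a hypothetical stand-in for `TextPositionSequence`, the direction check is omitted, and y is assumed to grow downwards (so `yTop <= yBottom`), as the conditions in the diff imply.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineToleranceSketch {

    // Hypothetical stand-in for TextPositionSequence; y grows downwards.
    record Word(String text, float x, float yTop, float yBottom) {}

    static final Comparator<Word> READING_ORDER = (a, b) -> {
        float yDifference = Math.abs(a.yBottom() - b.yBottom());
        // Same line: bottoms nearly equal, or one bottom lies within the other's extent.
        boolean sameLine = yDifference < .1
                || (b.yBottom() >= a.yTop() && b.yBottom() <= a.yBottom())
                || (a.yBottom() >= b.yTop() && a.yBottom() <= b.yBottom());
        if (sameLine) {
            return Float.compare(a.x(), b.x());
        }
        return a.yBottom() < b.yBottom() ? -1 : 1;
    };

    static List<String> sortedTexts(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(READING_ORDER);
        return copy.stream().map(Word::text).toList();
    }

    public static void main(String[] args) {
        List<Word> words = List.of(
                new Word("world", 50f, 10f, 20f),
                new Word("Hello", 10f, 11f, 21f), // slightly lower baseline, still the same line
                new Word("Next", 10f, 30f, 40f));
        System.out.println(sortedTexts(words));
    }
}
```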


@ -14,7 +14,7 @@ import com.knecon.fforesight.service.layoutparser.server.queue.MessagingConfigur
import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration; import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration;
@ImportAutoConfiguration({MultiTenancyAutoConfiguration.class}) @ImportAutoConfiguration({MultiTenancyAutoConfiguration.class})
@Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class}) @Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class})
@SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class}) @SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class})
public class Application { public class Application {


@ -17,6 +17,7 @@ import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest
import lombok.SneakyThrows; import lombok.SneakyThrows;
public class DocumentDataTests extends BuildDocumentTest { public class DocumentDataTests extends BuildDocumentTest {
@Test @Test
@SneakyThrows @SneakyThrows
public void createDocumentDataForAllFiles() { public void createDocumentDataForAllFiles() {
@ -36,11 +37,12 @@ public class DocumentDataTests extends BuildDocumentTest {
for (String pdfFileName : pdfFileNames) { for (String pdfFileName : pdfFileNames) {
System.out.println(pdfFileName); System.out.println(pdfFileName);
DocumentData documentData = DocumentDataMapper.toDocumentData(buildGraph(resource.getFile().toPath().getParent().relativize(Path.of(pdfFileName)).toString())); DocumentData documentData = DocumentDataMapper.toDocumentData(buildGraph(resource.getFile().toPath().getParent().relativize(Path.of(pdfFileName)).toString()));
File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile(); File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile();
outputFile.toPath().getParent().toFile().mkdirs(); outputFile.toPath().getParent().toFile().mkdirs();
try (var out = new FileOutputStream(outputFile.toString().replace(".pdf", ".json"))) { try (var out = new FileOutputStream(outputFile.toString().replace(".pdf", ".json"))) {
ObjectMapperFactory.create().writeValue(out, documentData); ObjectMapperFactory.create().writeValue(out, documentData);
} }
} }
} }
} }


@ -7,7 +7,6 @@ import java.util.HashMap;
import java.util.List; import java.util.List;
import java.util.Map; import java.util.Map;
- import lombok.SneakyThrows;
import org.apache.pdfbox.Loader; import org.apache.pdfbox.Loader;
import org.apache.pdfbox.cos.COSArray; import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase; import org.apache.pdfbox.cos.COSBase;
@ -25,18 +24,24 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import org.springframework.core.io.ClassPathResource; import org.springframework.core.io.ClassPathResource;
+ import lombok.SneakyThrows;
/** /**
* @author mkl * @author mkl
*/ */
@Disabled @Disabled
public class ExtractMarkedContentTest { public class ExtractMarkedContentTest {
final static File RESULT_FOLDER = new File("target/test-outputs", "extract"); final static File RESULT_FOLDER = new File("target/test-outputs", "extract");
@BeforeEach @BeforeEach
public void setUpBeforeClass() throws Exception { public void setUpBeforeClass() throws Exception {
RESULT_FOLDER.mkdirs(); RESULT_FOLDER.mkdirs();
} }
/** /**
* <a href="https://stackoverflow.com/questions/54956720/how-to-replace-a-space-with-a-word-while-extract-the-data-from-pdf-using-pdfbox"> * <a href="https://stackoverflow.com/questions/54956720/how-to-replace-a-space-with-a-word-while-extract-the-data-from-pdf-using-pdfbox">
* How to replace a space with a word while extract the data from PDF using PDFBox * How to replace a space with a word while extract the data from PDF using PDFBox
@ -52,6 +57,7 @@ public class ExtractMarkedContentTest {
@Test @Test
@SneakyThrows @SneakyThrows
public void testExtractTestWPhromma() throws IOException { public void testExtractTestWPhromma() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "testWPhromma.pdf"); System.out.printf("\n\n===\n%s\n===\n", "testWPhromma.pdf");
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) { try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
@ -74,6 +80,7 @@ public class ExtractMarkedContentTest {
} }
} }
/** /**
* <a href="https://stackoverflow.com/questions/59192443/get-tags-related-bboxs-even-though-there-is-no-attributes-a-in-document-cata"> * <a href="https://stackoverflow.com/questions/59192443/get-tags-related-bboxs-even-though-there-is-no-attributes-a-in-document-cata">
* Get tag's related BBox's even though there is no attributes (/A in document catalog structure) related to Layout in PDFBox? * Get tag's related BBox's even though there is no attributes (/A in document catalog structure) related to Layout in PDFBox?
@ -88,9 +95,10 @@ public class ExtractMarkedContentTest {
*/ */
@Test @Test
public void testExtractResMultipage() throws IOException { public void testExtractResMultipage() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "res_multipage.pdf"); System.out.printf("\n\n===\n%s\n===\n", "res_multipage.pdf");
- try(PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
+ try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>(); Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();
@ -111,6 +119,7 @@ public class ExtractMarkedContentTest {
} }
} }
/** /**
* <a href="https://issues.apache.org/jira/browse/PDFBOX-5613"> * <a href="https://issues.apache.org/jira/browse/PDFBOX-5613">
* PDFBOX-5613 - uncorrent paragraph split * PDFBOX-5613 - uncorrent paragraph split
@ -125,6 +134,7 @@ public class ExtractMarkedContentTest {
*/ */
@Test @Test
public void testExtractDailyReport() throws IOException { public void testExtractDailyReport() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "Daily Report.pdf"); System.out.printf("\n\n===\n%s\n===\n", "Daily Report.pdf");
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) { try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
@ -147,10 +157,12 @@ public class ExtractMarkedContentTest {
} }
} }
/** /**
* @see #testExtractTestWPhromma() * @see #testExtractTestWPhromma()
*/ */
void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) { void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) {
String structType = null; String structType = null;
PDPage page = null; PDPage page = null;
if (node instanceof PDStructureElement) { if (node instanceof PDStructureElement) {
@ -166,7 +178,7 @@ public class ExtractMarkedContentTest {
if (base instanceof COSDictionary) { if (base instanceof COSDictionary) {
showStructure(PDStructureNode.create((COSDictionary) base), markedContents); showStructure(PDStructureNode.create((COSDictionary) base), markedContents);
} else if (base instanceof COSNumber) { } else if (base instanceof COSNumber) {
- showContent(((COSNumber)base).intValue(), theseMarkedContents);
+ showContent(((COSNumber) base).intValue(), theseMarkedContents);
} else { } else {
System.out.printf("?%s\n", base); System.out.printf("?%s\n", base);
} }
@ -174,7 +186,7 @@ public class ExtractMarkedContentTest {
} else if (object instanceof PDStructureNode) { } else if (object instanceof PDStructureNode) {
showStructure((PDStructureNode) object, markedContents); showStructure((PDStructureNode) object, markedContents);
} else if (object instanceof Integer) { } else if (object instanceof Integer) {
- showContent((Integer)object, theseMarkedContents);
+ showContent((Integer) object, theseMarkedContents);
} else { } else {
System.out.printf("?%s\n", object); System.out.printf("?%s\n", object);
} }
@ -183,21 +195,24 @@ public class ExtractMarkedContentTest {
System.out.printf("</%s>\n", structType); System.out.printf("</%s>\n", structType);
} }
/** /**
* @see #showStructure(PDStructureNode, Map) * @see #showStructure(PDStructureNode, Map)
* @see #testExtractTestWPhromma() * @see #testExtractTestWPhromma()
*/ */
void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) { void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) {
PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null; PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList(); List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
StringBuilder textContent = new StringBuilder(); StringBuilder textContent = new StringBuilder();
for (Object object : contents) { for (Object object : contents) {
if (object instanceof TextPosition) { if (object instanceof TextPosition) {
- textContent.append(((TextPosition)object).getUnicode());
+ textContent.append(((TextPosition) object).getUnicode());
} else { } else {
textContent.append("?" + object); textContent.append("?" + object);
} }
} }
System.out.printf("%s\n", textContent); System.out.printf("%s\n", textContent);
} }
} }


@ -2,31 +2,20 @@ package com.knecon.fforesight.service.layoutparser.server.graph;
import java.io.FileOutputStream; import java.io.FileOutputStream;
import java.nio.file.Path; import java.nio.file.Path;
- import java.util.List;
import org.apache.pdfbox.Loader; import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDDocument;
- import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired; import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.io.ClassPathResource; import org.springframework.core.io.ClassPathResource;
- import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
- import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
- import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.NodeType;
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingType; import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingType;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationDocument; import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationDocument;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document; import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
- import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
- import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
- import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse; import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService; import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
import com.knecon.fforesight.service.layoutparser.processor.services.classification.RedactManagerClassificationService; import com.knecon.fforesight.service.layoutparser.processor.services.classification.RedactManagerClassificationService;
- import com.knecon.fforesight.service.layoutparser.processor.services.factory.DocumentGraphFactory;
- import com.knecon.fforesight.service.layoutparser.processor.services.mapper.DocumentDataMapper;
- import com.knecon.fforesight.service.layoutparser.processor.services.mapper.PropertiesMapper;
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.LayoutGridService; import com.knecon.fforesight.service.layoutparser.processor.services.visualization.LayoutGridService;
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.ViewerDocumentService; import com.knecon.fforesight.service.layoutparser.processor.services.visualization.ViewerDocumentService;
import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest; import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest;
@ -41,6 +30,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
@Autowired @Autowired
private RedactManagerClassificationService redactManagerClassificationService; private RedactManagerClassificationService redactManagerClassificationService;
@Test @Test
@SneakyThrows @SneakyThrows
public void testViewerDocument() { public void testViewerDocument() {
@ -55,6 +45,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
} }
} }
public ClassificationDocument buildClassificationDocument(PDDocument originDocument) { public ClassificationDocument buildClassificationDocument(PDDocument originDocument) {
ClassificationDocument classificationDocument = layoutParsingPipeline.parseLayout(LayoutParsingType.REDACT_MANAGER, ClassificationDocument classificationDocument = layoutParsingPipeline.parseLayout(LayoutParsingType.REDACT_MANAGER,


@ -9,7 +9,6 @@ import org.apache.pdfbox.util.Matrix;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.ObjectMapper;
- import com.iqser.red.storage.commons.properties.StorageProperties;
import com.iqser.red.storage.commons.service.ObjectSerializer; import com.iqser.red.storage.commons.service.ObjectSerializer;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;


@ -13,8 +13,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.PageInformatio
import com.knecon.fforesight.service.layoutparser.processor.services.DividingColumnDetectionService; import com.knecon.fforesight.service.layoutparser.processor.services.DividingColumnDetectionService;
import com.knecon.fforesight.service.layoutparser.processor.services.GapDetectionService; import com.knecon.fforesight.service.layoutparser.processor.services.GapDetectionService;
import com.knecon.fforesight.service.layoutparser.processor.services.GapsAcrossLinesService; import com.knecon.fforesight.service.layoutparser.processor.services.GapsAcrossLinesService;
- import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor; import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
+ import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw; import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows; import lombok.SneakyThrows;
@ -36,7 +36,8 @@ class GapAcrossLinesDetectionServiceTest {
System.out.println("start column detection"); System.out.println("start column detection");
start = System.currentTimeMillis(); start = System.currentTimeMillis();
for (PageInformation pageInformation : pageInformations) { for (PageInformation pageInformation : pageInformations) {
- GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(), pageInformation.getMainBodyTextFrame());
+ GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(),
+ pageInformation.getMainBodyTextFrame());
columnsPerPage.add(GapsAcrossLinesService.detectXGapsAcrossLines(gapInformation, pageInformation.getMainBodyTextFrame())); columnsPerPage.add(GapsAcrossLinesService.detectXGapsAcrossLines(gapInformation, pageInformation.getMainBodyTextFrame()));
} }
System.out.printf("Finished column detection in %d ms%n", System.currentTimeMillis() - start); System.out.printf("Finished column detection in %d ms%n", System.currentTimeMillis() - start);


@ -12,8 +12,8 @@ import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation; import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.services.InvisibleTableDetectionService; import com.knecon.fforesight.service.layoutparser.processor.services.InvisibleTableDetectionService;
- import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor; import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
+ import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations; import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw; import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;


@ -22,7 +22,6 @@ class MainBodyTextFrameExtractionServiceTest {
String tmpFileName = Path.of("/tmp/").resolve(Path.of(fileName).getFileName() + "_MAIN_BODY.pdf").toString(); String tmpFileName = Path.of("/tmp/").resolve(Path.of(fileName).getFileName() + "_MAIN_BODY.pdf").toString();
List<PageContents> sortedTextPositionSequence = PageContentExtractor.getSortedPageContents(fileName); List<PageContents> sortedTextPositionSequence = PageContentExtractor.getSortedPageContents(fileName);
} }
} }


@ -3,13 +3,12 @@ package com.knecon.fforesight.service.layoutparser.server.services;
import java.nio.file.Path; import java.nio.file.Path;
import java.util.List; import java.util.List;
- import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageContents; import com.knecon.fforesight.service.layoutparser.processor.model.PageContents;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence; import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
- import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor; import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
+ import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw; import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows; import lombok.SneakyThrows;
@ -27,14 +26,11 @@ class PageContentExtractorTest {
PdfDraw.drawRectanglesPerPageNumberedByLine(fileName, PdfDraw.drawRectanglesPerPageNumberedByLine(fileName,
textPositionPerPage.stream() textPositionPerPage.stream()
- .map(t -> t.getSortedTextPositionSequences()
- .stream()
- .map(TextPositionSequence::getRectangle)
- .map(RectangleTransformations::toRectangle2D)
+ .map(t -> t.getSortedTextPositionSequences().stream().map(TextPositionSequence::getRectangle).map(RectangleTransformations::toRectangle2D)
//.map(textPositionSequence -> (Rectangle2D) new Rectangle2D.Double(textPositionSequence.getMaxXDirAdj(), textPositionSequence.getMaxYDirAdj(), textPositionSequence.getWidth(), textPositionSequence.getHeight())) //.map(textPositionSequence -> (Rectangle2D) new Rectangle2D.Double(textPositionSequence.getMaxXDirAdj(), textPositionSequence.getMaxYDirAdj(), textPositionSequence.getWidth(), textPositionSequence.getHeight()))
- .map(List::of)
- .toList())
- .toList(), tmpFileName);
+ .map(List::of).toList())
+ .toList(),
+ tmpFileName);
} }
} }


@ -7,8 +7,8 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test; import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation; import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
- import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor; import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
+ import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw; import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows; import lombok.SneakyThrows;
@ -38,6 +38,7 @@ class PageInformationServiceTest {
System.out.printf("Finished drawing rectangles in %d ms%n", System.currentTimeMillis() - start); System.out.printf("Finished drawing rectangles in %d ms%n", System.currentTimeMillis() - start);
} }
@Test @Test
@Disabled @Disabled
@SneakyThrows @SneakyThrows


@ -1,5 +1,25 @@
package com.knecon.fforesight.service.layoutparser.server.utils; package com.knecon.fforesight.service.layoutparser.server.utils;
+ import java.io.InputStream;
+ import java.util.Optional;
+ import org.junit.jupiter.api.AfterEach;
+ import org.junit.jupiter.api.BeforeEach;
+ import org.junit.jupiter.api.extension.ExtendWith;
+ import org.springframework.amqp.rabbit.core.RabbitTemplate;
+ import org.springframework.beans.factory.annotation.Autowired;
+ import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
+ import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
+ import org.springframework.boot.test.context.SpringBootTest;
+ import org.springframework.boot.test.mock.mockito.MockBean;
+ import org.springframework.context.annotation.Bean;
+ import org.springframework.context.annotation.ComponentScan;
+ import org.springframework.context.annotation.Configuration;
+ import org.springframework.context.annotation.Import;
+ import org.springframework.context.annotation.Primary;
+ import org.springframework.core.io.ClassPathResource;
+ import org.springframework.test.context.junit.jupiter.SpringExtension;
import com.iqser.red.commons.jackson.ObjectMapperFactory; import com.iqser.red.commons.jackson.ObjectMapperFactory;
import com.iqser.red.storage.commons.service.StorageService; import com.iqser.red.storage.commons.service.StorageService;
import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService; import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService;
@ -9,22 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.LayoutParsingStorage
import com.knecon.fforesight.service.layoutparser.server.Application; import com.knecon.fforesight.service.layoutparser.server.Application;
import com.knecon.fforesight.tenantcommons.TenantContext; import com.knecon.fforesight.tenantcommons.TenantContext;
import com.knecon.fforesight.tenantcommons.TenantsClient; import com.knecon.fforesight.tenantcommons.TenantsClient;
- import lombok.SneakyThrows;
- import org.junit.jupiter.api.AfterEach;
- import org.junit.jupiter.api.BeforeEach;
- import org.junit.jupiter.api.extension.ExtendWith;
- import org.springframework.amqp.rabbit.core.RabbitTemplate;
- import org.springframework.beans.factory.annotation.Autowired;
- import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
- import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
- import org.springframework.boot.test.context.SpringBootTest;
- import org.springframework.boot.test.mock.mockito.MockBean;
- import org.springframework.context.annotation.*;
- import org.springframework.core.io.ClassPathResource;
- import org.springframework.test.context.junit.jupiter.SpringExtension;
- import java.io.InputStream;
- import java.util.Optional;
+ import lombok.SneakyThrows;
@ExtendWith(SpringExtension.class) @ExtendWith(SpringExtension.class)
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT) @SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ -100,6 +106,7 @@ public abstract class AbstractTest {
return buildDefaultLayoutParsingRequest(LayoutParsingType.REDACT_MANAGER); return buildDefaultLayoutParsingRequest(LayoutParsingType.REDACT_MANAGER);
} }
protected LayoutParsingRequest buildDefaultLayoutParsingRequest(LayoutParsingType layoutParsingType) { protected LayoutParsingRequest buildDefaultLayoutParsingRequest(LayoutParsingType layoutParsingType) {
return LayoutParsingRequest.builder() return LayoutParsingRequest.builder()
@ -116,6 +123,7 @@ public abstract class AbstractTest {
.build(); .build();
} }
@SneakyThrows @SneakyThrows
protected LayoutParsingRequest prepareStorage(String file, String cvServiceResponseFile, String imageInfoFile) { protected LayoutParsingRequest prepareStorage(String file, String cvServiceResponseFile, String imageInfoFile) {
@ -152,7 +160,6 @@ public abstract class AbstractTest {
@ComponentScan("com.knecon.fforesight.service.layoutparser") @ComponentScan("com.knecon.fforesight.service.layoutparser")
public static class TestConfiguration { public static class TestConfiguration {
@Bean @Bean
@Primary @Primary
public StorageService inmemoryStorage() { public StorageService inmemoryStorage() {