Compare commits

...

2 Commits

Author SHA1 Message Date
yhampe
0f11d901f6 RED-7964: Prototype Visual Layout Parsing:
added layer for visual layout parsing results (true by default)
added label to drawn boxes
2023-12-21 08:54:03 +01:00
yhampe
be2350b289 RED-7964: Prototype Visual Layout Parsing:
drawing result from visual layout parsing into viewer document
2023-12-12 15:28:17 +01:00
67 changed files with 2526 additions and 238 deletions

View File

@ -16,6 +16,6 @@ deploy:
reports:
dotenv: version.env
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
- if: $CI_COMMIT_BRANCH =~ /^release/
- if: $CI_COMMIT_TAG

View File

@ -1,39 +1,44 @@
# PDF Layout Parser Micro-Service: layout-parser
# PDF Layout Parser Micro-Service: layout-parser
## Introduction
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache
PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser
micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
### Key Steps in the PDF Layout Parsing Process:
* **Text Position Extraction:**
The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
* **Word Segmentation and Text Block Formation:**
Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
* **Text Block Classification:**
The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties, distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties,
distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
* **Layout Coherence Establishment:**
The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other elements in a logical and structured manner.
The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other
elements in a logical and structured manner.
* **Output Generation in Various Formats:**
Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream
micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
### Optional Enhancements:
* **ML-Based Table Extraction:**
For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON file, which are then integrated seamlessly into the layout structure.
For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON
file, which are then integrated seamlessly into the layout structure.
* **Image Classification using ML:**
Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction
option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It
proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream
micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
## Installation
@ -49,41 +54,55 @@ To build and test the micro-service, follow these steps:
### Clone the Repository:
bash
```
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
cd layout-parser
```
### Build the Project:
Use the following command to build the project using Gradle:
```
gradle clean build
```
### Run Tests:
Run the test suite using the following command:
```
gradle test
```
## Building a Custom Docker Image
To create a custom Docker image for the layout-parser micro-service, execute the provided script:
### Ensure Docker is Installed:
Ensure that Docker is installed and running on your system.
### Run the Image Building Script:
Execute the publish-custom-image script in the project directory:
```
./publish-custom-image
```
## Publishing to Internal Maven Repository
To publish the layout-parser micro-service to your internal Maven repository, execute the following command:
```
gradle -Pversion=buildVersion publish
```
Replace buildVersion with the desired version number.
## Additional Notes
Make sure to configure any necessary application properties before deploying the micro-service.
For advanced usage and configurations, refer to Kilian or Dom or preferably the source code.

View File

@ -1 +1 @@
version = 0.1-SNAPSHOT
version = 0.1 - SNAPSHOT

View File

@ -25,7 +25,6 @@ public class DocumentStructure {
@Schema(description = "The root EntryData represents the Document.")
EntryData root;
@Schema(description = "Object containing the extra field names, a table has in its properties field.")
public static class TableProperties {
@ -56,6 +55,7 @@ public class DocumentStructure {
public static final String RECTANGLE_DELIMITER = ";";
public static Rectangle2D parseRectangle2D(String bBox) {
List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList();

View File

@ -17,4 +17,5 @@ public class RowData {
List<ParagraphData> cellText;
@Schema(description = "The bounding box of this StructureObject. Is always exactly 4 values representing x, y, w, h, where x, y specify the lower left corner.")
float[] bBox;
}

View File

@ -8,13 +8,9 @@ import lombok.Builder;
@Builder
@Schema(description = "Object containing information about the layout parsing.")
public record LayoutParsingFinishedEvent(
@Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.")
Map<String, String> identifier,//
@Schema(description = "The duration of a single layout parsing in ms.")
long duration,//
@Schema(description = "The number of pages of the parsed document.")
int numberOfPages,//
@Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.")
String message) {
@Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.") Map<String, String> identifier,//
@Schema(description = "The duration of a single layout parsing in ms.") long duration,//
@Schema(description = "The number of pages of the parsed document.") int numberOfPages,//
@Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.") String message) {
}

View File

@ -5,4 +5,5 @@ public class LayoutParsingQueueNames {
public static final String LAYOUT_PARSING_REQUEST_QUEUE = "layout_parsing_request_queue";
public static final String LAYOUT_PARSING_DLQ = "layout_parsing_dead_letter_queue";
public static final String LAYOUT_PARSING_FINISHED_EVENT_QUEUE = "layout_parsing_response_queue";
}

View File

@ -17,24 +17,35 @@ public record LayoutParsingRequest(
Map<String, String> identifier,
@Schema(description = "Path to the original PDF file.")//
@NonNull String originFileStorageId,//
@NonNull String originFileStorageId,
//
@Schema(description = "Optional Path to the the visual layout parsing service file") Optional<String> visualLayoutParsingFileId,
@Schema(description = "Optional Path to the table extraction file.")//
Optional<String> tablesFileStorageId,//
Optional<String> tablesFileStorageId,
//
@Schema(description = "Optional Path to the image classification file.")//
Optional<String> imagesFileStorageId,//
Optional<String> imagesFileStorageId,
//
@Schema(description = "Path where the Document Structure File will be stored.")//
@NonNull String structureFileStorageId,//
@NonNull String structureFileStorageId,
//
@Schema(description = "Path where the Research Data File will be stored.")//
String researchDocumentStorageId,//
String researchDocumentStorageId,
//
@Schema(description = "Path where the Document Text File will be stored.")//
@NonNull String textBlockFileStorageId,//
@NonNull String textBlockFileStorageId,
//
@Schema(description = "Path where the Document Positions File will be stored.")//
@NonNull String positionBlockFileStorageId,//
@NonNull String positionBlockFileStorageId,
//
@Schema(description = "Path where the Document Pages File will be stored.")//
@NonNull String pageFileStorageId,//
@NonNull String pageFileStorageId,
//
@Schema(description = "Path where the Simplified Text File will be stored.")//
@NonNull String simplifiedTextStorageId,//
@NonNull String simplifiedTextStorageId,
//
@Schema(description = "Path where the Viewer Document PDF will be stored.")//
@NonNull String viewerDocumentStorageId) {

View File

@ -30,9 +30,12 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageB
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.CvTableParsingAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.ImageServiceResponseAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.VisualLayoutParsingAdapter;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
import com.knecon.fforesight.service.layoutparser.processor.services.RulingCleaningService;
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
@ -71,6 +74,7 @@ public class LayoutParsingPipeline {
private final BodyTextFrameService bodyTextFrameService;
private final RulingCleaningService rulingCleaningService;
private final TableExtractionService tableExtractionService;
private final VisualLayoutParsingAdapter visualLayoutParsingAdapter;
private final TaasBlockificationService taasBlockificationService;
private final DocuMineBlockificationService docuMineBlockificationService;
private final RedactManagerBlockificationService redactManagerBlockificationService;
@ -92,6 +96,11 @@ public class LayoutParsingPipeline {
tableServiceResponse = layoutParsingStorageService.getTablesFile(layoutParsingRequest.tablesFileStorageId().get());
}
VisualLayoutParsingResponse visualLayoutParsingResponse = new VisualLayoutParsingResponse();
if (layoutParsingRequest.visualLayoutParsingFileId().isPresent()) {
visualLayoutParsingResponse = layoutParsingStorageService.getExtractedTablesFile(layoutParsingRequest.visualLayoutParsingFileId().get());
}
ClassificationDocument classificationDocument = parseLayout(layoutParsingRequest.layoutParsingType(), originDocument, imageServiceResponse, tableServiceResponse);
Document documentGraph = DocumentGraphFactory.buildDocumentGraph(classificationDocument);
@ -100,8 +109,9 @@ public class LayoutParsingPipeline {
layoutParsingStorageService.storeDocumentData(layoutParsingRequest, DocumentDataMapper.toDocumentData(documentGraph));
layoutParsingStorageService.storeSimplifiedText(layoutParsingRequest, simplifiedSectionTextService.toSimplifiedText(documentGraph));
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells = visualLayoutParsingAdapter.buildExtractedTablesPerPage(visualLayoutParsingResponse);
try (var out = new ByteArrayOutputStream()) {
viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, false);
viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, extractedTableCells, false);
layoutParsingStorageService.storeViewerDocument(layoutParsingRequest, out);
}
@ -244,9 +254,9 @@ public class LayoutParsingPipeline {
private void increaseDocumentStatistics(ClassificationPage classificationPage, ClassificationDocument document) {
if (!classificationPage.isLandscape()) {
document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue());
}
if (!classificationPage.isLandscape()) {
document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue());
}
document.getFontCounter().addAll(classificationPage.getFontCounter().getCountPerValue());
document.getTextHeightCounter().addAll(classificationPage.getTextHeightCounter().getCountPerValue());
document.getFontStyleCounter().addAll(classificationPage.getFontStyleCounter().getCountPerValue());

View File

@ -24,6 +24,7 @@ import com.knecon.fforesight.service.layoutparser.internal.api.data.taas.Researc
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingRequest;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.tenantcommons.TenantContext;
import lombok.RequiredArgsConstructor;
@ -74,6 +75,16 @@ public class LayoutParsingStorageService {
}
public VisualLayoutParsingResponse getExtractedTablesFile(String storageId) throws IOException {
try (InputStream inputStream = getObject(storageId)) {
VisualLayoutParsingResponse visualLayoutParsingResponse = objectMapper.readValue(inputStream, VisualLayoutParsingResponse.class);
inputStream.close();
return visualLayoutParsingResponse;
}
}
public void storeDocumentData(LayoutParsingRequest layoutParsingRequest, DocumentData documentData) {
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.structureFileStorageId(), documentData.getDocumentStructure());
@ -83,7 +94,6 @@ public class LayoutParsingStorageService {
}
public void storeResearchDocumentData(LayoutParsingRequest layoutParsingRequest, ResearchDocumentData researchDocumentData) {
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.researchDocumentStorageId(), researchDocumentData);

View File

@ -14,7 +14,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.StringFre
import lombok.Data;
import lombok.NonNull;
import lombok.RequiredArgsConstructor;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
@Data
@RequiredArgsConstructor

View File

@ -19,4 +19,5 @@ public class PageContents {
Rectangle2D cropBox;
Rectangle2D mediaBox;
List<Ruling> rulings;
}

View File

@ -108,11 +108,13 @@ public class Boundary implements Comparable<Boundary> {
return splitBoundaries;
}
public IntStream intStream() {
return IntStream.range(start, end);
}
public static Boundary merge(Collection<Boundary> boundaries) {
int minStart = boundaries.stream().mapToInt(Boundary::start).min().orElseThrow(IllegalArgumentException::new);

View File

@ -105,6 +105,7 @@ public class Document implements GenericSemanticNode {
return streamAllSubNodes().collect(Collectors.groupingBy(SemanticNode::getType, Collectors.counting()));
}
@Override
public String toString() {

View File

@ -207,6 +207,7 @@ public class Table implements SemanticNode {
return IntStream.range(0, numberOfCols).boxed().map(col -> getCell(row, col));
}
/**
* Streams all TableCells row-wise and filters them with header == true.
*

View File

@ -109,10 +109,7 @@ public class AtomicTextBlock implements TextBlock {
}
public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData,
DocumentPositionData documentPositionData,
SemanticNode parent,
Page page) {
public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData, DocumentPositionData documentPositionData, SemanticNode parent, Page page) {
return AtomicTextBlock.builder()
.id(documentTextData.getId())

View File

@ -1,14 +1,12 @@
package com.knecon.fforesight.service.layoutparser.processor.model.table;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
@ -50,6 +48,7 @@ public class TablePageBlock extends AbstractPageBlock {
return getColCount() == 0 || getRowCount() == 0;
}
public List<List<Cell>> getRows() {
if (rows == null) {
@ -276,21 +275,17 @@ public class TablePageBlock extends AbstractPageBlock {
}
public boolean intersects(Cell cell1, Cell cell2) {
if (cell1.getHeight() <= 0 || cell2.getHeight() <= 0) {
return false;
}
double x0 = cell1.getX() + 2;
double y0 = cell1.getY() + 2;
return (cell2.x + cell2.width > x0 &&
cell2.y + cell2.height > y0 &&
cell2.x < x0 + cell1.getWidth() -2 &&
cell2.y < y0 + cell1.getHeight() -2);
return (cell2.x + cell2.width > x0 && cell2.y + cell2.height > y0 && cell2.x < x0 + cell1.getWidth() - 2 && cell2.y < y0 + cell1.getHeight() - 2);
}
@Override
public String getText() {
@ -328,8 +323,6 @@ public class TablePageBlock extends AbstractPageBlock {
}
public String getTextAsHtml() {
StringBuilder sb = new StringBuilder();

View File

@ -2,6 +2,7 @@ package com.knecon.fforesight.service.layoutparser.processor.model.text;
import java.util.ArrayList;
import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextNormalizationUtilities;
import lombok.Getter;

View File

@ -73,7 +73,7 @@ public class TextPageBlock extends AbstractPageBlock {
return sequences.get(0).getPageWidth();
}
public static TextPageBlock merge(List<TextPageBlock> textBlocksToMerge) {
@ -82,6 +82,7 @@ public class TextPageBlock extends AbstractPageBlock {
return fromTextPositionSequences(sequences);
}
public static TextPageBlock fromTextPositionSequences(List<TextPositionSequence> wordBlockList) {
TextPageBlock textBlock = null;
@ -133,7 +134,6 @@ public class TextPageBlock extends AbstractPageBlock {
}
/**
* Returns the minX value in pdf coordinate system.
* Note: This needs to use Pdf Coordinate System where {0,0} rotated with the page rotation.

View File

@ -234,6 +234,7 @@ public class TextPositionSequence implements CharSequence {
@JsonIgnore
@JsonAttribute(ignore = true)
public String getFontStyle() {
if (textPositions.get(0).getFontName() == null) {
return "standard";
}

View File

@ -9,10 +9,10 @@ import java.util.Map;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import lombok.RequiredArgsConstructor;
@ -20,8 +20,7 @@ import lombok.RequiredArgsConstructor;
@RequiredArgsConstructor
public class ImageServiceResponseAdapter {
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse ) {
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse) {
Map<Integer, List<ClassifiedImage>> images = new HashMap<>();
imageServiceResponse.getData().forEach(imageMetadata -> {

View File

@ -0,0 +1,52 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.adapter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingBox;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import lombok.RequiredArgsConstructor;
@Service
@RequiredArgsConstructor
public class VisualLayoutParsingAdapter {
public Map<Integer, List<VisualLayoutParsingResult>> buildExtractedTablesPerPage(VisualLayoutParsingResponse visualLayoutParsingResponse) {
Map<Integer, List<VisualLayoutParsingResult>> tableCells = new HashMap<>();
visualLayoutParsingResponse.getData()
.forEach(tableData -> tableCells.computeIfAbsent(tableData.getPage_idx(), tableCell -> new ArrayList<>()).addAll(convertTableCells(tableData.getBoxes())));
return tableCells;
}
public List<VisualLayoutParsingResult> convertTableCells(List<VisualLayoutParsingBox> tableObjects) {
List<VisualLayoutParsingResult> parsedTableCells = new ArrayList<>();
tableObjects.stream().forEach(t -> {
VisualLayoutParsingResult result = new VisualLayoutParsingResult();
result.setX0(t.getBox().getX1());
result.setX1(t.getBox().getX2());
result.setY0(t.getBox().getY1());
result.setY1(t.getBox().getY2());
result.setWidth(result.getX1() - result.getX0());
result.setHeight(result.getY1() - result.getY0());
result.setLabel(t.getLabel());
parsedTableCells.add(result);
});
return parsedTableCells;
}
}

View File

@ -0,0 +1,5 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
public class ExtractedTable {
}

View File

@ -0,0 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBox {
private VisualLayoutParsingBoxValue box;
private String label;
private float probability;
}

View File

@ -0,0 +1,19 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBoxValue {
private float x1;
private float y1;
private float x2;
private float y2;
}

View File

@ -0,0 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingData {
private int page_idx;
private List<VisualLayoutParsingBox> boxes;
}

View File

@ -0,0 +1,23 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import java.util.List;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResponse {
private String dossierId;
private String fileId;
private String targetFileExtension;
private String responseFileExtension;
private String X_TENANT_ID;
private List<VisualLayoutParsingData> data;
}

View File

@ -0,0 +1,22 @@
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResult {
private float x0;
private float y0;
private float x1;
private float y1;
private float width;
private float height;
private String label;
}

View File

@ -25,6 +25,7 @@ public class BodyTextFrameService {
private static final float RULING_HEIGHT_THRESHOLD = 0.15f; // multiplied with page height. Header/Footer Rulings must be within that border of the page.
private static final float RULING_WIDTH_THRESHOLD = 0.75f; // multiplied with page width. Header/Footer Rulings must be at least that wide.
public void setBodyTextFrames(ClassificationDocument classificationDocument, LayoutParsingType layoutParsingType) {
Rectangle bodyTextFrame = calculateBodyTextFrame(classificationDocument.getPages(), classificationDocument.getFontSizeCounter(), false, layoutParsingType);
@ -155,8 +156,9 @@ public class BodyTextFrameService {
continue;
}
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|| MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)) {
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || MarkedContentUtils.intersects(textBlock,
page.getMarkedContentBboxPerType(),
MarkedContentUtils.FOOTER)) {
continue;
}

View File

@ -22,7 +22,6 @@ public class DividingColumnDetectionService {
public List<Rectangle2D> detectColumns(PageContents pageContents) {
if (pageContents.getSortedTextPositionSequences().size() < 2) {
return List.of(pageContents.getCropBox());
}

View File

@ -72,11 +72,13 @@ public class GapDetectionService {
return mirrorY(RectangleTransformations.toRectangle2D(textPosition.getRectangle()));
}
private static Rectangle2D mirrorY(Rectangle2D rectangle2D) {
return new Rectangle2D.Double(rectangle2D.getX(), Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY()), rectangle2D.getWidth(), Math.abs(rectangle2D.getHeight()));
}
private static void addGapToLine(Rectangle2D currentTextPosition, Rectangle2D previousTextPosition, XGapsContext context) {
context.gapsInCurrentLine.add(new Rectangle2D.Double(previousTextPosition.getMaxX(),

View File

@ -6,7 +6,6 @@ import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.stream.Stream;
import com.iqser.red.commons.jackson.ObjectMapperFactory;
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
@ -51,7 +50,9 @@ public class GapsAcrossLinesService {
}
return columnFactory.outputGaps.stream()
.filter(gapAcrossLines -> columnFactory.outputGaps.stream().filter(gapAcrossLines::intersectsX).noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount))
.filter(gapAcrossLines -> columnFactory.outputGaps.stream()
.filter(gapAcrossLines::intersectsX)
.noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount))
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMinX() - mainBodyTextFrame.getMinX()) > DISTANCE_TO_BORDER_THRESHOLD)
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMaxX() - mainBodyTextFrame.getMaxX()) > DISTANCE_TO_BORDER_THRESHOLD)
.map(GapAcrossLines::getRectangle2D)

View File

@ -6,8 +6,8 @@ import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
import com.knecon.fforesight.service.layoutparser.processor.model.LineInformation;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import lombok.AllArgsConstructor;
import lombok.Getter;

View File

@ -16,8 +16,7 @@ public class MainBodyTextFrameExtractionService {
public Rectangle2D calculateMainBodyTextFrame(LineInformation lineInformation) {
Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream()
.collect(RectangleTransformations.collectBBox());
Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream().collect(RectangleTransformations.collectBBox());
return RectangleTransformations.pad(mainBodyTextFrame, mainBodyTextFrame.getWidth() * TEXT_FRAME_PAD_WIDTH, mainBodyTextFrame.getHeight() * TEXT_FRAME_PAD_HEIGHT);
}

View File

@ -52,7 +52,7 @@ public class PageContentExtractor {
stripper.getRulings()));
}
}
return textPositionSequencesPerPage;
}

View File

@ -5,9 +5,9 @@ import java.util.List;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedSectionText;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
@Service
public class SimplifiedSectionTextService {
@ -23,4 +23,5 @@ public class SimplifiedSectionTextService {
return SimplifiedSectionText.builder().sectionNumber(section.getTreeId().get(0)).text(section.getTextBlock().getSearchText()).build();
}
}

View File

@ -1,9 +1,20 @@
package com.knecon.fforesight.service.layoutparser.processor.services.blockification;
// TODO: figure out, why this fails the build
// import static com.knecon.fforesight.service.layoutparser.processor.services.factory.SearchTextWithTextPositionFactory.HEIGHT_PADDING;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.Orientation;
@ -11,12 +22,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.table.Ruling;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.RulingTextDirAdjustUtil;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;
@Service
@SuppressWarnings("all")
@ -83,13 +88,13 @@ public class TaasBlockificationService {
continue;
}
Matcher listIdentifierPattern = listIdentifier.matcher(currentTextBlock.getText());
boolean isListIdentifier = listIdentifierPattern.find();
boolean yGap = Math.abs(currentTextBlock.getPdfMaxY() - previousTextBlock.getPdfMinY()) < previousTextBlock.getMostPopularWordHeight() * Y_GAP_SPLIT_HEIGHT_MODIFIER;
boolean sameFont = previousTextBlock.getMostPopularWordFont().equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize();
boolean sameFont = previousTextBlock.getMostPopularWordFont()
.equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize();
// boolean yGap = previousTextBlock != null && currentTextBlock.getMinYDirAdj() - maxY > Math.min(word.getHeight(), prev.getHeight()) * Y_GAP_SPLIT_HEIGHT_MODIFIER;
boolean alignsXRight = Math.abs(currentTextBlock.getPdfMaxX() - previousTextBlock.getPdfMaxX()) < X_ALIGNMENT_THRESHOLD;
@ -119,8 +124,9 @@ public class TaasBlockificationService {
}
alreadyMerged.add(textPageBlock);
textBlocksToMerge.add(Stream.concat(Stream.of(textPageBlock),
textPageBlocks.stream().filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2)).peek(alreadyMerged::add))
.toList());
textPageBlocks.stream()
.filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2))
.peek(alreadyMerged::add)).toList());
}
return textBlocksToMerge.stream().map(TextPageBlock::merge).toList();
}
@ -163,8 +169,7 @@ public class TaasBlockificationService {
while (itty.hasNext()) {
TextPageBlock block = (TextPageBlock) itty.next();
if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold(
block.getMaxY(),
if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold(block.getMaxY(),
previous.getMaxY()) || previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation()
.equals(Orientation.RIGHT) && equalsWithThreshold(block.getMaxY(), previous.getMaxY())) {
previous.add(block);
@ -189,7 +194,6 @@ public class TaasBlockificationService {
TextPositionSequence prev = null;
// TODO: make static final constant
boolean wasSplitted = false;
Float splitX1 = null;
for (TextPositionSequence word : textPositions) {

View File

@ -5,7 +5,6 @@ import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -13,6 +12,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor;
@ -63,16 +63,16 @@ public class DocuMineClassificationService {
textBlock.setClassification(PageBlockType.OTHER);
return;
}
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())
) {
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())
) {
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()

View File

@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
import java.util.List;
import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -11,6 +10,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor;
@ -21,7 +21,6 @@ import lombok.extern.slf4j.Slf4j;
@RequiredArgsConstructor
public class RedactManagerClassificationService {
public void classifyDocument(ClassificationDocument document) {
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
@ -52,14 +51,16 @@ public class RedactManagerClassificationService {
textBlock.setClassification(PageBlockType.OTHER);
return;
}
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) {
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) {
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
.getMostPopular())) {
textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()

View File

@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
import java.util.List;
import java.util.regex.Pattern;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import org.springframework.stereotype.Service;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
@ -12,6 +11,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
import lombok.RequiredArgsConstructor;
@ -27,7 +27,6 @@ public class TaasClassificationService {
public void classifyDocument(ClassificationDocument document) {
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
log.debug("Document FontSize counters are: {}", document.getFontSizeCounter().getCountPerValue());
@ -57,11 +56,13 @@ public class TaasClassificationService {
textBlock.setClassification(PageBlockType.OTHER);
return;
}
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) {
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation())) {
textBlock.setClassification(PageBlockType.HEADER);
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) {
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
textBlock,
page.getRotation())) {
textBlock.setClassification(PageBlockType.FOOTER);
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()

View File

@ -18,8 +18,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationFooter;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationHeader;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
@ -31,6 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Pa
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Paragraph;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.IdBuilder;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;

View File

@ -8,10 +8,10 @@ import java.util.List;
import java.util.Locale;
import java.util.Objects;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
import com.knecon.fforesight.service.layoutparser.processor.model.text.RedTextPosition;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
import lombok.experimental.UtilityClass;
@ -110,6 +110,7 @@ public class SearchTextWithTextPositionFactory {
return context.stringIdx - context.lastHyphenIdx < MAX_HYPHEN_LINEBREAK_DISTANCE;
}
private static List<Boundary> mergeToBoundaries(List<Integer> integers) {
if (integers.isEmpty()) {
@ -125,8 +126,9 @@ public class SearchTextWithTextPositionFactory {
}
end = current + 1;
}
if (boundaries.isEmpty())
if (boundaries.isEmpty()) {
boundaries.add(new Boundary(start, end));
}
return boundaries;
}
@ -138,6 +140,7 @@ public class SearchTextWithTextPositionFactory {
}
}
private boolean isLineBreak(RedTextPosition currentTextPosition, RedTextPosition previousTextPosition) {
return Objects.equals(currentTextPosition.getUnicode(), "\n") || isDeltaYLargerThanTextHeight(currentTextPosition, previousTextPosition);

View File

@ -11,12 +11,12 @@ import java.util.Map;
import java.util.Set;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.utils.TableMergingUtility;
import lombok.experimental.UtilityClass;

View File

@ -8,15 +8,15 @@ import java.util.Set;
import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.TextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
import lombok.experimental.UtilityClass;

View File

@ -2,10 +2,10 @@ package com.knecon.fforesight.service.layoutparser.processor.services.factory;
import java.util.List;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.AccessLevel;
import lombok.experimental.FieldDefaults;

View File

@ -7,11 +7,11 @@ import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPage;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;

View File

@ -1,7 +1,6 @@
package com.knecon.fforesight.service.layoutparser.processor.services.mapper;
import java.awt.geom.Rectangle2D;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
@ -9,7 +8,6 @@ import java.util.Map;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Image;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;

View File

@ -329,6 +329,7 @@ public class PDFLinesTextStripper extends PDFTextStripper {
.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj()) < maximumGapSize;
}
@Override
public String getText(PDDocument doc) throws IOException {

View File

@ -25,10 +25,23 @@ import java.io.StringWriter;
import java.io.Writer;
import java.text.Bidi;
import java.text.Normalizer;
import java.util.*;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;
import lombok.Getter;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pdfbox.cos.COSDictionary;
@ -46,6 +59,8 @@ import org.apache.pdfbox.text.TextPositionComparator;
import com.knecon.fforesight.service.layoutparser.processor.utils.QuickSort;
import lombok.Getter;
/**
* This is just a copy except i only adjusted lines 594-607 cause this is a bug in Pdfbox.
* see S416.pdf
@ -194,40 +209,33 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
}
public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
{
PDMarkedContent markedContent = PDMarkedContent.create(tag, properties);
if (this.currentMarkedContents.isEmpty())
{
if (this.currentMarkedContents.isEmpty()) {
this.markedContents.add(markedContent);
}
else
{
PDMarkedContent currentMarkedContent =
this.currentMarkedContents.peek();
if (currentMarkedContent != null)
{
} else {
PDMarkedContent currentMarkedContent = this.currentMarkedContents.peek();
if (currentMarkedContent != null) {
currentMarkedContent.addMarkedContent(markedContent);
}
}
this.currentMarkedContents.push(markedContent);
}
@Override
public void endMarkedContentSequence()
{
if (!this.currentMarkedContents.isEmpty())
{
public void endMarkedContentSequence() {
if (!this.currentMarkedContents.isEmpty()) {
this.currentMarkedContents.pop();
}
}
public void xobject(PDXObject xobject)
{
if (!this.currentMarkedContents.isEmpty())
{
public void xobject(PDXObject xobject) {
if (!this.currentMarkedContents.isEmpty()) {
this.currentMarkedContents.peek().addXObject(xobject);
}
}
@ -635,7 +643,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
var normalized = normalize(line);
// normalized.stream().filter(l -> System.out.println(l.getText().contains("Plenarprotokoll 20/24")).findFirst().isPresent()
lastLineStartPosition = handleLineSeparation(current, lastPosition, lastLineStartPosition, maxHeightForLine);
writeLine(normalized, current.isParagraphStart);
line.clear();
@ -914,8 +921,7 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
textList.add(text);
}
}
if (!this.currentMarkedContents.isEmpty())
{
if (!this.currentMarkedContents.isEmpty()) {
this.currentMarkedContents.peek().addText(text);
}
}
@ -2102,7 +2108,9 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
return endParagraphWritten;
}
public void setEndParagraphWritten(){
public void setEndParagraphWritten() {
endParagraphWritten = true;
}
@ -2145,7 +2153,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
this.isHangingIndent = true;
}
}
}

View File

@ -1,10 +1,13 @@
package com.knecon.fforesight.service.layoutparser.processor.services.visualization;
import java.awt.Color;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.io.OutputStream;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.pdfbox.cos.COSDictionary;
@ -30,6 +33,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.visualization.
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.LayoutGrid;
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.PlacedText;
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.VisualizationsOnPage;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
import lombok.RequiredArgsConstructor;
import lombok.SneakyThrows;
@ -40,7 +45,6 @@ import lombok.extern.slf4j.Slf4j;
@RequiredArgsConstructor
public class ViewerDocumentService {
private static final String LAYER_NAME = "Layout grid";
private static final int FONT_SIZE = 10;
public static final float LINE_WIDTH = 1f;
@ -49,13 +53,18 @@ public class ViewerDocumentService {
@SneakyThrows
public void createViewerDocument(PDDocument pdDocument, Document document, OutputStream outputStream, boolean layerVisibilityDefaultValue) {
public void createViewerDocument(PDDocument pdDocument,
Document document,
OutputStream outputStream,
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells,
boolean layerVisibilityDefaultValue) {
LayoutGrid layoutGrid = layoutGridService.createLayoutGrid(document);
// PDDocument.save() is very slow, since it actually traverses the entire pdf and writes a new one.
// If we collect all COSDictionaries we changed and tell it explicitly to only add the changed ones by using saveIncremental it's very fast.
Set<COSDictionary> dictionariesToUpdate = new HashSet<>();
PDOptionalContentGroup layer = addLayerToDocument(pdDocument, dictionariesToUpdate, layerVisibilityDefaultValue);
PDOptionalContentGroup visualLayoutParsingLayer = addLayerToDocument(pdDocument, dictionariesToUpdate, true);
PDFont font = new PDType1Font(Standard14Fonts.FontName.HELVETICA);
for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) {
@ -114,6 +123,30 @@ public class ViewerDocumentService {
}
contentStream.restoreGraphicsState();
contentStream.endMarkedContent();
contentStream.beginMarkedContent(COSName.OC, visualLayoutParsingLayer);
contentStream.saveGraphicsState();
contentStream.setLineWidth(LINE_WIDTH);
for (VisualLayoutParsingResult tableCells : extractedTableCells.get(pageNumber)) {
contentStream.setStrokingColor(new Color(0xFF0000));
contentStream.addRect((float) tableCells.getX0(), (float) tableCells.getY0(), (float) tableCells.getWidth(), (float) tableCells.getHeight());
contentStream.stroke();
contentStream.setFont(font, FONT_SIZE);
contentStream.beginText();
Matrix textMatrix = new Matrix((float) textDeRotationMatrix.getScaleX(),
(float) textDeRotationMatrix.getShearX(),
(float) textDeRotationMatrix.getShearY(),
(float) textDeRotationMatrix.getScaleY(),
tableCells.getX0() ,
tableCells.getY0());
textMatrix.translate(-((font.getStringWidth(tableCells.getLabel()) / 1000) * FONT_SIZE + (2 * LINE_WIDTH) + 4), -FONT_SIZE);
contentStream.setTextMatrix(textMatrix);
contentStream.showText(tableCells.getLabel());
contentStream.endText();
}
contentStream.restoreGraphicsState();
contentStream.endMarkedContent();
}
dictionariesToUpdate.add(pdPage.getCOSObject());
dictionariesToUpdate.add(pdPage.getResources().getCOSObject());

View File

@ -1,12 +1,5 @@
package com.knecon.fforesight.service.layoutparser.processor.utils;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.experimental.UtilityClass;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.TextPosition;
import java.awt.geom.Rectangle2D;
import java.util.Collection;
import java.util.Collections;
@ -14,12 +7,22 @@ import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import org.apache.pdfbox.text.TextPosition;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import lombok.experimental.UtilityClass;
@UtilityClass
public class MarkedContentUtils {
public static final String HEADER = "Header";
public static final String FOOTER = "Footer";
public List<Rectangle2D> getMarkedContentBboxPerLine(List<PDMarkedContent> markedContents, String subtype) {
if (markedContents == null) {
@ -31,7 +34,8 @@ public class MarkedContentUtils {
.filter(m -> m.getProperties() != null)
.filter(m -> m.getProperties().getItem("Subtype") != null)
.filter(m -> ((COSName) m.getProperties().getItem("Subtype")).getName().equals(subtype))
.map(PDMarkedContent::getContents).flatMap(Collection::stream)
.map(PDMarkedContent::getContents)
.flatMap(Collection::stream)
.filter(t -> t instanceof TextPosition)
.map(t -> (TextPosition) t)
.filter(t -> !t.getUnicode().equals(" "))
@ -41,16 +45,19 @@ public class MarkedContentUtils {
return Collections.emptyList();
}
return markedContentByYPosition.values().stream()
.map(textPositions -> new TextPositionSequence(textPositions.stream()
.toList(), 0, true)
.getRectangle())
.map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight()))).collect(Collectors.toList());
return markedContentByYPosition.values()
.stream()
.map(textPositions -> new TextPositionSequence(textPositions.stream().toList(), 0, true).getRectangle())
.map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight())))
.collect(Collectors.toList());
}
public boolean intersects(TextPageBlock textBlock, Map<String, List<Rectangle2D>> markedContentBboxPerType, String type) {
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type).stream().anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type)
.stream()
.anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
}
}

View File

@ -19,10 +19,9 @@ public final class PositionUtils {
double threshold = textBlock.getMostPopularWordHeight() * 3;
if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX()
&& textBlock.getPdfMaxX() - threshold < btf.getTopLeft().getX() + btf.getWidth()
&& textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY()
&& textBlock.getPdfMaxY() - threshold < btf.getTopLeft().getY() + btf.getHeight()) {
if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX() && textBlock.getPdfMaxX() - threshold < btf.getTopLeft()
.getX() + btf.getWidth() && textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY() && textBlock.getPdfMaxY() - threshold < btf.getTopLeft()
.getY() + btf.getHeight()) {
return true;
} else {
return false;

View File

@ -41,11 +41,14 @@ public class RectangleTransformations {
return atomicTextBlocks.stream().flatMap(atomicTextBlock -> atomicTextBlock.getPositions().stream()).collect(new Rectangle2DBBoxCollector());
}
public static Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> collectBBox() {
return new Rectangle2DBBoxCollector();
}
public static PDRectangle toPDRectangleBBox(List<Rectangle> rectangles) {
Rectangle2D rectangle2D = RectangleTransformations.rectangleBBox(rectangles);
@ -70,6 +73,7 @@ public class RectangleTransformations {
return format("%f,%f,%f,%f", rectangle2D.getX(), rectangle2D.getY(), rectangle2D.getWidth(), rectangle2D.getHeight());
}
public static Rectangle2D rectangleBBox(List<Rectangle> rectangles) {
return rectangles.stream().map(RectangleTransformations::toRectangle2D).collect(new Rectangle2DBBoxCollector());
@ -84,6 +88,7 @@ public class RectangleTransformations {
-redactionLogRectangle.getHeight());
}
public static Rectangle2D toRectangle2D(PDRectangle rectangle) {
return new Rectangle2D.Double(rectangle.getLowerLeftX(), rectangle.getLowerLeftY(), rectangle.getWidth(), rectangle.getHeight());

View File

@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.utils;
import java.util.List;
import java.util.stream.Collectors;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;

View File

@ -28,15 +28,13 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPosit
*
* @author Ben Litchfield
*/
public class TextPositionSequenceComparator implements Comparator<TextPositionSequence>
{
public class TextPositionSequenceComparator implements Comparator<TextPositionSequence> {
@Override
public int compare(TextPositionSequence pos1, TextPositionSequence pos2)
{
public int compare(TextPositionSequence pos1, TextPositionSequence pos2) {
// only compare text that is in the same direction
int cmp1 = Float.compare(pos1.getDir().getDegrees(), pos2.getDir().getDegrees());
if (cmp1 != 0)
{
if (cmp1 != 0) {
return cmp1;
}
@ -54,19 +52,13 @@ public class TextPositionSequenceComparator implements Comparator<TextPositionSe
float yDifference = Math.abs(pos1YBottom - pos2YBottom);
// we will do a simple tolerance comparison
if (yDifference < .1 ||
pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
{
if (yDifference < .1 || pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom || pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom) {
return Float.compare(x1, x2);
}
else if (pos1YBottom < pos2YBottom)
{
} else if (pos1YBottom < pos2YBottom) {
return -1;
}
else
{
} else {
return 1;
}
}
}

View File

@ -14,7 +14,7 @@ import com.knecon.fforesight.service.layoutparser.server.queue.MessagingConfigur
import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration;
@ImportAutoConfiguration({MultiTenancyAutoConfiguration.class})
@Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class})
@Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class})
@SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class})
public class Application {

View File

@ -17,6 +17,7 @@ import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest
import lombok.SneakyThrows;
public class DocumentDataTests extends BuildDocumentTest {
@Test
@SneakyThrows
public void createDocumentDataForAllFiles() {
@ -36,11 +37,12 @@ public class DocumentDataTests extends BuildDocumentTest {
for (String pdfFileName : pdfFileNames) {
System.out.println(pdfFileName);
DocumentData documentData = DocumentDataMapper.toDocumentData(buildGraph(resource.getFile().toPath().getParent().relativize(Path.of(pdfFileName)).toString()));
File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile();
File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile();
outputFile.toPath().getParent().toFile().mkdirs();
try (var out = new FileOutputStream(outputFile.toString().replace(".pdf", ".json"))) {
ObjectMapperFactory.create().writeValue(out, documentData);
}
}
}
}

View File

@ -7,7 +7,6 @@ import java.util.HashMap;
import java.util.List;
import java.util.Map;
import lombok.SneakyThrows;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
@ -25,18 +24,24 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.springframework.core.io.ClassPathResource;
import lombok.SneakyThrows;
/**
* @author mkl
*/
@Disabled
public class ExtractMarkedContentTest {
final static File RESULT_FOLDER = new File("target/test-outputs", "extract");
@BeforeEach
public void setUpBeforeClass() throws Exception {
RESULT_FOLDER.mkdirs();
}
/**
* <a href="https://stackoverflow.com/questions/54956720/how-to-replace-a-space-with-a-word-while-extract-the-data-from-pdf-using-pdfbox">
* How to replace a space with a word while extract the data from PDF using PDFBox
@ -52,6 +57,7 @@ public class ExtractMarkedContentTest {
@Test
@SneakyThrows
public void testExtractTestWPhromma() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "testWPhromma.pdf");
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
@ -74,6 +80,7 @@ public class ExtractMarkedContentTest {
}
}
/**
* <a href="https://stackoverflow.com/questions/59192443/get-tags-related-bboxs-even-though-there-is-no-attributes-a-in-document-cata">
* Get tag's related BBox's even though there is no attributes (/A in document catalog structure) related to Layout in PDFBox?
@ -88,9 +95,10 @@ public class ExtractMarkedContentTest {
*/
@Test
public void testExtractResMultipage() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "res_multipage.pdf");
try(PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();
@ -111,6 +119,7 @@ public class ExtractMarkedContentTest {
}
}
/**
* <a href="https://issues.apache.org/jira/browse/PDFBOX-5613">
* PDFBOX-5613 - uncorrent paragraph split
@ -125,6 +134,7 @@ public class ExtractMarkedContentTest {
*/
@Test
public void testExtractDailyReport() throws IOException {
System.out.printf("\n\n===\n%s\n===\n", "Daily Report.pdf");
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
@ -147,10 +157,12 @@ public class ExtractMarkedContentTest {
}
}
/**
* @see #testExtractTestWPhromma()
*/
void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) {
String structType = null;
PDPage page = null;
if (node instanceof PDStructureElement) {
@ -166,7 +178,7 @@ public class ExtractMarkedContentTest {
if (base instanceof COSDictionary) {
showStructure(PDStructureNode.create((COSDictionary) base), markedContents);
} else if (base instanceof COSNumber) {
showContent(((COSNumber)base).intValue(), theseMarkedContents);
showContent(((COSNumber) base).intValue(), theseMarkedContents);
} else {
System.out.printf("?%s\n", base);
}
@ -174,7 +186,7 @@ public class ExtractMarkedContentTest {
} else if (object instanceof PDStructureNode) {
showStructure((PDStructureNode) object, markedContents);
} else if (object instanceof Integer) {
showContent((Integer)object, theseMarkedContents);
showContent((Integer) object, theseMarkedContents);
} else {
System.out.printf("?%s\n", object);
}
@ -183,21 +195,24 @@ public class ExtractMarkedContentTest {
System.out.printf("</%s>\n", structType);
}
/**
* @see #showStructure(PDStructureNode, Map)
* @see #testExtractTestWPhromma()
*/
void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) {
PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
StringBuilder textContent = new StringBuilder();
StringBuilder textContent = new StringBuilder();
for (Object object : contents) {
if (object instanceof TextPosition) {
textContent.append(((TextPosition)object).getUnicode());
textContent.append(((TextPosition) object).getUnicode());
} else {
textContent.append("?" + object);
}
}
System.out.printf("%s\n", textContent);
}
}

View File

@ -2,31 +2,20 @@ package com.knecon.fforesight.service.layoutparser.server.graph;
import java.io.FileOutputStream;
import java.nio.file.Path;
import java.util.List;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.io.ClassPathResource;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.NodeType;
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingType;
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationDocument;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
import com.knecon.fforesight.service.layoutparser.processor.services.classification.RedactManagerClassificationService;
import com.knecon.fforesight.service.layoutparser.processor.services.factory.DocumentGraphFactory;
import com.knecon.fforesight.service.layoutparser.processor.services.mapper.DocumentDataMapper;
import com.knecon.fforesight.service.layoutparser.processor.services.mapper.PropertiesMapper;
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.LayoutGridService;
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.ViewerDocumentService;
import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest;
@ -41,6 +30,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
@Autowired
private RedactManagerClassificationService redactManagerClassificationService;
@Test
@SneakyThrows
public void testViewerDocument() {
@ -55,6 +45,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
}
}
public ClassificationDocument buildClassificationDocument(PDDocument originDocument) {
ClassificationDocument classificationDocument = layoutParsingPipeline.parseLayout(LayoutParsingType.REDACT_MANAGER,

View File

@ -9,7 +9,6 @@ import org.apache.pdfbox.util.Matrix;
import org.junit.jupiter.api.Test;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.iqser.red.storage.commons.properties.StorageProperties;
import com.iqser.red.storage.commons.service.ObjectSerializer;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;

View File

@ -13,8 +13,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.PageInformatio
import com.knecon.fforesight.service.layoutparser.processor.services.DividingColumnDetectionService;
import com.knecon.fforesight.service.layoutparser.processor.services.GapDetectionService;
import com.knecon.fforesight.service.layoutparser.processor.services.GapsAcrossLinesService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows;
@ -36,7 +36,8 @@ class GapAcrossLinesDetectionServiceTest {
System.out.println("start column detection");
start = System.currentTimeMillis();
for (PageInformation pageInformation : pageInformations) {
GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(), pageInformation.getMainBodyTextFrame());
GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(),
pageInformation.getMainBodyTextFrame());
columnsPerPage.add(GapsAcrossLinesService.detectXGapsAcrossLines(gapInformation, pageInformation.getMainBodyTextFrame()));
}
System.out.printf("Finished column detection in %d ms%n", System.currentTimeMillis() - start);

View File

@ -12,8 +12,8 @@ import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.services.InvisibleTableDetectionService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;

View File

@ -22,7 +22,6 @@ class MainBodyTextFrameExtractionServiceTest {
String tmpFileName = Path.of("/tmp/").resolve(Path.of(fileName).getFileName() + "_MAIN_BODY.pdf").toString();
List<PageContents> sortedTextPositionSequence = PageContentExtractor.getSortedPageContents(fileName);
}
}

View File

@ -3,13 +3,12 @@ package com.knecon.fforesight.service.layoutparser.server.services;
import java.nio.file.Path;
import java.util.List;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageContents;
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows;
@ -27,14 +26,11 @@ class PageContentExtractorTest {
PdfDraw.drawRectanglesPerPageNumberedByLine(fileName,
textPositionPerPage.stream()
.map(t -> t.getSortedTextPositionSequences()
.stream()
.map(TextPositionSequence::getRectangle)
.map(RectangleTransformations::toRectangle2D)
.map(t -> t.getSortedTextPositionSequences().stream().map(TextPositionSequence::getRectangle).map(RectangleTransformations::toRectangle2D)
//.map(textPositionSequence -> (Rectangle2D) new Rectangle2D.Double(textPositionSequence.getMaxXDirAdj(), textPositionSequence.getMaxYDirAdj(), textPositionSequence.getWidth(), textPositionSequence.getHeight()))
.map(List::of)
.toList())
.toList(), tmpFileName);
.map(List::of).toList())
.toList(),
tmpFileName);
}
}

View File

@ -7,8 +7,8 @@ import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
import lombok.SneakyThrows;
@ -38,6 +38,7 @@ class PageInformationServiceTest {
System.out.printf("Finished drawing rectangles in %d ms%n", System.currentTimeMillis() - start);
}
@Test
@Disabled
@SneakyThrows

View File

@ -1,5 +1,25 @@
package com.knecon.fforesight.service.layoutparser.server.utils;
import java.io.InputStream;
import java.util.Optional;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Import;
import org.springframework.context.annotation.Primary;
import org.springframework.core.io.ClassPathResource;
import org.springframework.test.context.junit.jupiter.SpringExtension;
import com.iqser.red.commons.jackson.ObjectMapperFactory;
import com.iqser.red.storage.commons.service.StorageService;
import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService;
@ -9,22 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.LayoutParsingStorage
import com.knecon.fforesight.service.layoutparser.server.Application;
import com.knecon.fforesight.tenantcommons.TenantContext;
import com.knecon.fforesight.tenantcommons.TenantsClient;
import lombok.SneakyThrows;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.extension.ExtendWith;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.boot.test.mock.mockito.MockBean;
import org.springframework.context.annotation.*;
import org.springframework.core.io.ClassPathResource;
import org.springframework.test.context.junit.jupiter.SpringExtension;
import java.io.InputStream;
import java.util.Optional;
import lombok.SneakyThrows;
@ExtendWith(SpringExtension.class)
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@ -100,6 +106,7 @@ public abstract class AbstractTest {
return buildDefaultLayoutParsingRequest(LayoutParsingType.REDACT_MANAGER);
}
protected LayoutParsingRequest buildDefaultLayoutParsingRequest(LayoutParsingType layoutParsingType) {
return LayoutParsingRequest.builder()
@ -116,6 +123,7 @@ public abstract class AbstractTest {
.build();
}
@SneakyThrows
protected LayoutParsingRequest prepareStorage(String file, String cvServiceResponseFile, String imageInfoFile) {
@ -152,7 +160,6 @@ public abstract class AbstractTest {
@ComponentScan("com.knecon.fforesight.service.layoutparser")
public static class TestConfiguration {
@Bean
@Primary
public StorageService inmemoryStorage() {