Compare commits
2 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 0f11d901f6 | |
| | be2350b289 | |
47
README.md
@ -1,39 +1,44 @@
|
|||||||
# PDF Layout Parser Micro-Service: layout-parser
|
# PDF Layout Parser Micro-Service: layout-parser
|
||||||
|
|
||||||
## Introduction
|
## Introduction
|
||||||
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
|
|
||||||
|
The layout-parser micro-service is a powerful tool designed to efficiently extract structured information from PDF documents. Written in Java and utilizing Spring Boot 3, Apache
|
||||||
|
PDFBox, and RabbitMQ, this micro-service excels at parsing PDFs and organizing their content into a meaningful and coherent layout structure. Notably, the layout-parser
|
||||||
|
micro-service distinguishes itself by relying solely on advanced algorithms, rather than machine learning techniques.
|
||||||
|
|
||||||
### Key Steps in the PDF Layout Parsing Process:
|
### Key Steps in the PDF Layout Parsing Process:
|
||||||
|
|
||||||
* **Text Position Extraction:**
|
* **Text Position Extraction:**
|
||||||
The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
|
The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
|
||||||
|
|
||||||
* **Word Segmentation and Text Block Formation:**
|
* **Word Segmentation and Text Block Formation:**
|
||||||
Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
|
Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
|
||||||
|
|
||||||
* **Text Block Classification:**
|
* **Text Block Classification:**
|
||||||
The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties, distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
|
The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties,
|
||||||
|
distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
|
||||||
|
|
||||||
* **Layout Coherence Establishment:**
|
* **Layout Coherence Establishment:**
|
||||||
The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other elements in a logical and structured manner.
|
The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other
|
||||||
|
elements in a logical and structured manner.
|
||||||
|
|
||||||
* **Output Generation in Various Formats:**
|
* **Output Generation in Various Formats:**
|
||||||
Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
|
Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream
|
||||||
|
micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
|
||||||
|
|
||||||
### Optional Enhancements:
|
### Optional Enhancements:
|
||||||
|
|
||||||
* **ML-Based Table Extraction:**
|
* **ML-Based Table Extraction:**
|
||||||
For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON file, which are then integrated seamlessly into the layout structure.
|
For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON
|
||||||
|
file, which are then integrated seamlessly into the layout structure.
|
||||||
|
|
||||||
* **Image Classification using ML:**
|
* **Image Classification using ML:**
|
||||||
Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
|
Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction
|
||||||
|
option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
|
||||||
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It
|
||||||
|
proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream
|
||||||
|
micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
|
||||||
|
|
||||||
## Installation
|
## Installation
|
||||||
|
|
||||||
@ -49,41 +54,55 @@ To build and test the micro-service, follow these steps:
|
|||||||
### Clone the Repository:
|
### Clone the Repository:
|
||||||
|
|
||||||
bash
|
bash
|
||||||
|
|
||||||
```
|
```
|
||||||
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
|
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
|
||||||
cd layout-parser
|
cd layout-parser
|
||||||
```
|
```
|
||||||
|
|
||||||
### Build the Project:
|
### Build the Project:
|
||||||
|
|
||||||
Use the following command to build the project using Gradle:
|
Use the following command to build the project using Gradle:
|
||||||
|
|
||||||
```
|
```
|
||||||
gradle clean build
|
gradle clean build
|
||||||
```
|
```
|
||||||
|
|
||||||
### Run Tests:
|
### Run Tests:
|
||||||
|
|
||||||
Run the test suite using the following command:
|
Run the test suite using the following command:
|
||||||
|
|
||||||
```
|
```
|
||||||
gradle test
|
gradle test
|
||||||
```
|
```
|
||||||
|
|
||||||
## Building a Custom Docker Image
|
## Building a Custom Docker Image
|
||||||
|
|
||||||
To create a custom Docker image for the layout-parser micro-service, execute the provided script:
|
To create a custom Docker image for the layout-parser micro-service, execute the provided script:
|
||||||
|
|
||||||
### Ensure Docker is Installed:
|
### Ensure Docker is Installed:
|
||||||
|
|
||||||
Ensure that Docker is installed and running on your system.
|
Ensure that Docker is installed and running on your system.
|
||||||
|
|
||||||
### Run the Image Building Script:
|
### Run the Image Building Script:
|
||||||
|
|
||||||
Execute the publish-custom-image script in the project directory:
|
Execute the publish-custom-image script in the project directory:
|
||||||
|
|
||||||
```
|
```
|
||||||
./publish-custom-image
|
./publish-custom-image
|
||||||
```
|
```
|
||||||
|
|
||||||
## Publishing to Internal Maven Repository
|
## Publishing to Internal Maven Repository
|
||||||
|
|
||||||
To publish the layout-parser micro-service to your internal Maven repository, execute the following command:
|
To publish the layout-parser micro-service to your internal Maven repository, execute the following command:
|
||||||
|
|
||||||
```
|
```
|
||||||
gradle -Pversion=buildVersion publish
|
gradle -Pversion=buildVersion publish
|
||||||
```
|
```
|
||||||
|
|
||||||
Replace buildVersion with the desired version number.
|
Replace buildVersion with the desired version number.
|
||||||
|
|
||||||
## Additional Notes
|
## Additional Notes
|
||||||
|
|
||||||
Make sure to configure any necessary application properties before deploying the micro-service.
|
Make sure to configure any necessary application properties before deploying the micro-service.
|
||||||
For advanced usage and configurations, refer to Kilian or Dom or preferably the source code.
|
For advanced usage and configurations, refer to Kilian or Dom or preferably the source code.
|
||||||
|
|||||||
@ -1 +1 @@
|
|||||||
version = 0.1-SNAPSHOT
|
version = 0.1 - SNAPSHOT
|
||||||
|
|||||||
@ -25,7 +25,6 @@ public class DocumentStructure {
|
|||||||
@Schema(description = "The root EntryData represents the Document.")
|
@Schema(description = "The root EntryData represents the Document.")
|
||||||
EntryData root;
|
EntryData root;
|
||||||
|
|
||||||
|
|
||||||
@Schema(description = "Object containing the extra field names, a table has in its properties field.")
|
@Schema(description = "Object containing the extra field names, a table has in its properties field.")
|
||||||
public static class TableProperties {
|
public static class TableProperties {
|
||||||
|
|
||||||
@ -56,6 +55,7 @@ public class DocumentStructure {
|
|||||||
|
|
||||||
public static final String RECTANGLE_DELIMITER = ";";
|
public static final String RECTANGLE_DELIMITER = ";";
|
||||||
|
|
||||||
|
|
||||||
public static Rectangle2D parseRectangle2D(String bBox) {
|
public static Rectangle2D parseRectangle2D(String bBox) {
|
||||||
|
|
||||||
List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList();
|
List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER)).map(Float::parseFloat).toList();
|
||||||
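The hunk above shows only the first line of `parseRectangle2D`, so the following is a minimal sketch of how a `;`-delimited bounding box could be turned into a rectangle. It assumes the `x;y;w;h` order given in the `bBox` schema description elsewhere in this diff. The class name `BBoxParsingSketch` and the length check are illustrative additions, not part of the repository.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; only RECTANGLE_DELIMITER and the first stream line come from the diff.
final class BBoxParsingSketch {

    static final String RECTANGLE_DELIMITER = ";";

    // Parses "x;y;w;h" into a rectangle whose (x, y) is the lower left corner.
    static Rectangle2D parseRectangle2D(String bBox) {
        List<Float> floats = Arrays.stream(bBox.split(RECTANGLE_DELIMITER))
                .map(Float::parseFloat)
                .toList();
        // Assumption: anything other than exactly four values is treated as malformed input.
        if (floats.size() != 4) {
            throw new IllegalArgumentException("Expected 4 values (x, y, w, h) but got " + floats.size());
        }
        return new Rectangle2D.Float(floats.get(0), floats.get(1), floats.get(2), floats.get(3));
    }
}
```

Failing fast on a wrong value count keeps a malformed string from surfacing later as an `IndexOutOfBoundsException`.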
|
|||||||
@ -17,4 +17,5 @@ public class RowData {
|
|||||||
List<ParagraphData> cellText;
|
List<ParagraphData> cellText;
|
||||||
@Schema(description = "The bounding box of this StructureObject. Is always exactly 4 values representing x, y, w, h, where x, y specify the lower left corner.")
|
@Schema(description = "The bounding box of this StructureObject. Is always exactly 4 values representing x, y, w, h, where x, y specify the lower left corner.")
|
||||||
float[] bBox;
|
float[] bBox;
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -8,13 +8,9 @@ import lombok.Builder;
|
|||||||
@Builder
|
@Builder
|
||||||
@Schema(description = "Object containing information about the layout parsing.")
|
@Schema(description = "Object containing information about the layout parsing.")
|
||||||
public record LayoutParsingFinishedEvent(
|
public record LayoutParsingFinishedEvent(
|
||||||
@Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.")
|
@Schema(description = "General purpose identifier. It is returned exactly the same way it is inserted with the LayoutParsingRequest.") Map<String, String> identifier,//
|
||||||
Map<String, String> identifier,//
|
@Schema(description = "The duration of a single layout parsing in ms.") long duration,//
|
||||||
@Schema(description = "The duration of a single layout parsing in ms.")
|
@Schema(description = "The number of pages of the parsed document.") int numberOfPages,//
|
||||||
long duration,//
|
@Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.") String message) {
|
||||||
@Schema(description = "The number of pages of the parsed document.")
|
|
||||||
int numberOfPages,//
|
|
||||||
@Schema(description = "A general message. It contains some information useful for a developer, like the paths where the files are stored. Not meant to be machine readable.")
|
|
||||||
String message) {
|
|
||||||
|
|
||||||
}
|
}
|
||||||
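To make the shape of the event easier to see, here is a sketch of a record with the same four components. It is a stand-in: Lombok's `@Builder` and the `@Schema` annotations are left out so that it compiles with the JDK alone, and the factory method `of` is a hypothetical convenience that does not appear in the diff.

```java
import java.util.Map;

// Stand-in for the record in the diff, reduced to plain JDK types.
record LayoutParsingFinishedEventSketch(Map<String, String> identifier,
                                        long duration,
                                        int numberOfPages,
                                        String message) {

    // Hypothetical factory: derives the duration in ms from start and end timestamps
    // and copies the identifier map so the event echoes it back unchanged.
    static LayoutParsingFinishedEventSketch of(Map<String, String> identifier,
                                               long startMillis,
                                               long endMillis,
                                               int numberOfPages,
                                               String message) {
        return new LayoutParsingFinishedEventSketch(Map.copyOf(identifier), endMillis - startMillis, numberOfPages, message);
    }
}
```

A call such as `LayoutParsingFinishedEventSketch.of(Map.of("fileId", "42"), 1000L, 1750L, 3, "stored")` yields an event whose `duration()` is 750.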
|
|||||||
@ -5,4 +5,5 @@ public class LayoutParsingQueueNames {
|
|||||||
public static final String LAYOUT_PARSING_REQUEST_QUEUE = "layout_parsing_request_queue";
|
public static final String LAYOUT_PARSING_REQUEST_QUEUE = "layout_parsing_request_queue";
|
||||||
public static final String LAYOUT_PARSING_DLQ = "layout_parsing_dead_letter_queue";
|
public static final String LAYOUT_PARSING_DLQ = "layout_parsing_dead_letter_queue";
|
||||||
public static final String LAYOUT_PARSING_FINISHED_EVENT_QUEUE = "layout_parsing_response_queue";
|
public static final String LAYOUT_PARSING_FINISHED_EVENT_QUEUE = "layout_parsing_response_queue";
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -17,24 +17,35 @@ public record LayoutParsingRequest(
|
|||||||
Map<String, String> identifier,
|
Map<String, String> identifier,
|
||||||
|
|
||||||
@Schema(description = "Path to the original PDF file.")//
|
@Schema(description = "Path to the original PDF file.")//
|
||||||
@NonNull String originFileStorageId,//
|
@NonNull String originFileStorageId,
|
||||||
|
//
|
||||||
|
|
||||||
|
@Schema(description = "Optional Path to the visual layout parsing service file") Optional<String> visualLayoutParsingFileId,
|
||||||
@Schema(description = "Optional Path to the table extraction file.")//
|
@Schema(description = "Optional Path to the table extraction file.")//
|
||||||
Optional<String> tablesFileStorageId,//
|
Optional<String> tablesFileStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Optional Path to the image classification file.")//
|
@Schema(description = "Optional Path to the image classification file.")//
|
||||||
Optional<String> imagesFileStorageId,//
|
Optional<String> imagesFileStorageId,
|
||||||
|
//
|
||||||
|
|
||||||
@Schema(description = "Path where the Document Structure File will be stored.")//
|
@Schema(description = "Path where the Document Structure File will be stored.")//
|
||||||
@NonNull String structureFileStorageId,//
|
@NonNull String structureFileStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Research Data File will be stored.")//
|
@Schema(description = "Path where the Research Data File will be stored.")//
|
||||||
String researchDocumentStorageId,//
|
String researchDocumentStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Document Text File will be stored.")//
|
@Schema(description = "Path where the Document Text File will be stored.")//
|
||||||
@NonNull String textBlockFileStorageId,//
|
@NonNull String textBlockFileStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Document Positions File will be stored.")//
|
@Schema(description = "Path where the Document Positions File will be stored.")//
|
||||||
@NonNull String positionBlockFileStorageId,//
|
@NonNull String positionBlockFileStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Document Pages File will be stored.")//
|
@Schema(description = "Path where the Document Pages File will be stored.")//
|
||||||
@NonNull String pageFileStorageId,//
|
@NonNull String pageFileStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Simplified Text File will be stored.")//
|
@Schema(description = "Path where the Simplified Text File will be stored.")//
|
||||||
@NonNull String simplifiedTextStorageId,//
|
@NonNull String simplifiedTextStorageId,
|
||||||
|
//
|
||||||
@Schema(description = "Path where the Viewer Document PDF will be stored.")//
|
@Schema(description = "Path where the Viewer Document PDF will be stored.")//
|
||||||
@NonNull String viewerDocumentStorageId) {
|
@NonNull String viewerDocumentStorageId) {
|
||||||
|
|
||||||
|
|||||||
@ -30,9 +30,12 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageB
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.CvTableParsingAdapter;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.CvTableParsingAdapter;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.ImageServiceResponseAdapter;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.ImageServiceResponseAdapter;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.adapter.VisualLayoutParsingAdapter;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.RulingCleaningService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.RulingCleaningService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
|
||||||
@ -71,6 +74,7 @@ public class LayoutParsingPipeline {
|
|||||||
private final BodyTextFrameService bodyTextFrameService;
|
private final BodyTextFrameService bodyTextFrameService;
|
||||||
private final RulingCleaningService rulingCleaningService;
|
private final RulingCleaningService rulingCleaningService;
|
||||||
private final TableExtractionService tableExtractionService;
|
private final TableExtractionService tableExtractionService;
|
||||||
|
private final VisualLayoutParsingAdapter visualLayoutParsingAdapter;
|
||||||
private final TaasBlockificationService taasBlockificationService;
|
private final TaasBlockificationService taasBlockificationService;
|
||||||
private final DocuMineBlockificationService docuMineBlockificationService;
|
private final DocuMineBlockificationService docuMineBlockificationService;
|
||||||
private final RedactManagerBlockificationService redactManagerBlockificationService;
|
private final RedactManagerBlockificationService redactManagerBlockificationService;
|
||||||
@ -92,6 +96,11 @@ public class LayoutParsingPipeline {
|
|||||||
tableServiceResponse = layoutParsingStorageService.getTablesFile(layoutParsingRequest.tablesFileStorageId().get());
|
tableServiceResponse = layoutParsingStorageService.getTablesFile(layoutParsingRequest.tablesFileStorageId().get());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
VisualLayoutParsingResponse visualLayoutParsingResponse = new VisualLayoutParsingResponse();
|
||||||
|
if (layoutParsingRequest.visualLayoutParsingFileId().isPresent()) {
|
||||||
|
visualLayoutParsingResponse = layoutParsingStorageService.getExtractedTablesFile(layoutParsingRequest.visualLayoutParsingFileId().get());
|
||||||
|
}
|
||||||
|
|
||||||
ClassificationDocument classificationDocument = parseLayout(layoutParsingRequest.layoutParsingType(), originDocument, imageServiceResponse, tableServiceResponse);
|
ClassificationDocument classificationDocument = parseLayout(layoutParsingRequest.layoutParsingType(), originDocument, imageServiceResponse, tableServiceResponse);
|
||||||
Document documentGraph = DocumentGraphFactory.buildDocumentGraph(classificationDocument);
|
Document documentGraph = DocumentGraphFactory.buildDocumentGraph(classificationDocument);
|
||||||
|
|
||||||
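The new block in this hunk follows the pattern "start with an empty response, replace it when the optional file id is present". As a sketch only, the same pattern can be written with `Optional.map(...).orElse(...)`. The helper name `loadOrDefault` is hypothetical. The storage call in the diff declares `IOException`, so it would need wrapping before it could be passed as a `Function`.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper illustrating the "optional input with a default" pattern.
final class OptionalInputSketch {

    // Applies the loader when a storage id is present, otherwise returns the fallback.
    static <T> T loadOrDefault(Optional<String> storageId, Function<String, T> loader, T fallback) {
        return storageId.map(loader).orElse(fallback);
    }
}
```

Either style gives the pipeline a non-null response object to pass on, so later steps do not need a null check.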
@ -100,8 +109,9 @@ public class LayoutParsingPipeline {
|
|||||||
layoutParsingStorageService.storeDocumentData(layoutParsingRequest, DocumentDataMapper.toDocumentData(documentGraph));
|
layoutParsingStorageService.storeDocumentData(layoutParsingRequest, DocumentDataMapper.toDocumentData(documentGraph));
|
||||||
layoutParsingStorageService.storeSimplifiedText(layoutParsingRequest, simplifiedSectionTextService.toSimplifiedText(documentGraph));
|
layoutParsingStorageService.storeSimplifiedText(layoutParsingRequest, simplifiedSectionTextService.toSimplifiedText(documentGraph));
|
||||||
|
|
||||||
|
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells = visualLayoutParsingAdapter.buildExtractedTablesPerPage(visualLayoutParsingResponse);
|
||||||
try (var out = new ByteArrayOutputStream()) {
|
try (var out = new ByteArrayOutputStream()) {
|
||||||
viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, false);
|
viewerDocumentService.createViewerDocument(originDocument, documentGraph, out, extractedTableCells, false);
|
||||||
layoutParsingStorageService.storeViewerDocument(layoutParsingRequest, out);
|
layoutParsingStorageService.storeViewerDocument(layoutParsingRequest, out);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -244,9 +254,9 @@ public class LayoutParsingPipeline {
|
|||||||
|
|
||||||
private void increaseDocumentStatistics(ClassificationPage classificationPage, ClassificationDocument document) {
|
private void increaseDocumentStatistics(ClassificationPage classificationPage, ClassificationDocument document) {
|
||||||
|
|
||||||
if (!classificationPage.isLandscape()) {
|
if (!classificationPage.isLandscape()) {
|
||||||
document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue());
|
document.getFontSizeCounter().addAll(classificationPage.getFontSizeCounter().getCountPerValue());
|
||||||
}
|
}
|
||||||
document.getFontCounter().addAll(classificationPage.getFontCounter().getCountPerValue());
|
document.getFontCounter().addAll(classificationPage.getFontCounter().getCountPerValue());
|
||||||
document.getTextHeightCounter().addAll(classificationPage.getTextHeightCounter().getCountPerValue());
|
document.getTextHeightCounter().addAll(classificationPage.getTextHeightCounter().getCountPerValue());
|
||||||
document.getFontStyleCounter().addAll(classificationPage.getFontStyleCounter().getCountPerValue());
|
document.getFontStyleCounter().addAll(classificationPage.getFontStyleCounter().getCountPerValue());
|
||||||
|
|||||||
@ -24,6 +24,7 @@ import com.knecon.fforesight.service.layoutparser.internal.api.data.taas.Researc
|
|||||||
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingRequest;
|
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingRequest;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
|
||||||
import com.knecon.fforesight.tenantcommons.TenantContext;
|
import com.knecon.fforesight.tenantcommons.TenantContext;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
@ -74,6 +75,16 @@ public class LayoutParsingStorageService {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
public VisualLayoutParsingResponse getExtractedTablesFile(String storageId) throws IOException {
|
||||||
|
|
||||||
|
try (InputStream inputStream = getObject(storageId)) {
|
||||||
|
VisualLayoutParsingResponse visualLayoutParsingResponse = objectMapper.readValue(inputStream, VisualLayoutParsingResponse.class);
|
||||||
|
inputStream.close();
|
||||||
|
return visualLayoutParsingResponse;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
public void storeDocumentData(LayoutParsingRequest layoutParsingRequest, DocumentData documentData) {
|
public void storeDocumentData(LayoutParsingRequest layoutParsingRequest, DocumentData documentData) {
|
||||||
|
|
||||||
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.structureFileStorageId(), documentData.getDocumentStructure());
|
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.structureFileStorageId(), documentData.getDocumentStructure());
|
||||||
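A note on `getExtractedTablesFile` above: a try-with-resources statement closes its resource automatically, so the explicit `inputStream.close()` inside the block is redundant, though harmless for streams whose `close()` is idempotent. The sketch below demonstrates this with a counting stream. The classes `TryWithResourcesSketch` and `CountingStream` are illustrative only, and plain byte reading stands in for the Jackson `readValue` call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative only: shows that try-with-resources calls close() exactly once.
final class TryWithResourcesSketch {

    // Stream that records how many times close() was invoked.
    static final class CountingStream extends ByteArrayInputStream {

        int closeCalls = 0;

        CountingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closeCalls++;
            super.close();
        }
    }

    // Reads the whole stream as UTF-8 text; the stream is closed when the try block exits.
    static String readAll(InputStream in) {
        try (in) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```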
@ -83,7 +94,6 @@ public class LayoutParsingStorageService {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
public void storeResearchDocumentData(LayoutParsingRequest layoutParsingRequest, ResearchDocumentData researchDocumentData) {
|
public void storeResearchDocumentData(LayoutParsingRequest layoutParsingRequest, ResearchDocumentData researchDocumentData) {
|
||||||
|
|
||||||
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.researchDocumentStorageId(), researchDocumentData);
|
storageService.storeJSONObject(TenantContext.getTenantId(), layoutParsingRequest.researchDocumentStorageId(), researchDocumentData);
|
||||||
|
|||||||
@ -14,7 +14,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.StringFre
|
|||||||
import lombok.Data;
|
import lombok.Data;
|
||||||
import lombok.NonNull;
|
import lombok.NonNull;
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
|
|
||||||
|
|
||||||
@Data
|
@Data
|
||||||
@RequiredArgsConstructor
|
@RequiredArgsConstructor
|
||||||
|
|||||||
@ -19,4 +19,5 @@ public class PageContents {
|
|||||||
Rectangle2D cropBox;
|
Rectangle2D cropBox;
|
||||||
Rectangle2D mediaBox;
|
Rectangle2D mediaBox;
|
||||||
List<Ruling> rulings;
|
List<Ruling> rulings;
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -108,11 +108,13 @@ public class Boundary implements Comparable<Boundary> {
|
|||||||
return splitBoundaries;
|
return splitBoundaries;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public IntStream intStream() {
|
public IntStream intStream() {
|
||||||
|
|
||||||
return IntStream.range(start, end);
|
return IntStream.range(start, end);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static Boundary merge(Collection<Boundary> boundaries) {
|
public static Boundary merge(Collection<Boundary> boundaries) {
|
||||||
|
|
||||||
int minStart = boundaries.stream().mapToInt(Boundary::start).min().orElseThrow(IllegalArgumentException::new);
|
int minStart = boundaries.stream().mapToInt(Boundary::start).min().orElseThrow(IllegalArgumentException::new);
|
||||||
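For orientation, here is a minimal sketch of the two `Boundary` methods touched by this hunk. `IntStream.range(start, end)` is half-open, so `end` itself is not included. The hunk cuts off after `minStart`, so computing the merged end as the maximum `end` is an assumption, and `BoundarySketch` is a stand-in record rather than the repository class.

```java
import java.util.Collection;
import java.util.stream.IntStream;

// Stand-in for Boundary, treated as the half-open interval [start, end).
record BoundarySketch(int start, int end) {

    // Streams every index covered by this boundary, excluding end.
    IntStream intStream() {
        return IntStream.range(start, end);
    }

    // Smallest boundary covering all inputs; throws IllegalArgumentException for an empty collection.
    static BoundarySketch merge(Collection<BoundarySketch> boundaries) {
        int minStart = boundaries.stream().mapToInt(BoundarySketch::start).min().orElseThrow(IllegalArgumentException::new);
        // Assumption: the merged end is the largest end among the inputs.
        int maxEnd = boundaries.stream().mapToInt(BoundarySketch::end).max().orElseThrow(IllegalArgumentException::new);
        return new BoundarySketch(minStart, maxEnd);
    }
}
```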
|
|||||||
@ -105,6 +105,7 @@ public class Document implements GenericSemanticNode {
|
|||||||
return streamAllSubNodes().collect(Collectors.groupingBy(SemanticNode::getType, Collectors.counting()));
|
return streamAllSubNodes().collect(Collectors.groupingBy(SemanticNode::getType, Collectors.counting()));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@Override
|
@Override
|
||||||
public String toString() {
|
public String toString() {
|
||||||
|
|
||||||
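The context line in this hunk counts nodes per type with `Collectors.groupingBy(..., Collectors.counting())`. A small self-contained sketch of that collector combination follows. The `NodeType` enum and its values are hypothetical placeholders for whatever `SemanticNode::getType` returns.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: counts how often each type occurs in a list.
final class NodeTypeCountSketch {

    // Hypothetical node types, not taken from the repository.
    enum NodeType { SECTION, PARAGRAPH, TABLE }

    // Groups equal types together and counts the members of each group.
    static Map<NodeType, Long> countByType(List<NodeType> types) {
        return types.stream().collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```

Types that never occur are absent from the resulting map rather than mapped to zero, which matters when the counts are printed or compared.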
|
|||||||
@ -207,6 +207,7 @@ public class Table implements SemanticNode {
|
|||||||
return IntStream.range(0, numberOfCols).boxed().map(col -> getCell(row, col));
|
return IntStream.range(0, numberOfCols).boxed().map(col -> getCell(row, col));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Streams all TableCells row-wise and filters them with header == true.
|
* Streams all TableCells row-wise and filters them with header == true.
|
||||||
*
|
*
|
||||||
|
|||||||
@ -109,10 +109,7 @@ public class AtomicTextBlock implements TextBlock {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData,
|
public static AtomicTextBlock fromAtomicTextBlockData(DocumentTextData documentTextData, DocumentPositionData documentPositionData, SemanticNode parent, Page page) {
|
||||||
DocumentPositionData documentPositionData,
|
|
||||||
SemanticNode parent,
|
|
||||||
Page page) {
|
|
||||||
|
|
||||||
return AtomicTextBlock.builder()
|
return AtomicTextBlock.builder()
|
||||||
.id(documentTextData.getId())
|
.id(documentTextData.getId())
|
||||||
|
|||||||
@ -1,14 +1,12 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.model.table;
|
package com.knecon.fforesight.service.layoutparser.processor.model.table;
|
||||||
|
|
||||||
import java.awt.geom.Point2D;
|
import java.awt.geom.Point2D;
|
||||||
import java.awt.geom.Rectangle2D;
|
|
||||||
import java.util.ArrayList;
|
import java.util.ArrayList;
|
||||||
import java.util.Collections;
|
import java.util.Collections;
|
||||||
import java.util.HashSet;
|
import java.util.HashSet;
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.Set;
|
import java.util.Set;
|
||||||
import java.util.TreeMap;
|
import java.util.TreeMap;
|
||||||
import java.util.stream.Collectors;
|
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
||||||
@ -50,6 +48,7 @@ public class TablePageBlock extends AbstractPageBlock {
|
|||||||
return getColCount() == 0 || getRowCount() == 0;
|
return getColCount() == 0 || getRowCount() == 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public List<List<Cell>> getRows() {
|
public List<List<Cell>> getRows() {
|
||||||
|
|
||||||
if (rows == null) {
|
if (rows == null) {
|
||||||
@ -276,21 +275,17 @@ public class TablePageBlock extends AbstractPageBlock {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
public boolean intersects(Cell cell1, Cell cell2) {
|
public boolean intersects(Cell cell1, Cell cell2) {
|
||||||
|
|
||||||
if (cell1.getHeight() <= 0 || cell2.getHeight() <= 0) {
|
if (cell1.getHeight() <= 0 || cell2.getHeight() <= 0) {
|
||||||
return false;
|
return false;
|
||||||
}
|
}
|
||||||
double x0 = cell1.getX() + 2;
|
double x0 = cell1.getX() + 2;
|
||||||
double y0 = cell1.getY() + 2;
|
double y0 = cell1.getY() + 2;
|
||||||
return (cell2.x + cell2.width > x0 &&
|
return (cell2.x + cell2.width > x0 && cell2.y + cell2.height > y0 && cell2.x < x0 + cell1.getWidth() - 2 && cell2.y < y0 + cell1.getHeight() - 2);
|
||||||
cell2.y + cell2.height > y0 &&
|
|
||||||
cell2.x < x0 + cell1.getWidth() -2 &&
|
|
||||||
cell2.y < y0 + cell1.getHeight() -2);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
@Override
|
@Override
|
||||||
public String getText() {
|
public String getText() {
|
||||||
|
|
||||||
@ -328,8 +323,6 @@ public class TablePageBlock extends AbstractPageBlock {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
public String getTextAsHtml() {
|
public String getTextAsHtml() {
|
||||||
|
|
||||||
StringBuilder sb = new StringBuilder();
|
StringBuilder sb = new StringBuilder();
|
||||||
|
|||||||
@ -2,6 +2,7 @@ package com.knecon.fforesight.service.layoutparser.processor.model.text;
|
|||||||
|
|
||||||
import java.util.ArrayList;
|
import java.util.ArrayList;
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.TextNormalizationUtilities;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.TextNormalizationUtilities;
|
||||||
|
|
||||||
import lombok.Getter;
|
import lombok.Getter;
|
||||||
|
|||||||
@ -82,6 +82,7 @@ public class TextPageBlock extends AbstractPageBlock {
|
|||||||
return fromTextPositionSequences(sequences);
|
return fromTextPositionSequences(sequences);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static TextPageBlock fromTextPositionSequences(List<TextPositionSequence> wordBlockList) {
|
public static TextPageBlock fromTextPositionSequences(List<TextPositionSequence> wordBlockList) {
|
||||||
|
|
||||||
TextPageBlock textBlock = null;
|
TextPageBlock textBlock = null;
|
||||||
@ -133,7 +134,6 @@ public class TextPageBlock extends AbstractPageBlock {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Returns the minX value in pdf coordinate system.
|
* Returns the minX value in pdf coordinate system.
|
||||||
* Note: This needs to use Pdf Coordinate System where {0,0} rotated with the page rotation.
|
* Note: This needs to use Pdf Coordinate System where {0,0} rotated with the page rotation.
|
||||||
|
|||||||
@ -234,6 +234,7 @@ public class TextPositionSequence implements CharSequence {
|
|||||||
@JsonIgnore
|
@JsonIgnore
|
||||||
@JsonAttribute(ignore = true)
|
@JsonAttribute(ignore = true)
|
||||||
public String getFontStyle() {
|
public String getFontStyle() {
|
||||||
|
|
||||||
if (textPositions.get(0).getFontName() == null) {
|
if (textPositions.get(0).getFontName() == null) {
|
||||||
return "standard";
|
return "standard";
|
||||||
}
|
}
|
||||||
|
|||||||
@ -9,10 +9,10 @@ import java.util.Map;
|
|||||||
|
|
||||||
import org.springframework.stereotype.Service;
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
|
|
||||||
@ -20,8 +20,7 @@ import lombok.RequiredArgsConstructor;
|
|||||||
@RequiredArgsConstructor
|
@RequiredArgsConstructor
|
||||||
public class ImageServiceResponseAdapter {
|
public class ImageServiceResponseAdapter {
|
||||||
|
|
||||||
|
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse) {
|
||||||
public Map<Integer, List<ClassifiedImage>> buildClassifiedImagesPerPage(ImageServiceResponse imageServiceResponse ) {
|
|
||||||
|
|
||||||
Map<Integer, List<ClassifiedImage>> images = new HashMap<>();
|
Map<Integer, List<ClassifiedImage>> images = new HashMap<>();
|
||||||
imageServiceResponse.getData().forEach(imageMetadata -> {
|
imageServiceResponse.getData().forEach(imageMetadata -> {
|
||||||
|
|||||||
@ -0,0 +1,52 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.adapter;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.stereotype.Service;

import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingBox;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResponse;
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;

import lombok.RequiredArgsConstructor;

@Service
@RequiredArgsConstructor
public class VisualLayoutParsingAdapter {

    public Map<Integer, List<VisualLayoutParsingResult>> buildExtractedTablesPerPage(VisualLayoutParsingResponse visualLayoutParsingResponse) {

        Map<Integer, List<VisualLayoutParsingResult>> tableCells = new HashMap<>();
        visualLayoutParsingResponse.getData()
                .forEach(tableData -> tableCells.computeIfAbsent(tableData.getPage_idx(), tableCell -> new ArrayList<>()).addAll(convertTableCells(tableData.getBoxes())));

        return tableCells;

    }


    public List<VisualLayoutParsingResult> convertTableCells(List<VisualLayoutParsingBox> tableObjects) {

        List<VisualLayoutParsingResult> parsedTableCells = new ArrayList<>();

        tableObjects.stream().forEach(t -> {
            VisualLayoutParsingResult result = new VisualLayoutParsingResult();
            result.setX0(t.getBox().getX1());
            result.setX1(t.getBox().getX2());
            result.setY0(t.getBox().getY1());
            result.setY1(t.getBox().getY2());
            result.setWidth(result.getX1() - result.getX0());
            result.setHeight(result.getY1() - result.getY0());
            result.setLabel(t.getLabel());
            parsedTableCells.add(result);
        });

        return parsedTableCells;

    }

}
@ -0,0 +1,5 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

public class ExtractedTable {

}
@ -0,0 +1,20 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

import java.util.List;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBox {

    private VisualLayoutParsingBoxValue box;
    private String label;
    private float probability;

}
@ -0,0 +1,19 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingBoxValue {

    private float x1;
    private float y1;
    private float x2;
    private float y2;

}
@ -0,0 +1,20 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

import java.util.List;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingData {

    private int page_idx;

    private List<VisualLayoutParsingBox> boxes;

}
@ -0,0 +1,23 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

import java.util.List;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResponse {

    private String dossierId;
    private String fileId;
    private String targetFileExtension;
    private String responseFileExtension;
    private String X_TENANT_ID;
    private List<VisualLayoutParsingData> data;

}
@ -0,0 +1,22 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.python_api.model.table;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class VisualLayoutParsingResult {

    private float x0;
    private float y0;
    private float x1;
    private float y1;
    private float width;
    private float height;
    private String label;

}
@ -25,6 +25,7 @@ public class BodyTextFrameService {
|
|||||||
private static final float RULING_HEIGHT_THRESHOLD = 0.15f; // multiplied with page height. Header/Footer Rulings must be within that border of the page.
|
private static final float RULING_HEIGHT_THRESHOLD = 0.15f; // multiplied with page height. Header/Footer Rulings must be within that border of the page.
|
||||||
private static final float RULING_WIDTH_THRESHOLD = 0.75f; // multiplied with page width. Header/Footer Rulings must be at least that wide.
|
private static final float RULING_WIDTH_THRESHOLD = 0.75f; // multiplied with page width. Header/Footer Rulings must be at least that wide.
|
||||||
|
|
||||||
|
|
||||||
public void setBodyTextFrames(ClassificationDocument classificationDocument, LayoutParsingType layoutParsingType) {
|
public void setBodyTextFrames(ClassificationDocument classificationDocument, LayoutParsingType layoutParsingType) {
|
||||||
|
|
||||||
Rectangle bodyTextFrame = calculateBodyTextFrame(classificationDocument.getPages(), classificationDocument.getFontSizeCounter(), false, layoutParsingType);
|
Rectangle bodyTextFrame = calculateBodyTextFrame(classificationDocument.getPages(), classificationDocument.getFontSizeCounter(), false, layoutParsingType);
|
||||||
@ -155,8 +156,9 @@ public class BodyTextFrameService {
|
|||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || MarkedContentUtils.intersects(textBlock,
|
||||||
|| MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)) {
|
page.getMarkedContentBboxPerType(),
|
||||||
|
MarkedContentUtils.FOOTER)) {
|
||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@ -22,7 +22,6 @@ public class DividingColumnDetectionService {
|
|||||||
|
|
||||||
public List<Rectangle2D> detectColumns(PageContents pageContents) {
|
public List<Rectangle2D> detectColumns(PageContents pageContents) {
|
||||||
|
|
||||||
|
|
||||||
if (pageContents.getSortedTextPositionSequences().size() < 2) {
|
if (pageContents.getSortedTextPositionSequences().size() < 2) {
|
||||||
return List.of(pageContents.getCropBox());
|
return List.of(pageContents.getCropBox());
|
||||||
}
|
}
|
||||||
|
|||||||
@ -72,11 +72,13 @@ public class GapDetectionService {
|
|||||||
return mirrorY(RectangleTransformations.toRectangle2D(textPosition.getRectangle()));
|
return mirrorY(RectangleTransformations.toRectangle2D(textPosition.getRectangle()));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
private static Rectangle2D mirrorY(Rectangle2D rectangle2D) {
|
private static Rectangle2D mirrorY(Rectangle2D rectangle2D) {
|
||||||
|
|
||||||
return new Rectangle2D.Double(rectangle2D.getX(), Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY()), rectangle2D.getWidth(), Math.abs(rectangle2D.getHeight()));
|
return new Rectangle2D.Double(rectangle2D.getX(), Math.min(rectangle2D.getMinY(), rectangle2D.getMaxY()), rectangle2D.getWidth(), Math.abs(rectangle2D.getHeight()));
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
private static void addGapToLine(Rectangle2D currentTextPosition, Rectangle2D previousTextPosition, XGapsContext context) {
|
private static void addGapToLine(Rectangle2D currentTextPosition, Rectangle2D previousTextPosition, XGapsContext context) {
|
||||||
|
|
||||||
context.gapsInCurrentLine.add(new Rectangle2D.Double(previousTextPosition.getMaxX(),
|
context.gapsInCurrentLine.add(new Rectangle2D.Double(previousTextPosition.getMaxX(),
|
||||||
|
|||||||
@ -6,7 +6,6 @@ import java.util.LinkedList;
|
|||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.Queue;
|
import java.util.Queue;
|
||||||
import java.util.stream.Stream;
|
import java.util.stream.Stream;
|
||||||
import com.iqser.red.commons.jackson.ObjectMapperFactory;
|
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
|
||||||
|
|
||||||
@ -51,7 +50,9 @@ public class GapsAcrossLinesService {
|
|||||||
}
|
}
|
||||||
|
|
||||||
return columnFactory.outputGaps.stream()
|
return columnFactory.outputGaps.stream()
|
||||||
.filter(gapAcrossLines -> columnFactory.outputGaps.stream().filter(gapAcrossLines::intersectsX).noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount))
|
.filter(gapAcrossLines -> columnFactory.outputGaps.stream()
|
||||||
|
.filter(gapAcrossLines::intersectsX)
|
||||||
|
.noneMatch(gapAcrossLines1 -> gapAcrossLines1.lineCount > gapAcrossLines.lineCount))
|
||||||
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMinX() - mainBodyTextFrame.getMinX()) > DISTANCE_TO_BORDER_THRESHOLD)
|
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMinX() - mainBodyTextFrame.getMinX()) > DISTANCE_TO_BORDER_THRESHOLD)
|
||||||
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMaxX() - mainBodyTextFrame.getMaxX()) > DISTANCE_TO_BORDER_THRESHOLD)
|
.filter(gapAcrossLines -> Math.abs(gapAcrossLines.rectangle2D.getMaxX() - mainBodyTextFrame.getMaxX()) > DISTANCE_TO_BORDER_THRESHOLD)
|
||||||
.map(GapAcrossLines::getRectangle2D)
|
.map(GapAcrossLines::getRectangle2D)
|
||||||
|
|||||||
@ -6,8 +6,8 @@ import java.util.List;
|
|||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.GapInformation;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.LineInformation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.LineInformation;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
||||||
|
|
||||||
import lombok.AllArgsConstructor;
|
import lombok.AllArgsConstructor;
|
||||||
import lombok.Getter;
|
import lombok.Getter;
|
||||||
|
|||||||
@ -16,8 +16,7 @@ public class MainBodyTextFrameExtractionService {
|
|||||||
|
|
||||||
public Rectangle2D calculateMainBodyTextFrame(LineInformation lineInformation) {
|
public Rectangle2D calculateMainBodyTextFrame(LineInformation lineInformation) {
|
||||||
|
|
||||||
Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream()
|
Rectangle2D mainBodyTextFrame = lineInformation.getLineBBox().stream().collect(RectangleTransformations.collectBBox());
|
||||||
.collect(RectangleTransformations.collectBBox());
|
|
||||||
|
|
||||||
return RectangleTransformations.pad(mainBodyTextFrame, mainBodyTextFrame.getWidth() * TEXT_FRAME_PAD_WIDTH, mainBodyTextFrame.getHeight() * TEXT_FRAME_PAD_HEIGHT);
|
return RectangleTransformations.pad(mainBodyTextFrame, mainBodyTextFrame.getWidth() * TEXT_FRAME_PAD_WIDTH, mainBodyTextFrame.getHeight() * TEXT_FRAME_PAD_HEIGHT);
|
||||||
}
|
}
|
||||||
|
|||||||
@ -5,9 +5,9 @@ import java.util.List;
|
|||||||
import org.springframework.stereotype.Service;
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedSectionText;
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedSectionText;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.SimplifiedText;
|
|
||||||
|
|
||||||
@Service
|
@Service
|
||||||
public class SimplifiedSectionTextService {
|
public class SimplifiedSectionTextService {
|
||||||
@ -23,4 +23,5 @@ public class SimplifiedSectionTextService {
|
|||||||
|
|
||||||
return SimplifiedSectionText.builder().sectionNumber(section.getTreeId().get(0)).text(section.getTextBlock().getSearchText()).build();
|
return SimplifiedSectionText.builder().sectionNumber(section.getTreeId().get(0)).text(section.getTextBlock().getSearchText()).build();
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -1,9 +1,20 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.services.blockification;
|
package com.knecon.fforesight.service.layoutparser.processor.services.blockification;
|
||||||
|
|
||||||
|
|
||||||
// TODO: figure out, why this fails the build
|
// TODO: figure out, why this fails the build
|
||||||
// import static com.knecon.fforesight.service.layoutparser.processor.services.factory.SearchTextWithTextPositionFactory.HEIGHT_PADDING;
|
// import static com.knecon.fforesight.service.layoutparser.processor.services.factory.SearchTextWithTextPositionFactory.HEIGHT_PADDING;
|
||||||
|
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.util.HashSet;
|
||||||
|
import java.util.Iterator;
|
||||||
|
import java.util.LinkedList;
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.Set;
|
||||||
|
import java.util.regex.Matcher;
|
||||||
|
import java.util.regex.Pattern;
|
||||||
|
import java.util.stream.Stream;
|
||||||
|
|
||||||
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.Orientation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.Orientation;
|
||||||
@ -11,12 +22,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.table.Ruling;
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.RulingTextDirAdjustUtil;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.RulingTextDirAdjustUtil;
|
||||||
import org.springframework.stereotype.Service;
|
|
||||||
|
|
||||||
import java.util.*;
|
|
||||||
import java.util.regex.Matcher;
|
|
||||||
import java.util.regex.Pattern;
|
|
||||||
import java.util.stream.Stream;
|
|
||||||
|
|
||||||
@Service
|
@Service
|
||||||
@SuppressWarnings("all")
|
@SuppressWarnings("all")
|
||||||
@ -83,13 +88,13 @@ public class TaasBlockificationService {
|
|||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
Matcher listIdentifierPattern = listIdentifier.matcher(currentTextBlock.getText());
|
Matcher listIdentifierPattern = listIdentifier.matcher(currentTextBlock.getText());
|
||||||
boolean isListIdentifier = listIdentifierPattern.find();
|
boolean isListIdentifier = listIdentifierPattern.find();
|
||||||
|
|
||||||
boolean yGap = Math.abs(currentTextBlock.getPdfMaxY() - previousTextBlock.getPdfMinY()) < previousTextBlock.getMostPopularWordHeight() * Y_GAP_SPLIT_HEIGHT_MODIFIER;
|
boolean yGap = Math.abs(currentTextBlock.getPdfMaxY() - previousTextBlock.getPdfMinY()) < previousTextBlock.getMostPopularWordHeight() * Y_GAP_SPLIT_HEIGHT_MODIFIER;
|
||||||
|
|
||||||
boolean sameFont = previousTextBlock.getMostPopularWordFont().equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize();
|
boolean sameFont = previousTextBlock.getMostPopularWordFont()
|
||||||
|
.equals(currentTextBlock.getMostPopularWordFont()) && previousTextBlock.getMostPopularWordFontSize() == currentTextBlock.getMostPopularWordFontSize();
|
||||||
// boolean yGap = previousTextBlock != null && currentTextBlock.getMinYDirAdj() - maxY > Math.min(word.getHeight(), prev.getHeight()) * Y_GAP_SPLIT_HEIGHT_MODIFIER;
|
// boolean yGap = previousTextBlock != null && currentTextBlock.getMinYDirAdj() - maxY > Math.min(word.getHeight(), prev.getHeight()) * Y_GAP_SPLIT_HEIGHT_MODIFIER;
|
||||||
|
|
||||||
boolean alignsXRight = Math.abs(currentTextBlock.getPdfMaxX() - previousTextBlock.getPdfMaxX()) < X_ALIGNMENT_THRESHOLD;
|
boolean alignsXRight = Math.abs(currentTextBlock.getPdfMaxX() - previousTextBlock.getPdfMaxX()) < X_ALIGNMENT_THRESHOLD;
|
||||||
@ -119,8 +124,9 @@ public class TaasBlockificationService {
|
|||||||
}
|
}
|
||||||
alreadyMerged.add(textPageBlock);
|
alreadyMerged.add(textPageBlock);
|
||||||
textBlocksToMerge.add(Stream.concat(Stream.of(textPageBlock),
|
textBlocksToMerge.add(Stream.concat(Stream.of(textPageBlock),
|
||||||
textPageBlocks.stream().filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2)).peek(alreadyMerged::add))
|
textPageBlocks.stream()
|
||||||
.toList());
|
.filter(textPageBlock2 -> textPageBlock.almostIntersects(textPageBlock2, INTERSECTS_Y_THRESHOLD, 0) && !alreadyMerged.contains(textPageBlock2))
|
||||||
|
.peek(alreadyMerged::add)).toList());
|
||||||
}
|
}
|
||||||
return textBlocksToMerge.stream().map(TextPageBlock::merge).toList();
|
return textBlocksToMerge.stream().map(TextPageBlock::merge).toList();
|
||||||
}
|
}
|
||||||
@ -163,8 +169,7 @@ public class TaasBlockificationService {
|
|||||||
while (itty.hasNext()) {
|
while (itty.hasNext()) {
|
||||||
TextPageBlock block = (TextPageBlock) itty.next();
|
TextPageBlock block = (TextPageBlock) itty.next();
|
||||||
|
|
||||||
if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold(
|
if (previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation().equals(Orientation.LEFT) && equalsWithThreshold(block.getMaxY(),
|
||||||
block.getMaxY(),
|
|
||||||
previous.getMaxY()) || previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation()
|
previous.getMaxY()) || previous != null && previous.getOrientation().equals(Orientation.LEFT) && block.getOrientation()
|
||||||
.equals(Orientation.RIGHT) && equalsWithThreshold(block.getMaxY(), previous.getMaxY())) {
|
.equals(Orientation.RIGHT) && equalsWithThreshold(block.getMaxY(), previous.getMaxY())) {
|
||||||
previous.add(block);
|
previous.add(block);
|
||||||
@ -189,7 +194,6 @@ public class TaasBlockificationService {
|
|||||||
TextPositionSequence prev = null;
|
TextPositionSequence prev = null;
|
||||||
// TODO: make static final constant
|
// TODO: make static final constant
|
||||||
|
|
||||||
|
|
||||||
boolean wasSplitted = false;
|
boolean wasSplitted = false;
|
||||||
Float splitX1 = null;
|
Float splitX1 = null;
|
||||||
for (TextPositionSequence word : textPositions) {
|
for (TextPositionSequence word : textPositions) {
|
||||||
|
|||||||
@ -5,7 +5,6 @@ import java.util.Locale;
|
|||||||
import java.util.regex.Matcher;
|
import java.util.regex.Matcher;
|
||||||
import java.util.regex.Pattern;
|
import java.util.regex.Pattern;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
|
||||||
import org.springframework.stereotype.Service;
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
@ -13,6 +12,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
@ -63,16 +63,16 @@ public class DocuMineClassificationService {
|
|||||||
textBlock.setClassification(PageBlockType.OTHER);
|
textBlock.setClassification(PageBlockType.OTHER);
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
|
textBlock,
|
||||||
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())
|
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
|
||||||
) {
|
.getMostPopular())) {
|
||||||
textBlock.setClassification(PageBlockType.HEADER);
|
textBlock.setClassification(PageBlockType.HEADER);
|
||||||
|
|
||||||
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
|
textBlock,
|
||||||
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())
|
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
|
||||||
) {
|
.getMostPopular())) {
|
||||||
textBlock.setClassification(PageBlockType.FOOTER);
|
textBlock.setClassification(PageBlockType.FOOTER);
|
||||||
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
||||||
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
|
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
|
||||||
|
|||||||
@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
|
|||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.regex.Pattern;
|
import java.util.regex.Pattern;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
|
||||||
import org.springframework.stereotype.Service;
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
@ -11,6 +10,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
@ -21,7 +21,6 @@ import lombok.extern.slf4j.Slf4j;
|
|||||||
@RequiredArgsConstructor
|
@RequiredArgsConstructor
|
||||||
public class RedactManagerClassificationService {
|
public class RedactManagerClassificationService {
|
||||||
|
|
||||||
|
|
||||||
public void classifyDocument(ClassificationDocument document) {
|
public void classifyDocument(ClassificationDocument document) {
|
||||||
|
|
||||||
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
|
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
|
||||||
@ -52,14 +51,16 @@ public class RedactManagerClassificationService {
|
|||||||
textBlock.setClassification(PageBlockType.OTHER);
|
textBlock.setClassification(PageBlockType.OTHER);
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
|
textBlock,
|
||||||
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) {
|
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
|
||||||
|
.getMostPopular())) {
|
||||||
textBlock.setClassification(PageBlockType.HEADER);
|
textBlock.setClassification(PageBlockType.HEADER);
|
||||||
|
|
||||||
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation()) && (document.getFontSizeCounter()
|
textBlock,
|
||||||
.getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter().getMostPopular())) {
|
page.getRotation()) && (document.getFontSizeCounter().getMostPopular() == null || textBlock.getHighestFontSize() <= document.getFontSizeCounter()
|
||||||
|
.getMostPopular())) {
|
||||||
textBlock.setClassification(PageBlockType.FOOTER);
|
textBlock.setClassification(PageBlockType.FOOTER);
|
||||||
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
||||||
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
|
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
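
Review note: the reflowed header and footer conditions in this hunk rely on `&&` binding tighter than `||`, so the font-size check only guards the body-text-frame test and not the marked-content test. Below is a minimal sketch of that shape with the grouping written out. The class name, method name and boolean parameters are hypothetical stand-ins for the calls in the hunk, not the project's API.

```java
public class PrecedenceSketch {

    // Hypothetical stand-in for the header check in the hunk:
    // intersectsMarkedHeader || (overBodyFrame && fontNotLarger).
    // In Java, && binds tighter than ||, so the parentheses below only
    // make the existing grouping explicit; they do not change the result.
    static boolean isHeaderLike(boolean intersectsMarkedHeader,
                                boolean overBodyFrame,
                                Float mostPopularFontSize,
                                float highestFontSize) {

        // A missing "most popular" font size is treated as "no constraint".
        boolean fontNotLarger = mostPopularFontSize == null || highestFontSize <= mostPopularFontSize;
        return intersectsMarkedHeader || (overBodyFrame && fontNotLarger);
    }

    public static void main(String[] args) {
        // Marked content wins even when the font is larger than the body font.
        System.out.println(isHeaderLike(true, false, 12f, 20f));
        // Above the body frame, but the font is larger than the body font.
        System.out.println(isHeaderLike(false, true, 12f, 20f));
    }
}
```

Writing the parentheses explicitly in the service code would make the intent easier to verify after future reformatting.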
|
||||||
|
|||||||
@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.services.classifica
|
|||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.regex.Pattern;
|
import java.util.regex.Pattern;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
|
||||||
import org.springframework.stereotype.Service;
|
import org.springframework.stereotype.Service;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
@ -12,6 +11,7 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageBlockType;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.BodyTextFrameService;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.utils.MarkedContentUtils;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.PositionUtils;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
@ -27,7 +27,6 @@ public class TaasClassificationService {
|
|||||||
|
|
||||||
public void classifyDocument(ClassificationDocument document) {
|
public void classifyDocument(ClassificationDocument document) {
|
||||||
|
|
||||||
|
|
||||||
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
|
List<Float> headlineFontSizes = document.getFontSizeCounter().getHighterThanMostPopular();
|
||||||
|
|
||||||
log.debug("Document FontSize counters are: {}", document.getFontSizeCounter().getCountPerValue());
|
log.debug("Document FontSize counters are: {}", document.getFontSizeCounter().getCountPerValue());
|
||||||
@ -57,11 +56,13 @@ public class TaasClassificationService {
|
|||||||
textBlock.setClassification(PageBlockType.OTHER);
|
textBlock.setClassification(PageBlockType.OTHER);
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER)
|
if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.HEADER) || PositionUtils.isOverBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isOverBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) {
|
textBlock,
|
||||||
|
page.getRotation())) {
|
||||||
textBlock.setClassification(PageBlockType.HEADER);
|
textBlock.setClassification(PageBlockType.HEADER);
|
||||||
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER)
|
} else if (MarkedContentUtils.intersects(textBlock, page.getMarkedContentBboxPerType(), MarkedContentUtils.FOOTER) || PositionUtils.isUnderBodyTextFrame(bodyTextFrame,
|
||||||
|| PositionUtils.isUnderBodyTextFrame(bodyTextFrame, textBlock, page.getRotation())) {
|
textBlock,
|
||||||
|
page.getRotation())) {
|
||||||
textBlock.setClassification(PageBlockType.FOOTER);
|
textBlock.setClassification(PageBlockType.FOOTER);
|
||||||
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
} else if (page.getPageNumber() == 1 && (PositionUtils.getHeightDifferenceBetweenChunkWordAndDocumentWord(textBlock,
|
||||||
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
|
document.getTextHeightCounter().getMostPopular()) > 2.5 && textBlock.getHighestFontSize() > document.getFontSizeCounter().getMostPopular() || page.getTextBlocks()
|
||||||
|
|||||||
@ -18,8 +18,6 @@ import com.knecon.fforesight.service.layoutparser.processor.model.Classification
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationFooter;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationFooter;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationHeader;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationHeader;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationPage;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
|
||||||
@ -31,6 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Pa
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Paragraph;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Paragraph;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.IdBuilder;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.IdBuilder;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
|
||||||
|
|
||||||
|
|||||||
@ -8,10 +8,10 @@ import java.util.List;
|
|||||||
import java.util.Locale;
|
import java.util.Locale;
|
||||||
import java.util.Objects;
|
import java.util.Objects;
|
||||||
|
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.RedTextPosition;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.RedTextPosition;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.Boundary;
|
|
||||||
|
|
||||||
import lombok.experimental.UtilityClass;
|
import lombok.experimental.UtilityClass;
|
||||||
|
|
||||||
@ -110,6 +110,7 @@ public class SearchTextWithTextPositionFactory {
|
|||||||
return context.stringIdx - context.lastHyphenIdx < MAX_HYPHEN_LINEBREAK_DISTANCE;
|
return context.stringIdx - context.lastHyphenIdx < MAX_HYPHEN_LINEBREAK_DISTANCE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
private static List<Boundary> mergeToBoundaries(List<Integer> integers) {
|
private static List<Boundary> mergeToBoundaries(List<Integer> integers) {
|
||||||
|
|
||||||
if (integers.isEmpty()) {
|
if (integers.isEmpty()) {
|
||||||
@ -125,8 +126,9 @@ public class SearchTextWithTextPositionFactory {
|
|||||||
}
|
}
|
||||||
end = current + 1;
|
end = current + 1;
|
||||||
}
|
}
|
||||||
if (boundaries.isEmpty())
|
if (boundaries.isEmpty()) {
|
||||||
boundaries.add(new Boundary(start, end));
|
boundaries.add(new Boundary(start, end));
|
||||||
|
}
|
||||||
return boundaries;
|
return boundaries;
|
||||||
}
|
}
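
Review note: the loop body of `mergeToBoundaries` is not visible in this hunk, so the following is only a sketch of the general technique (collapsing sorted indices into contiguous half-open ranges), not the project's implementation. `Range` is a hypothetical stand-in for `Boundary`, and unlike the `boundaries.isEmpty()` guard shown above, this sketch always appends the final run.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryMergeSketch {

    // Hypothetical stand-in for Boundary: a half-open range [start, end).
    record Range(int start, int end) {}

    // Collapses ascending indices into contiguous half-open ranges.
    static List<Range> merge(List<Integer> sortedIndices) {
        List<Range> ranges = new ArrayList<>();
        if (sortedIndices.isEmpty()) {
            return ranges;
        }
        int start = sortedIndices.get(0);
        int end = start + 1;
        for (int i = 1; i < sortedIndices.size(); i++) {
            int current = sortedIndices.get(i);
            if (current > end) {
                // Gap found: close the current run and start a new one.
                ranges.add(new Range(start, end));
                start = current;
            }
            end = current + 1;
        }
        // Always close the last run, whether or not earlier runs exist.
        ranges.add(new Range(start, end));
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(1, 2, 3, 7, 8, 12)));
    }
}
```

If the real method is meant to return every run, it may be worth checking that the `isEmpty()` guard does not drop the trailing run when earlier boundaries were already added.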
|
||||||
|
|
||||||
@ -138,6 +140,7 @@ public class SearchTextWithTextPositionFactory {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
private boolean isLineBreak(RedTextPosition currentTextPosition, RedTextPosition previousTextPosition) {
|
private boolean isLineBreak(RedTextPosition currentTextPosition, RedTextPosition previousTextPosition) {
|
||||||
|
|
||||||
return Objects.equals(currentTextPosition.getUnicode(), "\n") || isDeltaYLargerThanTextHeight(currentTextPosition, previousTextPosition);
|
return Objects.equals(currentTextPosition.getUnicode(), "\n") || isDeltaYLargerThanTextHeight(currentTextPosition, previousTextPosition);
|
||||||
|
|||||||
@ -11,12 +11,12 @@ import java.util.Map;
|
|||||||
import java.util.Set;
|
import java.util.Set;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Section;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.image.ClassifiedImage;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.TableMergingUtility;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.TableMergingUtility;
|
||||||
|
|
||||||
import lombok.experimental.UtilityClass;
|
import lombok.experimental.UtilityClass;
|
||||||
|
|||||||
@ -8,15 +8,15 @@ import java.util.Set;
|
|||||||
import java.util.stream.Collectors;
|
import java.util.stream.Collectors;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.AbstractPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.GenericSemanticNode;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.TextBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.TextBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.TextPositionOperations;
|
||||||
|
|
||||||
import lombok.experimental.UtilityClass;
|
import lombok.experimental.UtilityClass;
|
||||||
|
|||||||
@ -2,10 +2,10 @@ package com.knecon.fforesight.service.layoutparser.processor.services.factory;
|
|||||||
|
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.SemanticNode;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.textblock.AtomicTextBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
|
|
||||||
import lombok.AccessLevel;
|
import lombok.AccessLevel;
|
||||||
import lombok.experimental.FieldDefaults;
|
import lombok.experimental.FieldDefaults;
|
||||||
|
|||||||
@ -7,11 +7,11 @@ import java.util.List;
|
|||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
import java.util.NoSuchElementException;
|
import java.util.NoSuchElementException;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPage;
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPage;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentPositionData;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentTextData;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.DocumentTree;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Footer;
|
||||||
|
|||||||
@ -1,7 +1,6 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.services.mapper;
|
package com.knecon.fforesight.service.layoutparser.processor.services.mapper;
|
||||||
|
|
||||||
import java.awt.geom.Rectangle2D;
|
import java.awt.geom.Rectangle2D;
|
||||||
import java.util.Collections;
|
|
||||||
import java.util.HashMap;
|
import java.util.HashMap;
|
||||||
import java.util.Locale;
|
import java.util.Locale;
|
||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
@ -9,7 +8,6 @@ import java.util.Map;
|
|||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
|
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Image;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Image;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.ImageType;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Page;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.TableCell;
|
||||||
|
|
||||||
|
|||||||
@ -329,6 +329,7 @@ public class PDFLinesTextStripper extends PDFTextStripper {
|
|||||||
.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj()) < maximumGapSize;
|
.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj()) < maximumGapSize;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@Override
|
@Override
|
||||||
public String getText(PDDocument doc) throws IOException {
|
public String getText(PDDocument doc) throws IOException {
|
||||||
|
|
||||||
|
|||||||
@ -25,10 +25,23 @@ import java.io.StringWriter;
|
|||||||
import java.io.Writer;
|
import java.io.Writer;
|
||||||
import java.text.Bidi;
|
import java.text.Bidi;
|
||||||
import java.text.Normalizer;
|
import java.text.Normalizer;
|
||||||
import java.util.*;
|
import java.util.ArrayDeque;
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.util.Collections;
|
||||||
|
import java.util.Comparator;
|
||||||
|
import java.util.Deque;
|
||||||
|
import java.util.HashMap;
|
||||||
|
import java.util.Iterator;
|
||||||
|
import java.util.LinkedList;
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.Map;
|
||||||
|
import java.util.SortedMap;
|
||||||
|
import java.util.SortedSet;
|
||||||
|
import java.util.StringTokenizer;
|
||||||
|
import java.util.TreeMap;
|
||||||
|
import java.util.TreeSet;
|
||||||
import java.util.regex.Pattern;
|
import java.util.regex.Pattern;
|
||||||
|
|
||||||
import lombok.Getter;
|
|
||||||
import org.apache.commons.logging.Log;
|
import org.apache.commons.logging.Log;
|
||||||
import org.apache.commons.logging.LogFactory;
|
import org.apache.commons.logging.LogFactory;
|
||||||
import org.apache.pdfbox.cos.COSDictionary;
|
import org.apache.pdfbox.cos.COSDictionary;
|
||||||
@ -46,6 +59,8 @@ import org.apache.pdfbox.text.TextPositionComparator;
|
|||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.QuickSort;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.QuickSort;
|
||||||
|
|
||||||
|
import lombok.Getter;
|
||||||
|
|
||||||
/**
|
/**
|
||||||
 * This is just a copy, except that I only adjusted lines 594-607 because of a bug in PDFBox.
|
 * This is just a copy, except that I only adjusted lines 594-607 because of a bug in PDFBox.
|
||||||
* see S416.pdf
|
* see S416.pdf
|
||||||
@ -194,40 +209,33 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
|
public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
|
||||||
|
|
||||||
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
|
|
||||||
{
|
|
||||||
PDMarkedContent markedContent = PDMarkedContent.create(tag, properties);
|
PDMarkedContent markedContent = PDMarkedContent.create(tag, properties);
|
||||||
if (this.currentMarkedContents.isEmpty())
|
if (this.currentMarkedContents.isEmpty()) {
|
||||||
{
|
|
||||||
this.markedContents.add(markedContent);
|
this.markedContents.add(markedContent);
|
||||||
}
|
} else {
|
||||||
else
|
PDMarkedContent currentMarkedContent = this.currentMarkedContents.peek();
|
||||||
{
|
if (currentMarkedContent != null) {
|
||||||
PDMarkedContent currentMarkedContent =
|
|
||||||
this.currentMarkedContents.peek();
|
|
||||||
if (currentMarkedContent != null)
|
|
||||||
{
|
|
||||||
currentMarkedContent.addMarkedContent(markedContent);
|
currentMarkedContent.addMarkedContent(markedContent);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
this.currentMarkedContents.push(markedContent);
|
this.currentMarkedContents.push(markedContent);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@Override
|
@Override
|
||||||
public void endMarkedContentSequence()
|
public void endMarkedContentSequence() {
|
||||||
{
|
|
||||||
if (!this.currentMarkedContents.isEmpty())
|
if (!this.currentMarkedContents.isEmpty()) {
|
||||||
{
|
|
||||||
this.currentMarkedContents.pop();
|
this.currentMarkedContents.pop();
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public void xobject(PDXObject xobject)
|
public void xobject(PDXObject xobject) {
|
||||||
{
|
|
||||||
if (!this.currentMarkedContents.isEmpty())
|
if (!this.currentMarkedContents.isEmpty()) {
|
||||||
{
|
|
||||||
this.currentMarkedContents.peek().addXObject(xobject);
|
this.currentMarkedContents.peek().addXObject(xobject);
|
||||||
}
|
}
|
||||||
}
|
}
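
Review note: the three reformatted methods above follow a stack-based nesting pattern: a new marked-content element becomes a root when nothing is open, otherwise a child of the element on top of the stack, and is then pushed. Here is a self-contained sketch of that pattern, assuming a `Deque`-like stack. `Node` and the method names are hypothetical and stand in for `PDMarkedContent` and the stripper callbacks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MarkedContentStackSketch {

    // Hypothetical stand-in for PDMarkedContent.
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>();
    private final Deque<Node> open = new ArrayDeque<>();

    // Mirrors the begin callback: attach to the innermost open node, then push.
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    // Mirrors the end callback: an unbalanced end is silently ignored.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> roots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentStackSketch sketch = new MarkedContentStackSketch();
        sketch.begin("Artifact");
        sketch.begin("Span");
        sketch.end();
        sketch.end();
        sketch.begin("P");
        sketch.end();
        System.out.println(sketch.roots().size() + " root elements");
    }
}
```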
|
||||||
@ -635,7 +643,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
|
|||||||
var normalized = normalize(line);
|
var normalized = normalize(line);
|
||||||
// normalized.stream().filter(l -> System.out.println(l.getText().contains("Plenarprotokoll 20/24")).findFirst().isPresent()
|
// normalized.stream().filter(l -> System.out.println(l.getText().contains("Plenarprotokoll 20/24")).findFirst().isPresent()
|
||||||
|
|
||||||
|
|
||||||
lastLineStartPosition = handleLineSeparation(current, lastPosition, lastLineStartPosition, maxHeightForLine);
|
lastLineStartPosition = handleLineSeparation(current, lastPosition, lastLineStartPosition, maxHeightForLine);
|
||||||
writeLine(normalized, current.isParagraphStart);
|
writeLine(normalized, current.isParagraphStart);
|
||||||
line.clear();
|
line.clear();
|
||||||
@ -914,8 +921,7 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
|
|||||||
textList.add(text);
|
textList.add(text);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
if (!this.currentMarkedContents.isEmpty())
|
if (!this.currentMarkedContents.isEmpty()) {
|
||||||
{
|
|
||||||
this.currentMarkedContents.peek().addText(text);
|
this.currentMarkedContents.peek().addText(text);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@ -2102,7 +2108,9 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
|
|||||||
return endParagraphWritten;
|
return endParagraphWritten;
|
||||||
}
|
}
|
||||||
|
|
||||||
public void setEndParagraphWritten(){
|
|
||||||
|
public void setEndParagraphWritten() {
|
||||||
|
|
||||||
endParagraphWritten = true;
|
endParagraphWritten = true;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -2145,7 +2153,6 @@ public class PDFTextStripper extends LegacyPDFStreamEngine {
|
|||||||
this.isHangingIndent = true;
|
this.isHangingIndent = true;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -1,10 +1,13 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.services.visualization;
|
package com.knecon.fforesight.service.layoutparser.processor.services.visualization;
|
||||||
|
|
||||||
|
import java.awt.Color;
|
||||||
import java.awt.geom.AffineTransform;
|
import java.awt.geom.AffineTransform;
|
||||||
import java.awt.geom.Rectangle2D;
|
import java.awt.geom.Rectangle2D;
|
||||||
import java.io.IOException;
|
import java.io.IOException;
|
||||||
import java.io.OutputStream;
|
import java.io.OutputStream;
|
||||||
import java.util.HashSet;
|
import java.util.HashSet;
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.Map;
|
||||||
import java.util.Set;
|
import java.util.Set;
|
||||||
|
|
||||||
import org.apache.pdfbox.cos.COSDictionary;
|
import org.apache.pdfbox.cos.COSDictionary;
|
||||||
@ -30,6 +33,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.visualization.
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.LayoutGrid;
|
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.LayoutGrid;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.PlacedText;
|
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.PlacedText;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.VisualizationsOnPage;
|
import com.knecon.fforesight.service.layoutparser.processor.model.visualization.VisualizationsOnPage;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableCells;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.VisualLayoutParsingResult;
|
||||||
|
|
||||||
import lombok.RequiredArgsConstructor;
|
import lombok.RequiredArgsConstructor;
|
||||||
import lombok.SneakyThrows;
|
import lombok.SneakyThrows;
|
||||||
@ -40,7 +45,6 @@ import lombok.extern.slf4j.Slf4j;
|
|||||||
@RequiredArgsConstructor
|
@RequiredArgsConstructor
|
||||||
public class ViewerDocumentService {
|
public class ViewerDocumentService {
|
||||||
|
|
||||||
|
|
||||||
private static final String LAYER_NAME = "Layout grid";
|
private static final String LAYER_NAME = "Layout grid";
|
||||||
private static final int FONT_SIZE = 10;
|
private static final int FONT_SIZE = 10;
|
||||||
public static final float LINE_WIDTH = 1f;
|
public static final float LINE_WIDTH = 1f;
|
||||||
@ -49,13 +53,18 @@ public class ViewerDocumentService {
|
|||||||
|
|
||||||
|
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
public void createViewerDocument(PDDocument pdDocument, Document document, OutputStream outputStream, boolean layerVisibilityDefaultValue) {
|
public void createViewerDocument(PDDocument pdDocument,
|
||||||
|
Document document,
|
||||||
|
OutputStream outputStream,
|
||||||
|
Map<Integer, List<VisualLayoutParsingResult>> extractedTableCells,
|
||||||
|
boolean layerVisibilityDefaultValue) {
|
||||||
|
|
||||||
LayoutGrid layoutGrid = layoutGridService.createLayoutGrid(document);
|
LayoutGrid layoutGrid = layoutGridService.createLayoutGrid(document);
|
||||||
// PDDocument.save() is very slow, since it actually traverses the entire pdf and writes a new one.
|
// PDDocument.save() is very slow, since it actually traverses the entire pdf and writes a new one.
|
||||||
// If we collect all COSDictionaries we changed and tell it explicitly to only add the changed ones by using saveIncremental it's very fast.
|
// If we collect all COSDictionaries we changed and tell it explicitly to only add the changed ones by using saveIncremental it's very fast.
|
||||||
Set<COSDictionary> dictionariesToUpdate = new HashSet<>();
|
Set<COSDictionary> dictionariesToUpdate = new HashSet<>();
|
||||||
PDOptionalContentGroup layer = addLayerToDocument(pdDocument, dictionariesToUpdate, layerVisibilityDefaultValue);
|
PDOptionalContentGroup layer = addLayerToDocument(pdDocument, dictionariesToUpdate, layerVisibilityDefaultValue);
|
||||||
|
PDOptionalContentGroup visualLayoutParsingLayer = addLayerToDocument(pdDocument, dictionariesToUpdate, true);
|
||||||
PDFont font = new PDType1Font(Standard14Fonts.FontName.HELVETICA);
|
PDFont font = new PDType1Font(Standard14Fonts.FontName.HELVETICA);
|
||||||
|
|
||||||
for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) {
|
for (int pageNumber = 0; pageNumber < pdDocument.getNumberOfPages(); pageNumber++) {
|
||||||
@ -114,6 +123,30 @@ public class ViewerDocumentService {
|
|||||||
}
|
}
|
||||||
contentStream.restoreGraphicsState();
|
contentStream.restoreGraphicsState();
|
||||||
contentStream.endMarkedContent();
|
contentStream.endMarkedContent();
|
||||||
|
|
||||||
|
contentStream.beginMarkedContent(COSName.OC, visualLayoutParsingLayer);
|
||||||
|
contentStream.saveGraphicsState();
|
||||||
|
|
||||||
|
contentStream.setLineWidth(LINE_WIDTH);
|
||||||
|
for (VisualLayoutParsingResult tableCells : extractedTableCells.getOrDefault(pageNumber, List.of())) {
|
||||||
|
contentStream.setStrokingColor(new Color(0xFF0000));
|
||||||
|
contentStream.addRect((float) tableCells.getX0(), (float) tableCells.getY0(), (float) tableCells.getWidth(), (float) tableCells.getHeight());
|
||||||
|
contentStream.stroke();
|
||||||
|
contentStream.setFont(font, FONT_SIZE);
|
||||||
|
contentStream.beginText();
|
||||||
|
Matrix textMatrix = new Matrix((float) textDeRotationMatrix.getScaleX(),
|
||||||
|
(float) textDeRotationMatrix.getShearX(),
|
||||||
|
(float) textDeRotationMatrix.getShearY(),
|
||||||
|
(float) textDeRotationMatrix.getScaleY(),
|
||||||
|
tableCells.getX0(),
|
||||||
|
tableCells.getY0());
|
||||||
|
textMatrix.translate(-((font.getStringWidth(tableCells.getLabel()) / 1000) * FONT_SIZE + (2 * LINE_WIDTH) + 4), -FONT_SIZE);
|
||||||
|
contentStream.setTextMatrix(textMatrix);
|
||||||
|
contentStream.showText(tableCells.getLabel());
|
||||||
|
contentStream.endText();
|
||||||
|
}
|
||||||
|
contentStream.restoreGraphicsState();
|
||||||
|
contentStream.endMarkedContent();
|
||||||
}
|
}
|
||||||
dictionariesToUpdate.add(pdPage.getCOSObject());
|
dictionariesToUpdate.add(pdPage.getCOSObject());
|
||||||
dictionariesToUpdate.add(pdPage.getResources().getCOSObject());
|
dictionariesToUpdate.add(pdPage.getResources().getCOSObject());
|
||||||
|
|||||||
@ -1,12 +1,5 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.processor.utils;
|
package com.knecon.fforesight.service.layoutparser.processor.utils;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
|
||||||
import lombok.experimental.UtilityClass;
|
|
||||||
import org.apache.pdfbox.cos.COSName;
|
|
||||||
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
|
|
||||||
import org.apache.pdfbox.text.TextPosition;
|
|
||||||
|
|
||||||
import java.awt.geom.Rectangle2D;
|
import java.awt.geom.Rectangle2D;
|
||||||
import java.util.Collection;
|
import java.util.Collection;
|
||||||
import java.util.Collections;
|
import java.util.Collections;
|
||||||
@ -14,12 +7,22 @@ import java.util.List;
|
|||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
import java.util.stream.Collectors;
|
import java.util.stream.Collectors;
|
||||||
|
|
||||||
|
import org.apache.pdfbox.cos.COSName;
|
||||||
|
import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
|
||||||
|
import org.apache.pdfbox.text.TextPosition;
|
||||||
|
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
|
|
||||||
|
import lombok.experimental.UtilityClass;
|
||||||
|
|
||||||
@UtilityClass
|
@UtilityClass
|
||||||
public class MarkedContentUtils {
|
public class MarkedContentUtils {
|
||||||
|
|
||||||
public static final String HEADER = "Header";
|
public static final String HEADER = "Header";
|
||||||
public static final String FOOTER = "Footer";
|
public static final String FOOTER = "Footer";
|
||||||
|
|
||||||
|
|
||||||
public List<Rectangle2D> getMarkedContentBboxPerLine(List<PDMarkedContent> markedContents, String subtype) {
|
public List<Rectangle2D> getMarkedContentBboxPerLine(List<PDMarkedContent> markedContents, String subtype) {
|
||||||
|
|
||||||
if (markedContents == null) {
|
if (markedContents == null) {
|
||||||
@ -31,7 +34,8 @@ public class MarkedContentUtils {
|
|||||||
.filter(m -> m.getProperties() != null)
|
.filter(m -> m.getProperties() != null)
|
||||||
.filter(m -> m.getProperties().getItem("Subtype") != null)
|
.filter(m -> m.getProperties().getItem("Subtype") != null)
|
||||||
.filter(m -> ((COSName) m.getProperties().getItem("Subtype")).getName().equals(subtype))
|
.filter(m -> ((COSName) m.getProperties().getItem("Subtype")).getName().equals(subtype))
|
||||||
.map(PDMarkedContent::getContents).flatMap(Collection::stream)
|
.map(PDMarkedContent::getContents)
|
||||||
|
.flatMap(Collection::stream)
|
||||||
.filter(t -> t instanceof TextPosition)
|
.filter(t -> t instanceof TextPosition)
|
||||||
.map(t -> (TextPosition) t)
|
.map(t -> (TextPosition) t)
|
||||||
.filter(t -> !t.getUnicode().equals(" "))
|
.filter(t -> !t.getUnicode().equals(" "))
|
||||||
@ -41,16 +45,19 @@ public class MarkedContentUtils {
|
|||||||
return Collections.emptyList();
|
return Collections.emptyList();
|
||||||
}
|
}
|
||||||
|
|
||||||
return markedContentByYPosition.values().stream()
|
return markedContentByYPosition.values()
|
||||||
.map(textPositions -> new TextPositionSequence(textPositions.stream()
|
.stream()
|
||||||
.toList(), 0, true)
|
.map(textPositions -> new TextPositionSequence(textPositions.stream().toList(), 0, true).getRectangle())
|
||||||
.getRectangle())
|
.map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight())))
|
||||||
.map(t -> new Rectangle2D.Float(t.getTopLeft().getX(), t.getTopLeft().getY() - Math.abs(t.getHeight()), t.getWidth(), Math.abs(t.getHeight()))).collect(Collectors.toList());
|
.collect(Collectors.toList());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public boolean intersects(TextPageBlock textBlock, Map<String, List<Rectangle2D>> markedContentBboxPerType, String type) {
|
public boolean intersects(TextPageBlock textBlock, Map<String, List<Rectangle2D>> markedContentBboxPerType, String type) {
|
||||||
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type).stream().anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
|
|
||||||
|
return markedContentBboxPerType.get(type) != null && markedContentBboxPerType.get(type)
|
||||||
|
.stream()
|
||||||
|
.anyMatch(rectangle -> rectangle.intersects(textBlock.getPdfMinX(), textBlock.getPdfMinY(), textBlock.getWidth(), textBlock.getHeight()));
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
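Note on `intersects`: the reformatted method still calls `markedContentBboxPerType.get(type)` twice, once for the null check and once for the stream. A minimal sketch of the same check with a single lookup is shown below. It uses plain `Rectangle2D` values in place of `TextPageBlock`, and the class name is hypothetical; unlike the original it would also throw if the map held an explicit `null` value for the key.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.Map;

final class MarkedContentIntersectionSketch {

    /** Returns true if the block overlaps any bounding box registered under the given type. */
    static boolean intersects(Rectangle2D block, Map<String, List<Rectangle2D>> bboxPerType, String type) {
        // A missing key is treated as "no marked content of this type".
        return bboxPerType.getOrDefault(type, List.of())
                .stream()
                .anyMatch(rectangle -> rectangle.intersects(block));
    }

    public static void main(String[] args) {
        Map<String, List<Rectangle2D>> perType = Map.of("Header", List.of(new Rectangle2D.Double(0, 700, 500, 50)));
        System.out.println(intersects(new Rectangle2D.Double(10, 720, 100, 10), perType, "Header"));
        System.out.println(intersects(new Rectangle2D.Double(10, 720, 100, 10), perType, "Footer"));
    }
}
```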
|
|||||||
@ -19,10 +19,9 @@ public final class PositionUtils {
|
|||||||
|
|
||||||
double threshold = textBlock.getMostPopularWordHeight() * 3;
|
double threshold = textBlock.getMostPopularWordHeight() * 3;
|
||||||
|
|
||||||
if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX()
|
if (textBlock.getPdfMinX() + threshold > btf.getTopLeft().getX() && textBlock.getPdfMaxX() - threshold < btf.getTopLeft()
|
||||||
&& textBlock.getPdfMaxX() - threshold < btf.getTopLeft().getX() + btf.getWidth()
|
.getX() + btf.getWidth() && textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY() && textBlock.getPdfMaxY() - threshold < btf.getTopLeft()
|
||||||
&& textBlock.getPdfMinY() + threshold > btf.getTopLeft().getY()
|
.getY() + btf.getHeight()) {
|
||||||
&& textBlock.getPdfMaxY() - threshold < btf.getTopLeft().getY() + btf.getHeight()) {
|
|
||||||
return true;
|
return true;
|
||||||
} else {
|
} else {
|
||||||
return false;
|
return false;
|
||||||
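Note on the `PositionUtils` condition: the re-wrapped expression is hard to read, but it is a containment test with a tolerance of three times the most popular word height. A sketch of the same test on plain rectangles, one comparison per line, might look like the following. It assumes `btf.getTopLeft()` is the minimum corner of the frame, as the hunk's arithmetic suggests; the class and method names are hypothetical.

```java
import java.awt.geom.Rectangle2D;

final class WithinFrameSketch {

    /** Returns true if the block lies inside the frame, allowing it to stick out by up to threshold on each side. */
    static boolean isWithinFrame(Rectangle2D block, Rectangle2D frame, double threshold) {
        return block.getMinX() + threshold > frame.getMinX()
                && block.getMaxX() - threshold < frame.getMaxX()
                && block.getMinY() + threshold > frame.getMinY()
                && block.getMaxY() - threshold < frame.getMaxY();
    }

    public static void main(String[] args) {
        Rectangle2D frame = new Rectangle2D.Double(50, 50, 500, 700);
        // Word height 10 gives a threshold of 30, mirroring getMostPopularWordHeight() * 3.
        System.out.println(isWithinFrame(new Rectangle2D.Double(45, 100, 200, 20), frame, 30));
    }
}
```

Returning the boolean expression directly also removes the need for the `if (...) return true; else return false;` shape kept in the hunk.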
|
|||||||
@ -41,11 +41,14 @@ public class RectangleTransformations {
|
|||||||
|
|
||||||
return atomicTextBlocks.stream().flatMap(atomicTextBlock -> atomicTextBlock.getPositions().stream()).collect(new Rectangle2DBBoxCollector());
|
return atomicTextBlocks.stream().flatMap(atomicTextBlock -> atomicTextBlock.getPositions().stream()).collect(new Rectangle2DBBoxCollector());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> collectBBox() {
|
public static Collector<Rectangle2D, Rectangle2DBBoxCollector.BBox, Rectangle2D> collectBBox() {
|
||||||
|
|
||||||
return new Rectangle2DBBoxCollector();
|
return new Rectangle2DBBoxCollector();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static PDRectangle toPDRectangleBBox(List<Rectangle> rectangles) {
|
public static PDRectangle toPDRectangleBBox(List<Rectangle> rectangles) {
|
||||||
|
|
||||||
Rectangle2D rectangle2D = RectangleTransformations.rectangleBBox(rectangles);
|
Rectangle2D rectangle2D = RectangleTransformations.rectangleBBox(rectangles);
|
||||||
@ -70,6 +73,7 @@ public class RectangleTransformations {
|
|||||||
return format("%f,%f,%f,%f", rectangle2D.getX(), rectangle2D.getY(), rectangle2D.getWidth(), rectangle2D.getHeight());
|
return format("%f,%f,%f,%f", rectangle2D.getX(), rectangle2D.getY(), rectangle2D.getWidth(), rectangle2D.getHeight());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static Rectangle2D rectangleBBox(List<Rectangle> rectangles) {
|
public static Rectangle2D rectangleBBox(List<Rectangle> rectangles) {
|
||||||
|
|
||||||
return rectangles.stream().map(RectangleTransformations::toRectangle2D).collect(new Rectangle2DBBoxCollector());
|
return rectangles.stream().map(RectangleTransformations::toRectangle2D).collect(new Rectangle2DBBoxCollector());
|
||||||
@ -84,6 +88,7 @@ public class RectangleTransformations {
|
|||||||
-redactionLogRectangle.getHeight());
|
-redactionLogRectangle.getHeight());
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public static Rectangle2D toRectangle2D(PDRectangle rectangle) {
|
public static Rectangle2D toRectangle2D(PDRectangle rectangle) {
|
||||||
|
|
||||||
return new Rectangle2D.Double(rectangle.getLowerLeftX(), rectangle.getLowerLeftY(), rectangle.getWidth(), rectangle.getHeight());
|
return new Rectangle2D.Double(rectangle.getLowerLeftX(), rectangle.getLowerLeftY(), rectangle.getWidth(), rectangle.getHeight());
|
||||||
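Note on `collectBBox()`: the body of `Rectangle2DBBoxCollector` is not part of this diff. For readers unfamiliar with custom collectors, a bounding-box collector can be sketched with `Collector.of` as below. This is an illustration under that assumption, not the project's implementation; the class name and the `double[]` accumulator are hypothetical choices.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;
import java.util.stream.Collector;

final class BBoxCollectorSketch {

    /** Collects rectangles into their common bounding box; an empty stream yields an empty rectangle. */
    static Collector<Rectangle2D, double[], Rectangle2D> collectBBox() {
        return Collector.<Rectangle2D, double[], Rectangle2D>of(
                // Accumulator layout: {minX, minY, maxX, maxY}
                () -> new double[] {Double.POSITIVE_INFINITY, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY},
                (acc, r) -> {
                    acc[0] = Math.min(acc[0], r.getMinX());
                    acc[1] = Math.min(acc[1], r.getMinY());
                    acc[2] = Math.max(acc[2], r.getMaxX());
                    acc[3] = Math.max(acc[3], r.getMaxY());
                },
                // Combiner for parallel streams: merge two partial boxes.
                (a, b) -> new double[] {Math.min(a[0], b[0]), Math.min(a[1], b[1]), Math.max(a[2], b[2]), Math.max(a[3], b[3])},
                acc -> acc[0] > acc[2]
                        ? new Rectangle2D.Double()
                        : new Rectangle2D.Double(acc[0], acc[1], acc[2] - acc[0], acc[3] - acc[1]));
    }

    public static void main(String[] args) {
        Rectangle2D bbox = List.of(new Rectangle2D.Double(0, 0, 10, 10), new Rectangle2D.Double(5, 5, 20, 5))
                .stream()
                .collect(collectBBox());
        System.out.println(bbox);
    }
}
```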
|
|||||||
@ -3,7 +3,6 @@ package com.knecon.fforesight.service.layoutparser.processor.utils;
|
|||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.stream.Collectors;
|
import java.util.stream.Collectors;
|
||||||
|
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPageBlock;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
|
|
||||||
|
|||||||
@ -28,15 +28,13 @@ import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPosit
|
|||||||
*
|
*
|
||||||
* @author Ben Litchfield
|
* @author Ben Litchfield
|
||||||
*/
|
*/
|
||||||
public class TextPositionSequenceComparator implements Comparator<TextPositionSequence>
|
public class TextPositionSequenceComparator implements Comparator<TextPositionSequence> {
|
||||||
{
|
|
||||||
@Override
|
@Override
|
||||||
public int compare(TextPositionSequence pos1, TextPositionSequence pos2)
|
public int compare(TextPositionSequence pos1, TextPositionSequence pos2) {
|
||||||
{
|
|
||||||
// only compare text that is in the same direction
|
// only compare text that is in the same direction
|
||||||
int cmp1 = Float.compare(pos1.getDir().getDegrees(), pos2.getDir().getDegrees());
|
int cmp1 = Float.compare(pos1.getDir().getDegrees(), pos2.getDir().getDegrees());
|
||||||
if (cmp1 != 0)
|
if (cmp1 != 0) {
|
||||||
{
|
|
||||||
return cmp1;
|
return cmp1;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -54,19 +52,13 @@ public class TextPositionSequenceComparator implements Comparator<TextPositionSe
|
|||||||
float yDifference = Math.abs(pos1YBottom - pos2YBottom);
|
float yDifference = Math.abs(pos1YBottom - pos2YBottom);
|
||||||
|
|
||||||
// we will do a simple tolerance comparison
|
// we will do a simple tolerance comparison
|
||||||
if (yDifference < .1 ||
|
if (yDifference < .1 || pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom || pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom) {
|
||||||
pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom ||
|
|
||||||
pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom)
|
|
||||||
{
|
|
||||||
return Float.compare(x1, x2);
|
return Float.compare(x1, x2);
|
||||||
}
|
} else if (pos1YBottom < pos2YBottom) {
|
||||||
else if (pos1YBottom < pos2YBottom)
|
|
||||||
{
|
|
||||||
return -1;
|
return -1;
|
||||||
}
|
} else {
|
||||||
else
|
|
||||||
{
|
|
||||||
return 1;
|
return 1;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
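Note on `TextPositionSequenceComparator`: the change is brace style only, but the ordering rule is easy to lose in the diff. Two sequences count as being on the same line when their bottoms differ by less than 0.1 or when one bottom falls inside the other's vertical span; same-line sequences are ordered by x, all others by bottom y. The sketch below reproduces that rule on a simplified, hypothetical `Box` type. It assumes y grows downward (`yTop <= yBottom`) and leaves out the text-direction check.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {

    /** Hypothetical stand-in for TextPositionSequence: a text run with its left edge and vertical span. */
    static final class Box {
        final String text;
        final float x;
        final float yTop;
        final float yBottom;

        Box(String text, float x, float yTop, float yBottom) {
            this.text = text;
            this.x = x;
            this.yTop = yTop;
            this.yBottom = yBottom;
        }
    }

    static final Comparator<Box> READING_ORDER = (a, b) -> {
        float yDifference = Math.abs(a.yBottom - b.yBottom);
        // Same line: nearly equal bottoms, or one bottom lies within the other's vertical span.
        boolean sameLine = yDifference < .1
                || b.yBottom >= a.yTop && b.yBottom <= a.yBottom
                || a.yBottom >= b.yTop && a.yBottom <= b.yBottom;
        if (sameLine) {
            return Float.compare(a.x, b.x);
        }
        return a.yBottom < b.yBottom ? -1 : 1;
    };

    public static void main(String[] args) {
        List<Box> boxes = new ArrayList<>(List.of(
                new Box("world", 50f, 90f, 100f),
                new Box("Hello", 10f, 91f, 100.05f),
                new Box("next", 10f, 110f, 120f)));
        boxes.sort(READING_ORDER);
        boxes.forEach(box -> System.out.println(box.text));
    }
}
```

A tolerance-based "same line" test like this is not guaranteed to be transitive, so on large or irregular inputs a sort using it may not satisfy the `Comparator` contract; that is worth keeping in mind when touching this class.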
|
|||||||
@ -14,7 +14,7 @@ import com.knecon.fforesight.service.layoutparser.server.queue.MessagingConfigur
|
|||||||
import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration;
|
import com.knecon.fforesight.tenantcommons.MultiTenancyAutoConfiguration;
|
||||||
|
|
||||||
@ImportAutoConfiguration({MultiTenancyAutoConfiguration.class})
|
@ImportAutoConfiguration({MultiTenancyAutoConfiguration.class})
|
||||||
@Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class})
|
@Import({MetricsConfiguration.class, StorageAutoConfiguration.class, LayoutParsingServiceProcessorConfiguration.class, MessagingConfiguration.class})
|
||||||
@SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class})
|
@SpringBootApplication(exclude = {SecurityAutoConfiguration.class, ManagementWebSecurityAutoConfiguration.class})
|
||||||
public class Application {
|
public class Application {
|
||||||
|
|
||||||
|
|||||||
@ -17,6 +17,7 @@ import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest
|
|||||||
import lombok.SneakyThrows;
|
import lombok.SneakyThrows;
|
||||||
|
|
||||||
public class DocumentDataTests extends BuildDocumentTest {
|
public class DocumentDataTests extends BuildDocumentTest {
|
||||||
|
|
||||||
@Test
|
@Test
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
public void createDocumentDataForAllFiles() {
|
public void createDocumentDataForAllFiles() {
|
||||||
@ -36,11 +37,12 @@ public class DocumentDataTests extends BuildDocumentTest {
|
|||||||
for (String pdfFileName : pdfFileNames) {
|
for (String pdfFileName : pdfFileNames) {
|
||||||
System.out.println(pdfFileName);
|
System.out.println(pdfFileName);
|
||||||
DocumentData documentData = DocumentDataMapper.toDocumentData(buildGraph(resource.getFile().toPath().getParent().relativize(Path.of(pdfFileName)).toString()));
|
DocumentData documentData = DocumentDataMapper.toDocumentData(buildGraph(resource.getFile().toPath().getParent().relativize(Path.of(pdfFileName)).toString()));
|
||||||
File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile();
|
File outputFile = Path.of(outPath).resolve(resource.getFile().toPath().relativize(Path.of(pdfFileName))).toFile();
|
||||||
outputFile.toPath().getParent().toFile().mkdirs();
|
outputFile.toPath().getParent().toFile().mkdirs();
|
||||||
try (var out = new FileOutputStream(outputFile.toString().replace(".pdf", ".json"))) {
|
try (var out = new FileOutputStream(outputFile.toString().replace(".pdf", ".json"))) {
|
||||||
ObjectMapperFactory.create().writeValue(out, documentData);
|
ObjectMapperFactory.create().writeValue(out, documentData);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|||||||
@ -7,7 +7,6 @@ import java.util.HashMap;
|
|||||||
import java.util.List;
|
import java.util.List;
|
||||||
import java.util.Map;
|
import java.util.Map;
|
||||||
|
|
||||||
import lombok.SneakyThrows;
|
|
||||||
import org.apache.pdfbox.Loader;
|
import org.apache.pdfbox.Loader;
|
||||||
import org.apache.pdfbox.cos.COSArray;
|
import org.apache.pdfbox.cos.COSArray;
|
||||||
import org.apache.pdfbox.cos.COSBase;
|
import org.apache.pdfbox.cos.COSBase;
|
||||||
@ -25,18 +24,24 @@ import org.junit.jupiter.api.Disabled;
|
|||||||
import org.junit.jupiter.api.Test;
|
import org.junit.jupiter.api.Test;
|
||||||
import org.springframework.core.io.ClassPathResource;
|
import org.springframework.core.io.ClassPathResource;
|
||||||
|
|
||||||
|
import lombok.SneakyThrows;
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* @author mkl
|
* @author mkl
|
||||||
*/
|
*/
|
||||||
@Disabled
|
@Disabled
|
||||||
public class ExtractMarkedContentTest {
|
public class ExtractMarkedContentTest {
|
||||||
|
|
||||||
final static File RESULT_FOLDER = new File("target/test-outputs", "extract");
|
final static File RESULT_FOLDER = new File("target/test-outputs", "extract");
|
||||||
|
|
||||||
|
|
||||||
@BeforeEach
|
@BeforeEach
|
||||||
public void setUpBeforeClass() throws Exception {
|
public void setUpBeforeClass() throws Exception {
|
||||||
|
|
||||||
RESULT_FOLDER.mkdirs();
|
RESULT_FOLDER.mkdirs();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* <a href="https://stackoverflow.com/questions/54956720/how-to-replace-a-space-with-a-word-while-extract-the-data-from-pdf-using-pdfbox">
|
* <a href="https://stackoverflow.com/questions/54956720/how-to-replace-a-space-with-a-word-while-extract-the-data-from-pdf-using-pdfbox">
|
||||||
* How to replace a space with a word while extract the data from PDF using PDFBox
|
* How to replace a space with a word while extract the data from PDF using PDFBox
|
||||||
@ -52,6 +57,7 @@ public class ExtractMarkedContentTest {
|
|||||||
@Test
|
@Test
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
public void testExtractTestWPhromma() throws IOException {
|
public void testExtractTestWPhromma() throws IOException {
|
||||||
|
|
||||||
System.out.printf("\n\n===\n%s\n===\n", "testWPhromma.pdf");
|
System.out.printf("\n\n===\n%s\n===\n", "testWPhromma.pdf");
|
||||||
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
||||||
|
|
||||||
@ -74,6 +80,7 @@ public class ExtractMarkedContentTest {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* <a href="https://stackoverflow.com/questions/59192443/get-tags-related-bboxs-even-though-there-is-no-attributes-a-in-document-cata">
|
* <a href="https://stackoverflow.com/questions/59192443/get-tags-related-bboxs-even-though-there-is-no-attributes-a-in-document-cata">
|
||||||
* Get tag's related BBox's even though there is no attributes (/A in document catalog structure) related to Layout in PDFBox?
|
* Get tag's related BBox's even though there is no attributes (/A in document catalog structure) related to Layout in PDFBox?
|
||||||
@ -88,9 +95,10 @@ public class ExtractMarkedContentTest {
|
|||||||
*/
|
*/
|
||||||
@Test
|
@Test
|
||||||
public void testExtractResMultipage() throws IOException {
|
public void testExtractResMultipage() throws IOException {
|
||||||
|
|
||||||
System.out.printf("\n\n===\n%s\n===\n", "res_multipage.pdf");
|
System.out.printf("\n\n===\n%s\n===\n", "res_multipage.pdf");
|
||||||
|
|
||||||
try(PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
||||||
|
|
||||||
Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();
|
Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();
|
||||||
|
|
||||||
@ -111,6 +119,7 @@ public class ExtractMarkedContentTest {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* <a href="https://issues.apache.org/jira/browse/PDFBOX-5613">
|
* <a href="https://issues.apache.org/jira/browse/PDFBOX-5613">
|
||||||
* PDFBOX-5613 - uncorrent paragraph split
|
* PDFBOX-5613 - uncorrent paragraph split
|
||||||
@ -125,6 +134,7 @@ public class ExtractMarkedContentTest {
|
|||||||
*/
|
*/
|
||||||
@Test
|
@Test
|
||||||
public void testExtractDailyReport() throws IOException {
|
public void testExtractDailyReport() throws IOException {
|
||||||
|
|
||||||
System.out.printf("\n\n===\n%s\n===\n", "Daily Report.pdf");
|
System.out.printf("\n\n===\n%s\n===\n", "Daily Report.pdf");
|
||||||
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
try (PDDocument document = Loader.loadPDF(new ClassPathResource("files/bdr/Drucksache_19_9865.pdf").getFile())) {
|
||||||
|
|
||||||
@ -147,10 +157,12 @@ public class ExtractMarkedContentTest {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* @see #testExtractTestWPhromma()
|
* @see #testExtractTestWPhromma()
|
||||||
*/
|
*/
|
||||||
void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) {
|
void showStructure(PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents) {
|
||||||
|
|
||||||
String structType = null;
|
String structType = null;
|
||||||
PDPage page = null;
|
PDPage page = null;
|
||||||
if (node instanceof PDStructureElement) {
|
if (node instanceof PDStructureElement) {
|
||||||
@ -166,7 +178,7 @@ public class ExtractMarkedContentTest {
|
|||||||
if (base instanceof COSDictionary) {
|
if (base instanceof COSDictionary) {
|
||||||
showStructure(PDStructureNode.create((COSDictionary) base), markedContents);
|
showStructure(PDStructureNode.create((COSDictionary) base), markedContents);
|
||||||
} else if (base instanceof COSNumber) {
|
} else if (base instanceof COSNumber) {
|
||||||
showContent(((COSNumber)base).intValue(), theseMarkedContents);
|
showContent(((COSNumber) base).intValue(), theseMarkedContents);
|
||||||
} else {
|
} else {
|
||||||
System.out.printf("?%s\n", base);
|
System.out.printf("?%s\n", base);
|
||||||
}
|
}
|
||||||
@ -174,7 +186,7 @@ public class ExtractMarkedContentTest {
|
|||||||
} else if (object instanceof PDStructureNode) {
|
} else if (object instanceof PDStructureNode) {
|
||||||
showStructure((PDStructureNode) object, markedContents);
|
showStructure((PDStructureNode) object, markedContents);
|
||||||
} else if (object instanceof Integer) {
|
} else if (object instanceof Integer) {
|
||||||
showContent((Integer)object, theseMarkedContents);
|
showContent((Integer) object, theseMarkedContents);
|
||||||
} else {
|
} else {
|
||||||
System.out.printf("?%s\n", object);
|
System.out.printf("?%s\n", object);
|
||||||
}
|
}
|
||||||
@ -183,21 +195,24 @@ public class ExtractMarkedContentTest {
|
|||||||
System.out.printf("</%s>\n", structType);
|
System.out.printf("</%s>\n", structType);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* @see #showStructure(PDStructureNode, Map)
|
* @see #showStructure(PDStructureNode, Map)
|
||||||
* @see #testExtractTestWPhromma()
|
* @see #testExtractTestWPhromma()
|
||||||
*/
|
*/
|
||||||
void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) {
|
void showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) {
|
||||||
|
|
||||||
PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
|
PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
|
||||||
List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
|
List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
|
||||||
StringBuilder textContent = new StringBuilder();
|
StringBuilder textContent = new StringBuilder();
|
||||||
for (Object object : contents) {
|
for (Object object : contents) {
|
||||||
if (object instanceof TextPosition) {
|
if (object instanceof TextPosition) {
|
||||||
textContent.append(((TextPosition)object).getUnicode());
|
textContent.append(((TextPosition) object).getUnicode());
|
||||||
} else {
|
} else {
|
||||||
textContent.append("?" + object);
|
textContent.append("?" + object);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
System.out.printf("%s\n", textContent);
|
System.out.printf("%s\n", textContent);
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
@ -2,31 +2,20 @@ package com.knecon.fforesight.service.layoutparser.server.graph;
|
|||||||
|
|
||||||
import java.io.FileOutputStream;
|
import java.io.FileOutputStream;
|
||||||
import java.nio.file.Path;
|
import java.nio.file.Path;
|
||||||
import java.util.List;
|
|
||||||
|
|
||||||
import org.apache.pdfbox.Loader;
|
import org.apache.pdfbox.Loader;
|
||||||
import org.apache.pdfbox.pdmodel.PDDocument;
|
import org.apache.pdfbox.pdmodel.PDDocument;
|
||||||
import org.junit.jupiter.api.Disabled;
|
|
||||||
import org.junit.jupiter.api.Test;
|
import org.junit.jupiter.api.Test;
|
||||||
import org.springframework.beans.factory.annotation.Autowired;
|
import org.springframework.beans.factory.annotation.Autowired;
|
||||||
import org.springframework.core.io.ClassPathResource;
|
import org.springframework.core.io.ClassPathResource;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentData;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.DocumentStructure;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.data.redaction.NodeType;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingType;
|
import com.knecon.fforesight.service.layoutparser.internal.api.queue.LayoutParsingType;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationDocument;
|
import com.knecon.fforesight.service.layoutparser.processor.model.ClassificationDocument;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Document;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.graph.nodes.Table;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.table.Cell;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.table.TablePageBlock;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.image.ImageServiceResponse;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
import com.knecon.fforesight.service.layoutparser.processor.python_api.model.table.TableServiceResponse;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.SectionsBuilderService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.classification.RedactManagerClassificationService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.classification.RedactManagerClassificationService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.factory.DocumentGraphFactory;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.mapper.DocumentDataMapper;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.mapper.PropertiesMapper;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.LayoutGridService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.LayoutGridService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.ViewerDocumentService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.visualization.ViewerDocumentService;
|
||||||
import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest;
|
import com.knecon.fforesight.service.layoutparser.server.utils.BuildDocumentTest;
|
||||||
@ -41,6 +30,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
|
|||||||
@Autowired
|
@Autowired
|
||||||
private RedactManagerClassificationService redactManagerClassificationService;
|
private RedactManagerClassificationService redactManagerClassificationService;
|
||||||
|
|
||||||
|
|
||||||
@Test
|
@Test
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
public void testViewerDocument() {
|
public void testViewerDocument() {
|
||||||
@ -55,6 +45,7 @@ public class ViewerDocumentTest extends BuildDocumentTest {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
public ClassificationDocument buildClassificationDocument(PDDocument originDocument) {
|
public ClassificationDocument buildClassificationDocument(PDDocument originDocument) {
|
||||||
|
|
||||||
ClassificationDocument classificationDocument = layoutParsingPipeline.parseLayout(LayoutParsingType.REDACT_MANAGER,
|
ClassificationDocument classificationDocument = layoutParsingPipeline.parseLayout(LayoutParsingType.REDACT_MANAGER,
|
||||||
|
|||||||
@ -9,7 +9,6 @@ import org.apache.pdfbox.util.Matrix;
|
|||||||
import org.junit.jupiter.api.Test;
|
import org.junit.jupiter.api.Test;
|
||||||
|
|
||||||
import com.fasterxml.jackson.databind.ObjectMapper;
|
import com.fasterxml.jackson.databind.ObjectMapper;
|
||||||
import com.iqser.red.storage.commons.properties.StorageProperties;
|
|
||||||
import com.iqser.red.storage.commons.service.ObjectSerializer;
|
import com.iqser.red.storage.commons.service.ObjectSerializer;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextDirection;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
|
|||||||
@ -13,8 +13,8 @@ import com.knecon.fforesight.service.layoutparser.processor.model.PageInformatio
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.services.DividingColumnDetectionService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.DividingColumnDetectionService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.GapDetectionService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.GapDetectionService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.GapsAcrossLinesService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.GapsAcrossLinesService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
||||||
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
||||||
|
|
||||||
import lombok.SneakyThrows;
|
import lombok.SneakyThrows;
|
||||||
@ -36,7 +36,8 @@ class GapAcrossLinesDetectionServiceTest {
|
|||||||
System.out.println("start column detection");
|
System.out.println("start column detection");
|
||||||
start = System.currentTimeMillis();
|
start = System.currentTimeMillis();
|
||||||
for (PageInformation pageInformation : pageInformations) {
|
for (PageInformation pageInformation : pageInformations) {
|
||||||
GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(), pageInformation.getMainBodyTextFrame());
|
GapInformation gapInformation = GapDetectionService.findGapsInLines(pageInformation.getPageContents().getSortedTextPositionSequences(),
|
||||||
|
pageInformation.getMainBodyTextFrame());
|
||||||
columnsPerPage.add(GapsAcrossLinesService.detectXGapsAcrossLines(gapInformation, pageInformation.getMainBodyTextFrame()));
|
columnsPerPage.add(GapsAcrossLinesService.detectXGapsAcrossLines(gapInformation, pageInformation.getMainBodyTextFrame()));
|
||||||
}
|
}
|
||||||
System.out.printf("Finished column detection in %d ms%n", System.currentTimeMillis() - start);
|
System.out.printf("Finished column detection in %d ms%n", System.currentTimeMillis() - start);
|
||||||
|
|||||||
@ -12,8 +12,8 @@ import org.junit.jupiter.api.Test;
|
|||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.InvisibleTableDetectionService;
|
import com.knecon.fforesight.service.layoutparser.processor.services.InvisibleTableDetectionService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
||||||
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
||||||
|
|
||||||
|
|||||||
@ -22,7 +22,6 @@ class MainBodyTextFrameExtractionServiceTest {
|
|||||||
String tmpFileName = Path.of("/tmp/").resolve(Path.of(fileName).getFileName() + "_MAIN_BODY.pdf").toString();
|
String tmpFileName = Path.of("/tmp/").resolve(Path.of(fileName).getFileName() + "_MAIN_BODY.pdf").toString();
|
||||||
List<PageContents> sortedTextPositionSequence = PageContentExtractor.getSortedPageContents(fileName);
|
List<PageContents> sortedTextPositionSequence = PageContentExtractor.getSortedPageContents(fileName);
|
||||||
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
@ -3,13 +3,12 @@ package com.knecon.fforesight.service.layoutparser.server.services;
|
|||||||
import java.nio.file.Path;
|
import java.nio.file.Path;
|
||||||
import java.util.List;
|
import java.util.List;
|
||||||
|
|
||||||
import org.junit.jupiter.api.Disabled;
|
|
||||||
import org.junit.jupiter.api.Test;
|
import org.junit.jupiter.api.Test;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageContents;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageContents;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
import com.knecon.fforesight.service.layoutparser.processor.model.text.TextPositionSequence;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.utils.RectangleTransformations;
|
||||||
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
||||||
|
|
||||||
import lombok.SneakyThrows;
|
import lombok.SneakyThrows;
|
||||||
@ -27,14 +26,11 @@ class PageContentExtractorTest {
|
|||||||
|
|
||||||
PdfDraw.drawRectanglesPerPageNumberedByLine(fileName,
|
PdfDraw.drawRectanglesPerPageNumberedByLine(fileName,
|
||||||
textPositionPerPage.stream()
|
textPositionPerPage.stream()
|
||||||
.map(t -> t.getSortedTextPositionSequences()
|
.map(t -> t.getSortedTextPositionSequences().stream().map(TextPositionSequence::getRectangle).map(RectangleTransformations::toRectangle2D)
|
||||||
.stream()
|
|
||||||
.map(TextPositionSequence::getRectangle)
|
|
||||||
.map(RectangleTransformations::toRectangle2D)
|
|
||||||
//.map(textPositionSequence -> (Rectangle2D) new Rectangle2D.Double(textPositionSequence.getMaxXDirAdj(), textPositionSequence.getMaxYDirAdj(), textPositionSequence.getWidth(), textPositionSequence.getHeight()))
|
//.map(textPositionSequence -> (Rectangle2D) new Rectangle2D.Double(textPositionSequence.getMaxXDirAdj(), textPositionSequence.getMaxYDirAdj(), textPositionSequence.getWidth(), textPositionSequence.getHeight()))
|
||||||
.map(List::of)
|
.map(List::of).toList())
|
||||||
.toList())
|
.toList(),
|
||||||
.toList(), tmpFileName);
|
tmpFileName);
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
@ -7,8 +7,8 @@ import org.junit.jupiter.api.Disabled;
|
|||||||
import org.junit.jupiter.api.Test;
|
import org.junit.jupiter.api.Test;
|
||||||
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
|
import com.knecon.fforesight.service.layoutparser.processor.model.PageInformation;
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
|
||||||
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageContentExtractor;
|
||||||
|
import com.knecon.fforesight.service.layoutparser.processor.services.PageInformationService;
|
||||||
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
import com.knecon.fforesight.service.layoutparser.server.utils.visualizations.PdfDraw;
|
||||||
|
|
||||||
import lombok.SneakyThrows;
|
import lombok.SneakyThrows;
|
||||||
@ -38,6 +38,7 @@ class PageInformationServiceTest {
|
|||||||
System.out.printf("Finished drawing rectangles in %d ms%n", System.currentTimeMillis() - start);
|
System.out.printf("Finished drawing rectangles in %d ms%n", System.currentTimeMillis() - start);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@Test
|
@Test
|
||||||
@Disabled
|
@Disabled
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
|
|||||||
@ -1,5 +1,25 @@
|
|||||||
package com.knecon.fforesight.service.layoutparser.server.utils;
|
package com.knecon.fforesight.service.layoutparser.server.utils;
|
||||||
|
|
||||||
|
import java.io.InputStream;
|
||||||
|
import java.util.Optional;
|
||||||
|
|
||||||
|
import org.junit.jupiter.api.AfterEach;
|
||||||
|
import org.junit.jupiter.api.BeforeEach;
|
||||||
|
import org.junit.jupiter.api.extension.ExtendWith;
|
||||||
|
import org.springframework.amqp.rabbit.core.RabbitTemplate;
|
||||||
|
import org.springframework.beans.factory.annotation.Autowired;
|
||||||
|
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
|
||||||
|
import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
|
||||||
|
import org.springframework.boot.test.context.SpringBootTest;
|
||||||
|
import org.springframework.boot.test.mock.mockito.MockBean;
|
||||||
|
import org.springframework.context.annotation.Bean;
|
||||||
|
import org.springframework.context.annotation.ComponentScan;
|
||||||
|
import org.springframework.context.annotation.Configuration;
|
||||||
|
import org.springframework.context.annotation.Import;
|
||||||
|
import org.springframework.context.annotation.Primary;
|
||||||
|
import org.springframework.core.io.ClassPathResource;
|
||||||
|
import org.springframework.test.context.junit.jupiter.SpringExtension;
|
||||||
|
|
||||||
import com.iqser.red.commons.jackson.ObjectMapperFactory;
|
import com.iqser.red.commons.jackson.ObjectMapperFactory;
|
||||||
import com.iqser.red.storage.commons.service.StorageService;
|
import com.iqser.red.storage.commons.service.StorageService;
|
||||||
import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService;
|
import com.iqser.red.storage.commons.utils.FileSystemBackedStorageService;
|
||||||
@ -9,22 +29,8 @@ import com.knecon.fforesight.service.layoutparser.processor.LayoutParsingStorage
|
|||||||
import com.knecon.fforesight.service.layoutparser.server.Application;
|
import com.knecon.fforesight.service.layoutparser.server.Application;
|
||||||
import com.knecon.fforesight.tenantcommons.TenantContext;
|
import com.knecon.fforesight.tenantcommons.TenantContext;
|
||||||
import com.knecon.fforesight.tenantcommons.TenantsClient;
|
import com.knecon.fforesight.tenantcommons.TenantsClient;
|
||||||
import lombok.SneakyThrows;
|
|
||||||
import org.junit.jupiter.api.AfterEach;
|
|
||||||
import org.junit.jupiter.api.BeforeEach;
|
|
||||||
import org.junit.jupiter.api.extension.ExtendWith;
|
|
||||||
import org.springframework.amqp.rabbit.core.RabbitTemplate;
|
|
||||||
import org.springframework.beans.factory.annotation.Autowired;
|
|
||||||
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
|
|
||||||
import org.springframework.boot.autoconfigure.amqp.RabbitAutoConfiguration;
|
|
||||||
import org.springframework.boot.test.context.SpringBootTest;
|
|
||||||
import org.springframework.boot.test.mock.mockito.MockBean;
|
|
||||||
import org.springframework.context.annotation.*;
|
|
||||||
import org.springframework.core.io.ClassPathResource;
|
|
||||||
import org.springframework.test.context.junit.jupiter.SpringExtension;
|
|
||||||
|
|
||||||
import java.io.InputStream;
|
import lombok.SneakyThrows;
|
||||||
import java.util.Optional;
|
|
||||||
|
|
||||||
@ExtendWith(SpringExtension.class)
|
@ExtendWith(SpringExtension.class)
|
||||||
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
|
@SpringBootTest(classes = Application.class, webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
|
||||||
@ -100,6 +106,7 @@ public abstract class AbstractTest {
|
|||||||
return buildDefaultLayoutParsingRequest(LayoutParsingType.REDACT_MANAGER);
|
return buildDefaultLayoutParsingRequest(LayoutParsingType.REDACT_MANAGER);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
protected LayoutParsingRequest buildDefaultLayoutParsingRequest(LayoutParsingType layoutParsingType) {
|
protected LayoutParsingRequest buildDefaultLayoutParsingRequest(LayoutParsingType layoutParsingType) {
|
||||||
|
|
||||||
return LayoutParsingRequest.builder()
|
return LayoutParsingRequest.builder()
|
||||||
@ -116,6 +123,7 @@ public abstract class AbstractTest {
|
|||||||
.build();
|
.build();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@SneakyThrows
|
@SneakyThrows
|
||||||
protected LayoutParsingRequest prepareStorage(String file, String cvServiceResponseFile, String imageInfoFile) {
|
protected LayoutParsingRequest prepareStorage(String file, String cvServiceResponseFile, String imageInfoFile) {
|
||||||
|
|
||||||
@ -152,7 +160,6 @@ public abstract class AbstractTest {
|
|||||||
@ComponentScan("com.knecon.fforesight.service.layoutparser")
|
@ComponentScan("com.knecon.fforesight.service.layoutparser")
|
||||||
public static class TestConfiguration {
|
public static class TestConfiguration {
|
||||||
|
|
||||||
|
|
||||||
@Bean
|
@Bean
|
||||||
@Primary
|
@Primary
|
||||||
public StorageService inmemoryStorage() {
|
public StorageService inmemoryStorage() {
|
||||||
|
|||||||
File diff suppressed because one or more lines are too long