# PDF Layout Parser Micro-Service: layout-parser

## Introduction

The layout-parser micro-service extracts structured information from PDF documents. Written in Java and built on Spring Boot 3, Apache PDFBox, and RabbitMQ, it parses PDFs and organizes their content into a coherent layout structure. Notably, it relies solely on algorithmic techniques rather than machine learning.

### Key Steps in the PDF Layout Parsing Process:

* **Text Position Extraction:**
  The micro-service uses Apache PDFBox to extract the precise position of each individual character in the PDF document.

* **Word Segmentation and Text Block Formation:**
  Using several algorithms, the micro-service first groups the characters into words and then combines the words into distinct text blocks.

* **Text Block Classification:**
  The text blocks are then classified based on their content and visual properties, distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.

* **Layout Coherence Establishment:**
  The classified text blocks are then assembled into a cohesive layout structure, with sections, subsections, paragraphs, images, and other elements arranged in a logical order.

* **Output Generation in Various Formats:**
  Once the layout structure is established, the micro-service generates output in multiple formats, including JSON and XML, for integration with downstream micro-services.

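As an illustration of the word segmentation step, the following is a minimal Java sketch, not the service's actual implementation. It assumes each character arrives with a left edge and a width, as Apache PDFBox text positions provide, and starts a new word whenever the horizontal gap to the previous character exceeds a fraction of that character's width. The names `WordSegmenter`, `Glyph`, and `gapFactor` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of gap-based word segmentation; not the service's real classes. */
final class WordSegmenter {

    /** One character with its left edge and width on a single text line (page units). */
    record Glyph(char ch, double x, double width) {}

    /**
     * Groups glyphs (assumed to be sorted left to right on one line) into words.
     * A new word starts when the gap to the previous glyph is larger than
     * gapFactor times the previous glyph's width.
     */
    static List<String> segment(List<Glyph> glyphs, double gapFactor) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph glyph : glyphs) {
            if (previous != null) {
                double gap = glyph.x() - (previous.x() + previous.width());
                if (gap > gapFactor * previous.width() && current.length() > 0) {
                    // Gap is wide enough: close the current word.
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(glyph.ch());
            previous = glyph;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

Calling `WordSegmenter.segment(glyphs, 0.5)` on a line whose characters spell `Hi` and `you` with a wide gap between them would return the two words as separate entries. A real implementation would also have to deal with line breaks, rotated text, and varying font sizes.
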
### Optional Enhancements:

* **ML-Based Table Extraction:**
  For better results, users can optionally supply machine learning-based table extraction results as a JSON file. These results are then integrated into the layout structure.

* **Image Classification using ML:**
  For more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. As with the table extraction option, the micro-service reads the pre-computed results in JSON format and uses them to improve the identification of image content.

In summary, the layout-parser micro-service parses PDF layouts algorithmically, without relying on machine learning. It extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream micro-services. Optional integration of ML-generated table extraction and image classification results can further improve its output.

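To make the classification and arrangement steps more concrete, here is a small Java sketch under stated assumptions. It is not taken from the service's source code: the rules (font size relative to the body text, position near the top or bottom of the page), the thresholds, and the names `BlockClassifier`, `TextBlock`, `BlockType`, and `Section` are all hypothetical, and only four of the block types listed earlier are covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of rule-based block classification and section building. */
final class BlockClassifier {

    enum BlockType { HEADLINE, PARAGRAPH, HEADER, FOOTER }

    /** A text block with its font size and vertical position (0 = top of page, 1 = bottom). */
    record TextBlock(String text, double fontSize, double relativeY) {}

    /** A section: one headline plus the paragraphs that follow it. */
    record Section(String headline, List<String> paragraphs) {}

    /** Classifies a block using assumed thresholds for page margins and font size. */
    static BlockType classify(TextBlock block, double bodyFontSize) {
        if (block.relativeY() < 0.05) return BlockType.HEADER;
        if (block.relativeY() > 0.95) return BlockType.FOOTER;
        if (block.fontSize() >= 1.2 * bodyFontSize) return BlockType.HEADLINE;
        return BlockType.PARAGRAPH;
    }

    /** Arranges blocks (assumed to be in reading order) into sections. */
    static List<Section> buildSections(List<TextBlock> blocks, double bodyFontSize) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (TextBlock block : blocks) {
            switch (classify(block, bodyFontSize)) {
                case HEADLINE -> {
                    // Each headline opens a new section.
                    current = new Section(block.text(), new ArrayList<>());
                    sections.add(current);
                }
                case PARAGRAPH -> {
                    if (current == null) {
                        // Paragraphs before any headline go into an untitled section.
                        current = new Section("", new ArrayList<>());
                        sections.add(current);
                    }
                    current.paragraphs().add(block.text());
                }
                case HEADER, FOOTER -> {
                    // Page headers and footers are left out of the section tree here.
                }
            }
        }
        return sections;
    }
}
```

With a body font size of 10, a block set in size 16 would be treated as a headline and the size 10 blocks after it as its paragraphs. The real service distinguishes many more block types and nests subsections, which this sketch does not attempt.
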
## Installation

### Prerequisites

Before building and using the layout-parser micro-service, please ensure you have the following software and tools installed:

* Java Development Kit (JDK) 17 or later
* Gradle build tool (preinstalled)

## Build and Test

To build and test the micro-service, follow these steps:

### Clone the Repository:

```bash
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
cd layout-parser
```

### Build the Project:

Use the following command to build the project using Gradle:

```
gradle clean build
```

### Run Tests:

Run the test suite using the following command:

```
gradle test
```

## Building a Custom Docker Image

To create a custom Docker image for the layout-parser micro-service, follow these steps:

### Ensure Docker is Installed:

Ensure that Docker is installed and running on your system.

### Run the Image Building Script:

Execute the `publish-custom-image` script in the project directory:

```
./publish-custom-image
```

## Publishing to Internal Maven Repository

To publish the layout-parser micro-service to your internal Maven repository, execute the following command:

```
gradle -Pversion=buildVersion publish
```

Replace `buildVersion` with the desired version number.

## Additional Notes

* Make sure to configure any necessary application properties before deploying the micro-service.
* For advanced usage and configuration, refer to Kilian or Dom or, preferably, the source code.