PDF Layout Parser Micro-Service: layout-parser
Introduction
The layout-parser micro-service extracts structured information from PDF documents. It is written in Java and uses Spring Boot 3, Apache PDFBox, and RabbitMQ to parse PDFs and organize their content into a coherent layout structure. Notably, its core pipeline relies solely on algorithmic techniques rather than machine learning.
Key Steps in the PDF Layout Parsing Process:
- Text Position Extraction:
  The micro-service leverages Apache PDFBox to extract precise text positions for each individual character within the PDF document.
- Word Segmentation and Text Block Formation:
  Employing an array of diverse algorithms, the micro-service initially identifies and segments words, creating distinct text blocks.
- Text Block Classification:
  The segmented text blocks are then subjected to classification algorithms. These algorithms categorize the text blocks based on their content and visual properties, distinguishing between sections, subsections, headlines, paragraphs, images, tables, table cells, headers, and footers.
- Layout Coherence Establishment:
  The classified text blocks are subsequently orchestrated into a cohesive layout structure. This process involves arranging sections, subsections, paragraphs, images, and other elements in a logical and structured manner.
- Output Generation in Various Formats:
  Once the layout structure is established, the micro-service generates output in multiple formats. These formats are designed for seamless integration with downstream micro-services. The supported formats include JSON, XML, and others, ensuring flexibility in downstream data consumption.
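The word segmentation and classification steps above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the service's actual implementation: the `Glyph`, `Word`, and `TextBlock` records, the gap threshold, and the font-size headline rule are all hypothetical. In the service the character positions come from Apache PDFBox; here they are plain values so the sketch runs without dependencies.

```java
import java.util.ArrayList;
import java.util.List;

public class LayoutSketch {

    /** Hypothetical stand-in for one extracted character and its position. */
    public record Glyph(String text, float x, float y, float width, float fontSize) {}

    /** Hypothetical word: joined glyph text plus position and size of its first glyph. */
    public record Word(String text, float x, float y, float fontSize) {}

    /** Hypothetical text block, reduced to the two properties used below. */
    public record TextBlock(String text, float fontSize) {}

    public enum BlockType { HEADLINE, PARAGRAPH }

    /**
     * Groups glyphs (assumed to be in reading order) into words. A word ends at a
     * whitespace glyph, at a baseline change, or when the horizontal gap to the
     * previous glyph exceeds gapFactor * fontSize (an assumed threshold).
     */
    public static List<Word> segmentWords(List<Glyph> glyphs, float gapFactor) {
        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph first = null;     // first glyph of the word being built
        Glyph previous = null;  // last non-whitespace glyph seen
        for (Glyph g : glyphs) {
            boolean isSpace = g.text().isBlank();
            boolean breaks = previous != null
                    && (Math.abs(g.y() - previous.y()) > 0.5f * previous.fontSize()
                        || g.x() - (previous.x() + previous.width())
                               > gapFactor * previous.fontSize());
            if ((isSpace || breaks) && current.length() > 0) {
                words.add(new Word(current.toString(), first.x(), first.y(), first.fontSize()));
                current.setLength(0);
                first = null;
            }
            if (!isSpace) {
                if (first == null) {
                    first = g;
                }
                current.append(g.text());
                previous = g;
            }
        }
        if (current.length() > 0) {
            words.add(new Word(current.toString(), first.x(), first.y(), first.fontSize()));
        }
        return words;
    }

    /** Assumed rule: short blocks set clearly larger than body text are headlines. */
    public static BlockType classify(TextBlock block, float bodyFontSize) {
        boolean larger = block.fontSize() > 1.2f * bodyFontSize;
        boolean shortText = block.text().length() <= 80;
        return larger && shortText ? BlockType.HEADLINE : BlockType.PARAGRAPH;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("H", 0f, 100f, 6f, 10f),
                new Glyph("i", 6f, 100f, 3f, 10f),
                new Glyph("y", 15f, 100f, 5f, 10f),
                new Glyph("o", 20f, 100f, 5f, 10f));
        for (Word w : segmentWords(glyphs, 0.25f)) {
            System.out.println(w.text());
        }
        System.out.println(classify(new TextBlock("Introduction", 14f), 10f));
    }
}
```

A real classifier would likely weigh more signals than font size (position on the page, numbering, font weight); the single rule here only shows where such a decision would sit in the pipeline.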
Optional Enhancements:
- ML-Based Table Extraction:
  For enhanced results, users have the option to incorporate machine learning-based table extraction. This feature can be activated by providing ML-generated results as a JSON file, which are then integrated seamlessly into the layout structure.
- Image Classification using ML:
  Additionally, for more accurate image classification, users can optionally feed ML-generated image classification results into the micro-service. Similar to the table extraction option, the micro-service processes the pre-parsed results in JSON format, thus optimizing the accuracy of image content identification.
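One way such pre-computed table results might be folded into the layout is sketched below. The JSON schema used by the service is not described here, so the sketch assumes the file has already been deserialized into simple records; `Box`, `TextBlock`, `MlTable`, `Labeled`, and the `TABLE_CELL`/`TEXT` labels are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class MlTableMergeSketch {

    /** Hypothetical axis-aligned bounding box in page coordinates. */
    public record Box(float x, float y, float width, float height) {
        /** True if the other box lies entirely inside this one. */
        public boolean contains(Box other) {
            return other.x() >= x && other.y() >= y
                    && other.x() + other.width() <= x + width
                    && other.y() + other.height() <= y + height;
        }
    }

    /** Hypothetical text block produced by the algorithmic pipeline. */
    public record TextBlock(String text, int page, Box box) {}

    /** Hypothetical table region taken from the ML-generated results. */
    public record MlTable(int page, Box box) {}

    /** A text block together with the label assigned during merging. */
    public record Labeled(TextBlock block, String label) {}

    /**
     * Labels each block as TABLE_CELL when it lies fully inside an ML-detected
     * table region on the same page, and as TEXT otherwise.
     */
    public static List<Labeled> label(List<TextBlock> blocks, List<MlTable> tables) {
        List<Labeled> result = new ArrayList<>();
        for (TextBlock block : blocks) {
            boolean inTable = tables.stream()
                    .anyMatch(t -> t.page() == block.page() && t.box().contains(block.box()));
            result.add(new Labeled(block, inTable ? "TABLE_CELL" : "TEXT"));
        }
        return result;
    }
}
```

Full containment is a deliberately strict criterion; an overlap ratio would tolerate slightly misaligned regions, at the cost of an extra threshold to tune.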
In conclusion, the layout-parser micro-service is a versatile PDF layout parsing solution crafted entirely around advanced algorithms, without reliance on machine learning. It proficiently extracts text positions, segments content into meaningful blocks, classifies these blocks, arranges them coherently, and outputs structured data for downstream micro-services. Optional integration with ML-generated table extractions and image classifications further enhances its capabilities.
Installation
Prerequisites
Before building and using the layout-parser micro-service, please ensure you have the following software and tools installed:
- Java Development Kit (JDK) 17 or later
- Gradle build tool (preinstalled)
Build and Test
To build and test the micro-service, follow these steps:
Clone the Repository:
git clone ssh://git@git.knecon.com:22222/fforesight/layout-parser.git
cd layout-parser
Build the Project:
Use the following command to build the project using Gradle:
gradle clean build
Run Tests:
Run the test suite using the following command:
gradle test
Building a Custom Docker Image
To create a custom Docker image for the layout-parser micro-service, follow these steps:
Ensure Docker is Installed:
Ensure that Docker is installed and running on your system.
Run the Image Building Script:
Execute the publish-custom-image script in the project directory:
./publish-custom-image
Publishing to Internal Maven Repository
To publish the layout-parser micro-service to your internal Maven repository, execute the following command:
gradle -Pversion=buildVersion publish
Replace buildVersion with the desired version number.
Additional Notes
Make sure to configure any necessary application properties before deploying the micro-service. For advanced usage and configuration, consult the source code first, or ask Kilian or Dom.