Matthias Bisping bb5707dc89 Pull request #6 : added layout parsing logic

Merge in RR/vidocp from layout_detection_version_2 to master

Squashed commit of the following:

commit d443e95ad8143bed3efc74d9e38640498d8d16bf
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sat Feb 5 20:16:13 2022 +0100

    readme updated

commit 953ad696932454ce851544ed016f9e64bcc12080
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sat Feb 5 20:14:59 2022 +0100

    added layot parsing logic

2022-02-05 20:17:14 +01:00

README.md

Vidocp — Visual Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides functions for the individual tasks. Each returns some kind of collection of points, depending on the specific task.

Redaction Detection

The snippet below shows how to find the outlines of previous redactions.


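from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np


pdf_path = ...    # path to the PDF file
page_index = ...  # page to load; pdf2image numbers pages starting at 1

# Render only the requested page and convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

# Outlines of the redactions found on the page.
redaction_contours = find_redactions(page)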
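
The API functions are described as returning collections of points, and a common next step is to reduce each outline to a bounding box. The following is a minimal sketch, assuming each outline can be iterated as (x, y) pairs. `bounding_box` and the sample `outline` are illustrative names and are not part of vidocp. If the function returns OpenCV-style contour arrays of shape (N, 1, 2), they would need reshaping to (N, 2) first.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def bounding_box(points: Iterable[Point]) -> Box:
    """Return the axis-aligned bounding box of a collection of (x, y) points."""
    xs, ys = zip(*points)  # raises ValueError if the collection is empty
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical outline of one redaction, given as its four corner points.
outline = [(40, 100), (220, 100), (220, 130), (40, 130)]
box = bounding_box(outline)
```

The resulting box can be used for cropping or for comparing detections across pages.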

As a CLI Tool

Core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The image below shows a parsed table, where each table cell has been detected individually.
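
To illustrate what segmenting a table into cells means, here is a simplified sketch. It assumes a fully ruled, regular grid whose vertical and horizontal line positions are already known. It is not the method used by vidocp, and `cells_from_grid_lines` is a hypothetical helper.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def cells_from_grid_lines(xs: List[int], ys: List[int]) -> List[Box]:
    """Build one box per cell from vertical (xs) and horizontal (ys) line positions."""
    xs, ys = sorted(xs), sorted(ys)
    return [
        (left, top, right, bottom)
        for top, bottom in zip(ys, ys[1:])   # consecutive horizontal lines bound a row
        for left, right in zip(xs, xs[1:])   # consecutive vertical lines bound a column
    ]


# A 2 x 2 table: three vertical and three horizontal ruling lines.
cells = cells_from_grid_lines(xs=[10, 110, 210], ys=[20, 50, 80])
```

Cells are listed row by row, from left to right. Real tables with merged cells or missing rulings need more than this.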

Redaction Detection

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

python scripts/annotate.py <path to pdf> 0 --type redaction

The image below shows the detected redactions with green outlines.
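
Since redactions are described as filled black rectangles, one way to think about detecting them is to look for connected dark regions that almost completely fill their own bounding box. The sketch below shows that idea in plain Python on a small grayscale grid. It is an illustration only, not the vidocp implementation; the function name and the `threshold`, `min_fill`, and `min_area` parameters are assumptions.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def find_dark_rectangles(gray: List[List[int]], threshold: int = 50,
                         min_fill: float = 0.95, min_area: int = 4) -> List[Box]:
    """Find connected dark regions that almost completely fill their bounding box."""
    height, width = len(gray), len(gray[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or gray[y0][x0] > threshold:
                continue
            # Flood-fill one 4-connected component of dark pixels.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            area = 0
            while queue:
                x, y = queue.popleft()
                area += 1
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and gray[ny][nx] <= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # A filled rectangle covers (nearly) all of its bounding box.
            box_area = (x_max - x_min + 1) * (y_max - y_min + 1)
            if area >= min_area and area / box_area >= min_fill:
                boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy page: white background, one filled 4 x 2 block, one L-shaped stroke.
page = [[255] * 8 for _ in range(6)]
for y in range(1, 3):
    for x in range(1, 5):
        page[y][x] = 0
for x, y in [(6, 1), (6, 2), (6, 3), (6, 4), (5, 4), (4, 4)]:
    page[y][x] = 0

rectangles = find_dark_rectangles(page)
```

The fill-ratio test is what separates solid blocks from dark strokes such as text or lines: the L-shaped stroke covers only half of its bounding box and is rejected.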

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The image below shows the detected layout elements on a page.
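
The three CLI invocations above share one shape: a PDF path, a page number, and a `--type` option. The sketch below shows how such an interface could be declared with `argparse`. It is a hypothetical reconstruction from the commands in this README, not the actual contents of `scripts/annotate.py`. The argument names and the choice to make `--type` required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a CLI shaped like the annotate.py invocations shown in this README."""
    parser = argparse.ArgumentParser(description="Annotate one page of a PDF.")
    parser.add_argument("pdf_path", help="path to the PDF file")
    parser.add_argument("page_index", type=int, help="page to annotate")
    parser.add_argument(
        "--type",
        choices=["table", "redaction", "layout"],
        required=True,  # assumption: the real script may define a default instead
        help="kind of visual feature to detect",
    )
    return parser


# Parse the table parsing example from this README.
args = build_parser().parse_args(["data/test_pdf.pdf", "7", "--type", "table"])
```

Restricting `--type` with `choices` makes `argparse` reject unsupported annotation types before any PDF is loaded.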
