Merge in RR/vidocp from layout_detection_version_3 to master
Squashed commit of the following:
commit 262b1c14c0b8b164221d39fd286b20914d1a8e6a
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 22:56:10 2022 +0100
comment
commit 975dcdaae2b0e9bfcb075fe1c87adc48175c0d93
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 22:50:41 2022 +0100
applied black
commit 49ba3b5f318a1b5d6bb39c0b53de5e237a87da96
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 22:48:44 2022 +0100
improved layout parsing logic: filtering of included rects
commit d78ac24c10793f72b569c3c827834400b730888a
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 22:36:49 2022 +0100
improved layout parsing logic: filtering of overlaps, no sub-text regions
Vidocp — Visual Document Parsing
This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.
Installation
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides functions for the individual tasks. Each function returns some kind of collection of points, depending on the specific task.
Redaction Detection
The snippet below shows how to find the outlines of previous redactions.
from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np
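pdf_path = ...
page_index = ...  # pdf2image counts pages starting from 1
# convert_from_path returns a list of PIL images; take the single requested page
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# convert the PIL image to a NumPy array before passing it on
page = np.array(page)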
redaction_contours = find_redactions(page)
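The returned outlines can be reduced to axis-aligned bounding boxes for further processing. The following is a minimal sketch that assumes each contour is an iterable of `(x, y)` points; the exact return type of `find_redactions` is not specified here, and the helper names `bounding_box` and `bounding_boxes` as well as the example data are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def bounding_box(contour: Iterable[Point]) -> Box:
    """Return the axis-aligned bounding box of a single contour."""
    xs, ys = zip(*contour)
    return min(xs), min(ys), max(xs), max(ys)


def bounding_boxes(contours: Iterable[Iterable[Point]]) -> List[Box]:
    """Return one bounding box per contour, in input order."""
    return [bounding_box(contour) for contour in contours]


# Hypothetical stand-in for the output of find_redactions(page):
# two rectangular outlines, each given by its four corner points.
example_contours = [
    [(10, 20), (110, 20), (110, 40), (10, 40)],
    [(30, 100), (90, 100), (90, 130), (30, 130)],
]
boxes = bounding_boxes(example_contours)
```

If the function returns OpenCV-style contour arrays of shape `(N, 1, 2)`, they would need to be reshaped to `(N, 2)` (for example with `contour.reshape(-1, 2)`) before being passed to this sketch.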
As a CLI Tool
Core API functionality is also available through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The below image shows a parsed table, where each table cell has been detected individually.
Redaction Detection
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The below image shows the detected redactions with green outlines.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The below image shows the detected layout elements on a page.
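Layout detection can produce rectangles that lie entirely inside other rectangles, and filtering such included rectangles is one common cleanup step. The following is a minimal sketch of that idea, not the repository's implementation: it assumes rectangles are `(x0, y0, x1, y1)` tuples, and the names `contains` and `drop_included` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x0 <= x1 and y0 <= y1


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies completely within `outer` and the two differ."""
    return (
        outer != inner
        and outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def drop_included(rects: List[Rect]) -> List[Rect]:
    """Remove exact duplicates and rectangles enclosed by another rectangle."""
    unique = list(dict.fromkeys(rects))  # de-duplicate while keeping order
    return [r for r in unique if not any(contains(o, r) for o in unique)]


# Illustrative data: the second rectangle is enclosed by the first,
# the third only partially overlaps it and is therefore kept.
rects = [(0, 0, 100, 100), (10, 10, 50, 50), (90, 90, 120, 120)]
filtered = drop_included(rects)
```

This pairwise check is quadratic in the number of rectangles, which is usually acceptable for the number of layout elements on a single page.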


