Merge in RR/vidocp from poly_to_rects_segmentation to master
Squashed commit of the following:
commit 3dffe067ef0bb4796eab22007eb6970b29f47822
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 16:10:28 2022 +0100
readme updated
commit 448517205259134a8427b48d86d0d5331b726487
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 16:09:35 2022 +0100
restructured dirs
commit 058c2971631c71d520b1a94ea75e249f9234ad87
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 15:57:08 2022 +0100
renaming
commit 4e64a3d07f1dad76775955639157ec7b60e6ad38
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 15:46:03 2022 +0100
readme updated
commit 728bedb13a2769b4652fd674ef26988efebcc7dc
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 15:33:42 2022 +0100
added DVC
commit e2d5594afd6683d8207007d3a85d178dd0a3e546
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 14:49:09 2022 +0100
renaming
Vidocp
This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.
Installation
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
Usage
As an API
The module provides functions for the individual tasks. Each function returns some kind of collection of points, depending on the specific task. The following example finds the outlines of previous redactions:
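from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np

pdf_path = ...
page_index = ...  # note: pdf2image counts pages starting from 1

# Render only the requested page and take the single resulting PIL image.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it on.
page = np.array(page)

redaction_contours = find_redactions(page)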
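The returned point collections can be reduced to axis-aligned rectangles when only the extent of each region matters. The following is a minimal sketch, not part of vidocp. It assumes each element of the result is an array-like of (x, y) points, either of shape (N, 2) or in the OpenCV contour layout (N, 1, 2). The README does not specify the exact return type, and the helper names `contour_to_bounding_box` and `contours_to_bounding_boxes` are hypothetical.

```python
import numpy as np


def contour_to_bounding_box(contour):
    """Return (x_min, y_min, x_max, y_max) for one collection of (x, y) points.

    Assumption: `contour` is array-like with shape (N, 2) or (N, 1, 2).
    """
    # Flatten both accepted layouts to a plain (N, 2) array of points.
    points = np.asarray(contour).reshape(-1, 2)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return int(x_min), int(y_min), int(x_max), int(y_max)


def contours_to_bounding_boxes(contours):
    """Apply contour_to_bounding_box to every contour in an iterable."""
    return [contour_to_bounding_box(contour) for contour in contours]
```

If the assumption about the return type holds, `contours_to_bounding_boxes(redaction_contours)` would give one rectangle per detected redaction.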
Example outputs from demo script:
Table parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py <path to pdf> 1 --type table
Detect redactions
The redaction detection utility detects previous redactions in PDFs (black filled rectangles).
python scripts/annotate.py <path to pdf> 0 --type redaction
The image below shows the detected redactions with green outlines.
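To illustrate what detecting black filled rectangles can involve, here is a minimal NumPy-only sketch. It is not the algorithm used by vidocp, which this README does not describe. The sketch thresholds a grayscale page, groups dark pixels into 4-connected regions with a flood fill, and keeps regions that are large enough and fill their bounding box almost completely. The function name `find_dark_rectangles` and the parameters `dark_threshold`, `min_area`, and `min_fill` are assumptions made for this example.

```python
from collections import deque

import numpy as np


def find_dark_rectangles(gray, dark_threshold=50, min_area=100, min_fill=0.9):
    """Return (x_min, y_min, x_max, y_max) boxes of dark, nearly solid regions.

    gray: 2D array of intensities (0 = black, 255 = white).
    All thresholds are illustrative defaults, not values taken from vidocp.
    """
    mask = np.asarray(gray) < dark_threshold
    visited = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    boxes = []
    for y0, x0 in zip(*np.nonzero(mask)):
        if visited[y0, x0]:
            continue
        # Breadth-first flood fill over 4-connected dark pixels.
        queue = deque([(y0, x0)])
        visited[y0, x0] = True
        count = 0
        y_min = y_max = y0
        x_min = x_max = x0
        while queue:
            y, x = queue.popleft()
            count += 1
            y_min, y_max = min(y_min, y), max(y_max, y)
            x_min, x_max = min(x_min, x), max(x_max, x)
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and mask[ny, nx] and not visited[ny, nx]:
                    visited[ny, nx] = True
                    queue.append((ny, nx))
        # Keep regions that are large and almost fill their bounding box.
        box_area = (y_max - y_min + 1) * (x_max - x_min + 1)
        if count >= min_area and count / box_area >= min_fill:
            boxes.append((int(x_min), int(y_min), int(x_max), int(y_max)))
    return boxes
```

An RGB page array could be reduced to grayscale first, for example with `page.mean(axis=2)`. The pure-Python flood fill is slow on full-resolution pages and is meant only to show the idea.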
