Merge in RR/vidocp from table_parsing_version_2 to master
Squashed commit of the following:
commit af136ca10cf96f99699e409000ff598ce90c192e
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:34:01 2022 +0100
readme updated
commit 13ca7b1b03cb2bf7b3c8ef5821c1f8fa9ec532a0
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:32:11 2022 +0100
drawing color standardized
commit 654e961c62ddc0f512074e8238d7fa88f0ea227e
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:22:57 2022 +0100
refactoring
commit 964c17a36f7bbc1376dfe68f4ea90462d676e215
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:07:16 2022 +0100
readme updated
commit 4470969b35bb76e68cc41947fa02e63100b30ce9
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:05:35 2022 +0100
readme updated
commit a6c6bdb1e71a778a3c21a628cfb30acc5bc6086f
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:05:21 2022 +0100
readme updated
commit e178793dd69b720adefe7533312314e4c405f975
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 18:03:45 2022 +0100
readme updated
commit 443163864bab56930c2ef735c0aaafddd2561ead
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 17:59:03 2022 +0100
implemented clean solution for parsing open tables. still needs final refactoring.
Vidocp — Visual Document Parsing
This repository implements computer-vision-based approaches for detecting and parsing visual features, such as tables or previous redactions, in documents.
Installation
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides functions for the individual tasks. Each function returns some kind of collection of points, depending on the specific task.
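Because the results are collections of points, a common follow-up step is reducing each one to an axis-aligned bounding box. The sketch below is illustrative only: `bounding_box` is a hypothetical helper, not part of vidocp, and it assumes each result can be flattened into plain `(x, y)` pairs. The actual return types of the vidocp functions may differ.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_box(points: Iterable[Point]) -> Box:
    """Return the axis-aligned bounding box of a collection of (x, y) points.

    Hypothetical helper for illustration; not part of vidocp.
    """
    pts = list(points)
    if not pts:
        raise ValueError("cannot compute a bounding box of zero points")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up outline of one detected region, given as its corner points.
outline = [(10, 20), (110, 20), (110, 45), (10, 45)]
print(bounding_box(outline))  # (10, 20, 110, 45)
```

If a function returns NumPy arrays rather than tuples, the points would first need to be reshaped into `(x, y)` pairs.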
Redaction Detection
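The snippet below shows how to find the outlines of previous redactions.
from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np

pdf_path = ...
page_index = ...  # note: pdf2image counts pages starting at 1

# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to find_redactions.
page = np.array(page)

redaction_contours = find_redactions(page)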
As a CLI Tool
Core API functionality is also available through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The below image shows a parsed table, where each table cell has been detected individually.
Redaction Detection
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
python scripts/annotate.py <path to pdf> 0 --type redaction
The below image shows the detected redactions with green outlines.
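To annotate several pages in one go, the CLI invocations shown above can be assembled programmatically. This is a minimal sketch: `annotate_command` is a hypothetical helper, and it assumes the script is run from the repository root with only the positional PDF path, the page number, and the `--type` option that appear in the commands above.

```python
import sys
from typing import List

# The two annotation types used in the commands above.
ANNOTATION_TYPES = ("table", "redaction")


def annotate_command(pdf_path: str, page: int, kind: str) -> List[str]:
    """Build the argument list for one scripts/annotate.py call.

    Hypothetical helper for illustration; not part of vidocp.
    """
    if kind not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {kind!r}")
    if page < 0:
        raise ValueError("page must be non-negative")
    return [sys.executable, "scripts/annotate.py", pdf_path, str(page), "--type", kind]


# Build the commands for the first three pages of the example PDF.
commands = [annotate_command("data/test_pdf.pdf", page, "table") for page in range(3)]
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can then be passed to `subprocess.run(cmd, check=True)`. Using the list form avoids shell quoting problems with paths that contain spaces.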