Matthias Bisping c9b2f6bf29 Pull request #9 : Refactoring

Merge in RR/vidocp from refactoring to master

Squashed commit of the following:

commit 36a62a13e51148d2420cb12930e84d78629db6b0
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:54:53 2022 +0100

    refactoring

commit e652da1fa88a048f9a5211b4e8c0b96074fb5849
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:53:17 2022 +0100

    refactoring

commit d9567da428c81f9cd7971a657281df0a90166810
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:47:18 2022 +0100

    refactoring

commit 9d30009dceec0357db6499bfaffae8ce97718ee0
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:45:53 2022 +0100

    refactoring

commit e8863d67aaaff138fb088c4e496a91b6354cc059
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:42:45 2022 +0100

    refactoring

commit 89a99d3586db4fbafa743a45bdd02eaf0c1f341f
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:39:49 2022 +0100

    refactoring

commit aa66b6865b00b0490b9e7695a6bae386e6f96723
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:31:21 2022 +0100

    refactoring

commit 98d77cb522a08821c3a13ae2cffbe7239c654762
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:27:55 2022 +0100

    refactoring

commit fed3a7e4f1b8b7ca4e14f9e495459c26490fb50b
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:26:16 2022 +0100

    refactoring

commit 504cafbd5d4bba183d9943b36c60548aae34e402
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:25:44 2022 +0100

    renaming

commit c9780a57e5a048529d36958ba678eddb11759cef
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:24:41 2022 +0100

    removed obsolete import

commit d555e86475e82024f8e1a5fc5b0ac70faa091ee1
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 14:24:04 2022 +0100

    refactored figure detection once

2022-02-06 14:55:38 +01:00

README.md

Vidocp — Visual Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides one function per task. Each function returns some kind of collection of points, depending on the specific task.

Redaction Detection

The snippet below shows how to find the outlines of previous redactions.

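import numpy as np
import pdf2image

from vidocp.redaction_detection import find_redactions

pdf_path = ...
page_index = ...  # pdf2image counts pages from 1

# Render only the requested page and convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

# Outlines of the previous redactions found on the page.
redaction_contours = find_redactions(page)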
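
A common next step is to reduce each outline to an axis-aligned bounding box. The sketch below assumes that every element of `redaction_contours` can be iterated as `(x, y)` pairs; the README only promises "some kind of collection of points", so this is an assumption. If the outlines turn out to be OpenCV-style arrays of shape `(N, 1, 2)`, reshaping each one with `contour.reshape(-1, 2)` first should make it fit. `bounding_box` is a hypothetical helper, not part of vidocp.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) for an iterable of (x, y) pairs."""
    points = list(points)
    if not points:
        raise ValueError("cannot compute the bounding box of an empty outline")
    # Split the pairs into one tuple of x values and one of y values.
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical outline of one redaction, given as corner points.
outline = [(10, 20), (50, 20), (50, 40), (10, 40)]
box = bounding_box(outline)
```

Applied to the real output, this would look like `[bounding_box(c) for c in redaction_contours]`, provided the assumption about the point format holds.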

As a CLI Tool

Core API functionalities can be used through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The below image shows a parsed table, where each table cell has been detected individually.

Redaction Detection

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The below image shows the detected redactions with green outlines.
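
To make the idea of "filled black rectangles" concrete, here is a toy illustration in pure Python. It is not vidocp's implementation, which this README does not describe. It flood-fills connected dark pixels and keeps a region only if its pixels cover nearly all of its bounding box. The function name, the thresholds, and the list-of-rows image format are all assumptions made for the example.

```python
from collections import deque


def find_filled_dark_rectangles(gray, dark_threshold=50, min_fill=0.95, min_area=4):
    """Return inclusive (x0, y0, x1, y1) boxes of dark, nearly solid regions.

    gray is a list of rows, each a list of 0-255 intensity values.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] > dark_threshold:
                continue
            # Breadth-first flood fill over 4-connected dark pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            count = 0
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                count += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and gray[ny][nx] <= dark_threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # A solid rectangle fills its bounding box almost completely.
            box_area = (x1 - x0 + 1) * (y1 - y0 + 1)
            if box_area >= min_area and count / box_area >= min_fill:
                boxes.append((x0, y0, x1, y1))
    return boxes


# White 8x6 page with one solid 3x2 black block and one L-shaped dark stroke.
page = [[255] * 8 for _ in range(6)]
for row in (1, 2):
    for col in (2, 3, 4):
        page[row][col] = 0
for col, row in ((0, 4), (0, 5), (1, 5), (2, 5)):
    page[row][col] = 0

boxes = find_filled_dark_rectangles(page)
```

On this input only the solid block survives; the L-shaped stroke covers four of the six pixels in its bounding box and is rejected by the fill test.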

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The below image shows the detected layout elements on a page.

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The below image shows the detected figure on a page.
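
All four invocations above share the shape `python scripts/annotate.py <pdf> <page> --type <type>`, so annotating several pages can be scripted. The following is a minimal sketch: `build_annotate_command` and `annotate_pages` are hypothetical helpers, and the list of accepted types is limited to the four values shown in this README. The script may accept other values or options not covered here.

```python
import subprocess
import sys

# The --type values that appear in this README.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_annotate_command(pdf_path, page, annotation_type, script="scripts/annotate.py"):
    """Build the argument list for one annotate.py invocation."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    # sys.executable keeps the call inside the active virtual environment.
    return [sys.executable, script, str(pdf_path), str(page), "--type", annotation_type]


def annotate_pages(pdf_path, pages, annotation_type):
    """Run annotate.py once per page, stopping at the first failure."""
    for page in pages:
        subprocess.run(build_annotate_command(pdf_path, page, annotation_type), check=True)


command = build_annotate_command("data/test_pdf.pdf", 7, "table")
```

Calling `annotate_pages("data/test_pdf.pdf", [2, 3, 7], "layout")` from the repository root would then process the pages one after another.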
