Matthias Bisping 8432cfe514 Pull request #8: figure detection
Merge in RR/vidocp from text_removal to master

Squashed commit of the following:

commit b65374c512ce9ba07fa522d591c83db3de5d7d55
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 01:03:12 2022 +0100

    readme updated

commit 1c1f7a395a00fa505cf19e1ad87d8c34faa6ef5b
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 01:00:46 2022 +0100

    figure detection version 1 completed

commit f257660823ef8682e9fedda9921ad946ef2ade76
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 00:37:03 2022 +0100

    wip

commit 2e89b28f4a69da80570597c823b3b7a591788d0a
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Sun Feb 6 00:23:56 2022 +0100

    wip
2022-02-06 01:04:15 +01:00

100 lines
2.2 KiB
Markdown

# Vidocp &mdash; Visual Document Parsing
This repository implements computer vision based approaches for detecting and parsing visual features such as tables or
previous redactions in documents.
## Installation
```bash
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
```
## Usage
### As an API
The module provided functions for the individual tasks that all return some kid of collection of points, depending on
the specific task.
#### Redaction Detection
The below snippet shows hot to find the outlines of previous redactions.
```python
from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np
pdf_path = ...
page_index = ...
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)
redaction_contours = find_redactions(page)
```
### As a CLI Tool
Core API functionalities can be used through a CLI.
#### Table Parsing
The tables parsing utility detects and segments tables into individual cells.
```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```
The below image shows a parsed table, where each table cell has been detected individually.
![](data/table_parsing.png)
#### Redaction Detection
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```
The below image shows the detected redactions with green outlines.
![](data/redaction_detection.png)
#### Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```
The below image shows the detected layout elements on a page.
![](data/layout_parsing.png)
#### Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```
The below image shows the detected figure on a page.
![](data/figure_detection.png)