Merge in RR/vidocp from layout_detection_version_2 to master
Squashed commit of the following:
commit d443e95ad8143bed3efc74d9e38640498d8d16bf
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 20:16:13 2022 +0100
readme updated
commit 953ad696932454ce851544ed016f9e64bcc12080
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date: Sat Feb 5 20:14:59 2022 +0100
added layout parsing logic
# Vidocp — Visual Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features such as tables or
previous redactions in documents.

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides functions for the individual tasks. Each returns some kind of collection of points, depending on
the specific task.

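Because every task returns points in some form, downstream code often only needs a bounding box. The following is a minimal sketch, assuming a result can be iterated as `(x, y)` pairs. `bounding_box` is a hypothetical helper and not part of `vidocp`, and the actual return types may differ per task (OpenCV-style contour arrays, for example, would first need something like `contour.reshape(-1, 2)`).

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def bounding_box(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a non-empty collection of points.

    Hypothetical helper: assumes each point unpacks as an (x, y) pair.
    """
    xs, ys = zip(*points)  # raises ValueError on an empty collection
    return min(xs), min(ys), max(xs), max(ys)
```

For instance, `bounding_box([(10, 20), (40, 20), (40, 35), (10, 35)])` gives `(10, 20, 40, 35)`.
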
#### Redaction Detection

The snippet below shows how to find the outlines of previous redactions.

```python
from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np

pdf_path = ...
page_index = ...  # note: pdf2image counts pages from 1

# Render only the requested page, then convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

redaction_contours = find_redactions(page)
```
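
To process more than one page, the same steps can be wrapped in a loop. This is a sketch under assumptions: `detect_on_pages` is a hypothetical helper, not part of `vidocp`, and it takes the detector as an argument so that `find_redactions` (or any function with the same call shape) can be passed in.

```python
from typing import Any, Callable, Dict, Iterable


def detect_on_pages(
    pages: Iterable[Any],
    detector: Callable[[Any], Any],
    start: int = 1,
) -> Dict[int, Any]:
    """Apply `detector` to each page image and key the results by page number.

    Hypothetical helper: `pages` is any iterable of page images (for example
    NumPy arrays), and `start` is the number assigned to the first page.
    """
    return {number: detector(page) for number, page in enumerate(pages, start=start)}
```

With the imports from the snippet above, `detect_on_pages(map(np.array, pdf2image.convert_from_path(pdf_path)), find_redactions)` would collect the results per page, assuming `find_redactions` accepts each page array as shown.
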

### As a CLI Tool

Core API functionality is also available through a CLI.

#### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table in which each cell has been detected individually.



#### Redaction Detection

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

```bash
python scripts/annotate.py <path to pdf> 0 --type redaction
```

The image below shows the detected redactions with green outlines.



#### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables, and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the detected layout elements on a page.


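
The three invocations above share one shape: a PDF path, a page number, and `--type`. To annotate several pages, the commands can be generated from Python. This is a minimal sketch that assumes only the arguments shown in this README; `annotate_command` is a hypothetical helper, and the script may accept options not covered here.

```python
from typing import List

# Annotation types that appear in the examples of this README.
KNOWN_TYPES = ("table", "redaction", "layout")


def annotate_command(pdf_path: str, page: int, kind: str) -> List[str]:
    """Build the argument list for one `scripts/annotate.py` call."""
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown annotation type: {kind!r}")
    return ["python", "scripts/annotate.py", pdf_path, str(page), "--type", kind]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.
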