cv-analysis-service/README.md

# Vidocp &mdash; Visual Document Parsing

This repository implements computer vision based approaches for detecting and parsing visual features such as tables or
previous redactions in documents.

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provided functions for the individual tasks that all return some kid of collection of points, depending on
the specific task.

#### Redaction Detection

The below snippet shows hot to find the outlines of previous redactions.

```python

from vidocp.redaction_detection import find_redactions
import pdf2image
import numpy as np


pdf_path = ...
page_index = ...

page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

redaction_contours = find_redactions(page)
```


### As a CLI Tool


Core API functionalities can be used through a CLI.


#### Table Parsing

The tables parsing utility detects and segments tables into individual cells.
```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The below image shows a parsed table, where each table cell has been detected individually.

![](data/table_parsing.png)


#### Redaction Detection

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
```bash
python scripts/annotate.py <path to pdf> 0 --type redaction
```

The below image shows the detected redactions with green outlines.

![](data/redaction_detection.png)


#### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.
```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The below image shows the detected layout elements on a page.

![](data/layout_parsing.png)