
# cv-analysis: Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents,
such as tables or previous redactions.

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides functions for the individual tasks. Each function returns a collection of points; its exact form
depends on the specific task.
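
Because the results are collections of points, a common next step is to reduce them to an axis-aligned bounding box. The following is a minimal sketch, assuming each point is an `(x, y)` pair; the helper `bounding_box` is hypothetical and not part of `cv_analysis`.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def bounding_box(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a collection of (x, y) points.

    Hypothetical helper: assumes every point is an (x, y) pair.
    """
    points = list(points)
    if not points:
        raise ValueError("points must not be empty")
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)
```

If a function returns points in another layout (for example nested arrays), they would need to be flattened to `(x, y)` pairs first.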

#### Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

```python
from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np


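pdf_path = ...  # path to the PDF file
page_index = ...  # page to analyze; pdf2image counts pages from 1

# Render only the requested page and convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

# Outlines of the previous redactions found on the page.
redaction_contours = find_redactions(page)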
```
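
`pdf2image` numbers pages starting at 1 (`first_page` and `last_page`), which is easy to get wrong when the caller works with 0-based indices. The sketch below shows one way to make the conversion explicit. The helper `page_range_kwargs` is hypothetical and not part of this repository.

```python
from typing import Dict


def page_range_kwargs(page_index: int) -> Dict[str, int]:
    """Map a 0-based page index to pdf2image's 1-based first_page/last_page.

    Hypothetical helper, not part of cv_analysis.
    """
    if page_index < 0:
        raise ValueError("page_index must be non-negative")
    page_number = page_index + 1
    return {"first_page": page_number, "last_page": page_number}
```

It could then be used as `pdf2image.convert_from_path(pdf_path, **page_range_kwargs(0))[0]` to render the first page of the document.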

## As a CLI Tool

Core API functionality is also available through a CLI.

### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table in which each cell has been detected individually.

![Table Parsing Demonstration](data/table_parsing.png)

### Redaction Detection (CLI)

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```

The image below shows the detected redactions with green outlines.

![Redaction Detection Demonstration](data/redaction_detection.png)
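
One way to read "filled black rectangles" is that nearly every pixel inside a candidate box is dark. The sketch below illustrates that idea only; it is not the detection method this repository implements. The function name `dark_fill_ratio`, the `(x0, y0, x1, y1)` box convention, and the threshold of 50 are all assumptions made for the example.

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def dark_fill_ratio(gray: Sequence[Sequence[int]], box: Box, threshold: int = 50) -> float:
    """Return the fraction of pixels inside `box` that are darker than `threshold`.

    `gray` is a grayscale image as rows of 0-255 values. The box is assumed
    to lie fully inside the image. Illustrative only, not cv_analysis code.
    """
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        raise ValueError("box must have positive area")
    dark = sum(1 for row in gray[y0:y1] for value in row[x0:x1] if value < threshold)
    return dark / area
```

A ratio close to 1.0 suggests a solid dark region, while outlined or text-filled boxes score noticeably lower.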

### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the detected layout elements on a page.

![Layout Parsing Demonstration](data/layout_parsing.png)

### Figure Detection

The figure detection utility targets figures specifically, since the generic layout parsing utility can miss them.

```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```

The image below shows the detected figure on a page.

![Figure Detection Demonstration](data/figure_detection.png)
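
The CLI calls in this section all follow the same pattern: a PDF path, a page number, and a `--type` value. When annotating many pages, it can help to build these calls programmatically. The sketch below assumes the script is run from the repository root and accepts only the four `--type` values shown in this README. The helpers `build_annotate_command` and `annotate` are hypothetical.

```python
import subprocess
import sys
from typing import List

# The --type values that appear in this README; others may or may not exist.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_annotate_command(pdf_path: str, page: int, annotation_type: str) -> List[str]:
    """Build the argument list for one scripts/annotate.py call (hypothetical helper)."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return [sys.executable, "scripts/annotate.py", pdf_path, str(page), "--type", annotation_type]


def annotate(pdf_path: str, page: int, annotation_type: str) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_annotate_command(pdf_path, page, annotation_type), check=True)
```

For example, `annotate("data/test_pdf.pdf", 7, "table")` would issue the same call as the table parsing command in this section.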

## Running as a Service

### Building

Build the base image:

```bash
bash setup/docker.sh
```

Build the head image:

```bash
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
```

### Usage (service)

Shell 1 (start the service):

```bash
docker run --rm --net=host cv-analysis
```

Shell 2 (run the mock client against a PDF):

```bash
python scripts/client_mock.py --pdf_path /path/to/a/pdf
```