# cv-analysis - Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents,
such as tables or previous redactions.

## API

Input message:

```json
{
  "targetFilePath": {
    "pdf": "absolute file path",
    "vlp_output": "absolute file path"
  },
  "responseFilePath": "absolute file path",
  "operation": "table_image_inference"
}
```
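
As a minimal sketch, the input message above could be assembled in Python as follows. The helper name `build_request` and the example paths are hypothetical and only serve as illustration; how the message is delivered to the service is not covered here.

```python
import json


def build_request(pdf_path, vlp_output_path, response_path,
                  operation="table_image_inference"):
    """Assemble a request dict with the fields shown in the input message.

    Hypothetical helper: it only mirrors the documented message structure.
    """
    return {
        "targetFilePath": {
            "pdf": pdf_path,
            "vlp_output": vlp_output_path,
        },
        "responseFilePath": response_path,
        "operation": operation,
    }


# Placeholder paths, chosen for illustration only.
message = build_request(
    "/data/in/doc.pdf",
    "/data/in/doc.vlp.json",
    "/data/out/doc.response.json",
)

# Serialize to the JSON text that would be sent as the message body.
payload = json.dumps(message, indent=2)
print(payload)
```

Keeping the message construction in one function makes it easy to check that all required fields are present before sending.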

The response is uploaded to storage at the path given in the `responseFilePath` field. It has the following structure:

```json
{
  ...,
  "data": [
    {
      "pageNum": 0,
      "bbox": {
        "x1": 55.3407,
        "y1": 247.0246,
        "x2": 558.5602,
        "y2": 598.0585
      },
      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
      "label": "table",
      "tableLines": [
        {
          "x1": 0,
          "y1": 16,
          "x2": 1399,
          "y2": 16
        },
        ...
      ],
      "imageInfo": {
        "height": 693,
        "width": 1414
      }
    },
    ...
  ]
}
```
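
A response of this shape could be consumed roughly as sketched below. The helpers `tables_by_page` and `bbox_size` are hypothetical, the sample body reuses the illustrative values from the structure above with the elided fields left out, and the size calculation assumes that `x2 >= x1` and `y2 >= y1`.

```python
import json

# Sample response body with the illustrative values from the structure above.
# Elided fields ("...") are simply omitted here.
response_text = """
{
  "data": [
    {
      "pageNum": 0,
      "bbox": {"x1": 55.3407, "y1": 247.0246, "x2": 558.5602, "y2": 598.0585},
      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
      "label": "table",
      "tableLines": [{"x1": 0, "y1": 16, "x2": 1399, "y2": 16}],
      "imageInfo": {"height": 693, "width": 1414}
    }
  ]
}
"""


def tables_by_page(response):
    """Group entries labelled 'table' by page number (hypothetical helper)."""
    grouped = {}
    for entry in response.get("data", []):
        if entry.get("label") == "table":
            grouped.setdefault(entry["pageNum"], []).append(entry)
    return grouped


def bbox_size(bbox):
    """Return (width, height) of a bbox, assuming x2 >= x1 and y2 >= y1."""
    return bbox["x2"] - bbox["x1"], bbox["y2"] - bbox["y1"]


response = json.loads(response_text)
tables = tables_by_page(response)

for page_num, entries in tables.items():
    for entry in entries:
        width, height = bbox_size(entry["bbox"])
        # Print page, id, bounding-box size and the number of detected lines.
        print(page_num, entry["uuid"], round(width, 2), round(height, 2),
              len(entry["tableLines"]))
```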

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides functions for the individual tasks. Each function returns some kind of collection of points, whose
exact form depends on the specific task.

#### Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

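```python
from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np

pdf_path = ...    # path to the PDF file
page_index = ...  # page to analyse; pdf2image counts pages starting at 1

# Render only the selected page and take the single resulting PIL image.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to the detector.
page = np.array(page)

redaction_contours = find_redactions(page)
```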

## As a CLI Tool

Core API functionality is also available through a CLI.

### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table, where each table cell has been detected individually.

![Table Parsing Demonstration](data/table_parsing.png)

### Redaction Detection (CLI)

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```

The image below shows the detected redactions with green outlines.

![Redaction Detection Demonstration](data/redaction_detection.png)

### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables, and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the detected layout elements on a page.

![Layout Parsing Demonstration](data/layout_parsing.png)

### Figure Detection

The figure detection utility specifically detects figures, which the generic layout parsing utility can miss.

```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```

The image below shows the detected figure on a page.

![Figure Detection Demonstration](data/figure_detection.png)

## Running as a service

### Building

Build the base image:

```bash
bash setup/docker.sh
```

Build the head image:

```bash
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
```

### Usage (service)

Shell 1

```bash
docker run --rm --net=host cv-analysis
```

Shell 2

```bash
python scripts/client_mock.py --pdf_path /path/to/a/pdf
```