# cv-analysis - Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features, such as tables or
previous redactions, in documents.

## API

Input message:

```json
{
  "targetFilePath": {
    "pdf": "absolute file path",
    "vlp_output": "absolute file path"
  },
  "responseFilePath": "absolute file path",
  "operation": "table_image_inference"
}
```

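As a minimal sketch, a message with the shape shown above could be assembled in Python like this. The helper `build_request` and the example paths are hypothetical and not part of this repository; only the field names and the `table_image_inference` operation are taken from the example.

```python
import json


def build_request(pdf_path, vlp_output_path, response_path, operation="table_image_inference"):
    """Assemble an input message with the fields from the example message (hypothetical helper)."""
    return {
        "targetFilePath": {
            "pdf": pdf_path,
            "vlp_output": vlp_output_path,
        },
        "responseFilePath": response_path,
        "operation": operation,
    }


# Placeholder paths for illustration only; substitute real absolute paths.
message = build_request("/data/in/doc.pdf", "/data/in/doc.vlp.json", "/data/out/doc.cv.json")
print(json.dumps(message, indent=2))
```

How the message is delivered to the service is not covered by this sketch.
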
The response is uploaded to the storage location specified in the `responseFilePath` field. It has the following structure:

```json
{
  ...,
  "data": [
    {
      "pageNum": 0,
      "bbox": {
        "x1": 55.3407,
        "y1": 247.0246,
        "x2": 558.5602,
        "y2": 598.0585
      },
      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
      "label": "table",
      "tableLines": [
        {
          "x1": 0,
          "y1": 16,
          "x2": 1399,
          "y2": 16
        },
        ...
      ],
      "imageInfo": {
        "height": 693,
        "width": 1414
      }
    },
    ...
  ]
}
```

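The following sketch shows one way a consumer might read such a response. It assumes the response is JSON with a `data` list shaped like the example; `table_boxes` is a hypothetical helper, and the units of the `bbox` coordinates are not specified in this README.

```python
import json


def table_boxes(response):
    """Yield (page number, width, height) for every entry labelled 'table'."""
    for entry in response.get("data", []):
        if entry.get("label") != "table":
            continue
        bbox = entry["bbox"]
        yield entry["pageNum"], bbox["x2"] - bbox["x1"], bbox["y2"] - bbox["y1"]


# A trimmed response that reuses the values from the example structure.
response = json.loads("""
{
  "data": [
    {
      "pageNum": 0,
      "bbox": {"x1": 55.3407, "y1": 247.0246, "x2": 558.5602, "y2": 598.0585},
      "label": "table",
      "tableLines": [{"x1": 0, "y1": 16, "x2": 1399, "y2": 16}],
      "imageInfo": {"height": 693, "width": 1414}
    }
  ]
}
""")

for page_num, width, height in table_boxes(response):
    print(f"page {page_num}: table of {width:.1f} x {height:.1f}")
```
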
## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides a function for each task. Each function returns a collection of points whose exact form depends on
the task.

#### Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

```python
import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...
page_index = ...  # pdf2image counts pages from 1

# Render only the requested page, then convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

redaction_contours = find_redactions(page)
```

### As a CLI Tool

The core API functionality is also available through a CLI.

#### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table in which each cell has been detected individually.

![Disclaimers](data/table_parsing.png)

#### Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```

The image below shows the detected redactions outlined in green.

![Disclaimers](data/redaction_detection.png)

#### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables, and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the layout elements detected on a page.

![Layout Parsing Demonstration](data/layout_parsing.png)

#### Figure Detection

The figure detection utility targets figures specifically, since the generic layout parsing utility can miss them.

```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```

The image below shows the figure detected on a page.

![Figure Detection Demonstration](data/figure_detection.png)

## Running as a service

### Building

Build the base image:

```bash
bash setup/docker.sh
```

Build the head image:

```bash
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
```

### Usage (service)

In one shell, start the service container:

```bash
docker run --rm --net=host cv-analysis
```

In a second shell, run the mock client against a PDF:

```bash
python scripts/client_mock.py --pdf_path /path/to/a/pdf
```