Fix: The cell width and height were rounded to int in the table constructor. The imprecision from rounding accumulated when cells were stacked in a row or column, leading to gaps at the bottom and right-hand edges of tables. The rounding has been removed from the table constructor and left to the cell constructor: cells are derived from the Rectangle class, which does the rounding itself. This eliminates the accumulated gaps in the tables.
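The effect described in the fix can be reproduced with a small, self-contained sketch. It does not use the repository's table, cell, or Rectangle classes; the two function names are hypothetical and only contrast rounding the cell size once before stacking with rounding each edge position separately.

```python
def edges_rounded_early(total_width: float, n_cells: int) -> list[int]:
    """Round the cell width once, then stack cells of that rounded width."""
    cell_width = round(total_width / n_cells)
    return [i * cell_width for i in range(n_cells + 1)]


def edges_rounded_late(total_width: float, n_cells: int) -> list[int]:
    """Keep the cell width as a float and round each edge position on its own."""
    cell_width = total_width / n_cells
    return [round(i * cell_width) for i in range(n_cells + 1)]


# A 100 px wide table split into 3 columns (illustrative numbers).
early = edges_rounded_early(100, 3)
late = edges_rounded_late(100, 3)
print(early)  # [0, 33, 66, 99]  -> 1 px gap at the right edge
print(late)   # [0, 33, 67, 100] -> last edge meets the table border
```

With early rounding, the error of each cell adds up across the row, so the gap grows with the number of cells. With late rounding, each edge is off by at most half a pixel regardless of how many cells precede it.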
cv-analysis — Visual (CV-Based) Document Parsing
This repository implements computer vision based approaches for detecting and parsing visual features such as tables or previous redactions in documents.
Installation
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides functions for the individual tasks. Each returns a collection of points whose exact form depends on the task.
Redaction Detection (API)
The snippet below shows how to find the outlines of previous redactions.
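from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np
pdf_path = ...
page_index = ...
# Render only the requested page; pdf2image counts pages starting at 1.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to find_redactions.
page = np.array(page)
redaction_contours = find_redactions(page)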
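Since the functions return collections of points, a common next step is to reduce each outline to an axis-aligned bounding box. The sketch below assumes each outline is an iterable of (x, y) pairs; the exact return type of find_redactions is not specified here, and the helper name bounding_box and the example data are hypothetical. If the outlines turn out to be OpenCV-style arrays of shape (N, 1, 2), they would need a contour.reshape(-1, 2) first.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]


def bounding_box(points: Iterable[Point]) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) for a collection of (x, y) points."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical stand-in for one detected outline (assumed format, not real output).
example_contour = [(10, 20), (110, 20), (110, 45), (10, 45)]
print(bounding_box(example_contour))  # (10, 20, 110, 45)
```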
As a CLI Tool
Core API functionalities can be used through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The below image shows a parsed table, where each table cell has been detected individually.
Redaction Detection (CLI)
The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The below image shows the detected redactions with green outlines.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The below image shows the detected layout elements on a page.
Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
The below image shows the detected figure on a page.
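The four invocations above share one shape, so they can be generated from a single helper when several annotation types are needed for the same page. This is a minimal sketch: the helper name annotate_command is hypothetical, and it assumes the number after the PDF path is a page number. Only the script path and the --type values shown above are used.

```python
import shlex

# The --type values that appear in the commands above.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def annotate_command(pdf_path: str, page: int, annotation_type: str) -> list[str]:
    """Build the argument list for one scripts/annotate.py call."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return ["python", "scripts/annotate.py", pdf_path, str(page), "--type", annotation_type]


commands = [annotate_command("data/test_pdf.pdf", 7, t) for t in ANNOTATION_TYPES]
for command in commands:
    # Print the commands only; nothing is executed here.
    print(shlex.join(command))
```

To actually run them, each list could be passed to subprocess.run(command, check=True) from the repository root.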
Running as a service
Building
Build base image
bash setup/docker.sh
Build head image
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
Usage (service)
Shell 1
docker run --rm --net=host cv-analysis
Shell 2
python scripts/client_mock.py --pdf_path /path/to/a/pdf



