Merge in RR/cv-analysis from evaluate_layout_detection to master
Squashed commit of the following:
commit 8ec2f69fc61d1e15bd502b0a2c1f720cbec2b34e
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Tue Aug 23 15:07:21 2022 +0200
repaired is_not_included() logic (did drop the outer rectangle, not the included)
commit 97be081d1e60989313924ceac0bfb3062229411e
Merge: 2c28fa2 2b5c4f1
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Tue Aug 23 14:28:14 2022 +0200
Merge branch 'master' of ssh://git.iqser.com:2222/rr/cv-analysis into evaluate_layout_detection
commit 2c28fa280b7eff922c715245fffe69702c7e6742
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Tue Aug 23 13:50:17 2022 +0200
del print statements
commit c60121fc4faebc5de556ec0ab7a3af4f815f7ce1
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Mon Aug 22 10:51:52 2022 +0200
few changes to connect_rects.py
commit a99719905d58cbe856fa020177abd7e317c1d072
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Thu Aug 18 08:37:12 2022 +0200
layout parsing improved with connect_rects.py
commit d693688a0f0d63395cfd36645de7b3417f64de30
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Tue Aug 2 09:31:19 2022 +0200
removed vizlogger instances
cv-analysis — Visual (CV-Based) Document Parsing
This repository implements computer vision based approaches for detecting and parsing visual features such as tables or previous redactions in documents.
Installation
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provided functions for the individual tasks that all return some kind of collection of points, depending on the specific task.
Redaction Detection (API)
The below snippet shows hot to find the outlines of previous redactions.
from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np
pdf_path = ...
page_index = ...
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)
redaction_contours = find_redactions(page)
As a CLI Tool
Core API functionalities can be used through a CLI.
Table Parsing
The tables parsing utility detects and segments tables into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The below image shows a parsed table, where each table cell has been detected individually.
Redaction Detection (CLI)
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The below image shows the detected redactions with green outlines.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The below image shows the detected layout elements on a page.
Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
The below image shows the detected figure on a page.
Running as a service
Building
Build base image
bash setup/docker.sh
Build head image
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
Usage (service)
Shell 1
docker run --rm --net=host --rm cv-analysis
Shell 2
python scripts/client_mock.py --pdf_path /path/to/a/pdf



