Merge in RR/cv-analysis from diff-font-sizes-on-page to master
Squashed commit of the following:
commit d1b32a3e8fadd45d38040e1ba96672ace240ae29
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 30 14:43:30 2022 +0200
add tests for figure detection first iteration
commit c38a7701afaad513320f157fe7188b3f11a682ac
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 30 14:26:08 2022 +0200
update text tests with new test cases
commit ccc0c1a177c7d69c9575ec0267a492c3eef008e3
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Wed Jun 29 23:09:24 2022 +0200
added fixture for different scaled text on page and parameter for different font style
commit 5f36a634caad2849e673de7d64abb5b6c3a6055f
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 17:03:52 2022 +0200
add pdf2pdf annotate script for figure detection
commit 7438c170371e166e82ab19f9dfdf1bddd89b7bb3
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 16:24:52 2022 +0200
optimize algorithm
commit 93bf8820f856d3815bab36b13c0df189c45d01e0
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 16:11:15 2022 +0200
black
commit 59c639eec7d3f9da538b0ad6cd6215456c92eb58
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 16:10:39 2022 +0200
add tests for figure detection pipeline
commit bada688d88231843e9d299d255d9c4e0d5ca9788
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 13:34:36 2022 +0200
refactor tests
commit 614388a18b46d670527727c11f63e8174aed3736
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 13:34:14 2022 +0200
introduce pipeline logic for figure detection
commit 7195f892d543294829aebe80e260b4395b89cb36
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 11:58:41 2022 +0200
update reqs
commit 4408e7975853196c5e363dd2ddf62e15fe6f4944
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 11:56:16 2022 +0200
add figure detection test
commit 5ff472c2d96238ca2bc1d2368d3d02e62db98713
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 11:56:09 2022 +0200
add figure detection test
commit 66c1307e57c84789d64cb8e41d8e923ac98eebde
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 10:36:50 2022 +0200
refactor draw boxes to work as intended on inversed image
commit 00a39050d051ae43b2a8f2c4efd6bfbd2609dead
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 28 10:36:11 2022 +0200
refactor module structure
commit f8af01894c387468334a332e75f7dbf545a91f86
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Jun 27 17:07:47 2022 +0200
add: figure detection now agnostic to input image background color, refactor tests
commit 3bc63da783bced571d53b29b6d82648c9f93e886
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Jun 27 14:31:15 2022 +0200
add text removal tests
commit 6e794a7cee3fd7633aa5084839775877b0f8794c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Jun 27 12:12:27 2022 +0200
figure detection tests WIP
commit f8b20d4c9845de6434142e3dab69ce467fbc7a75
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jun 24 15:39:37 2022 +0200
add tests for figure_detection WIP
commit f2a52a07a5e261962214dff40ba710c93993f6fb
Author: llocarnini <lillian.locarnini@iqser.com>
Date: Fri Jun 24 14:28:44 2022 +0200
added third test case "figure_and_text"
commit 8f45c88278cdcd32a121ea8269c8eca816bffd0b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jun 24 13:25:17 2022 +0200
add tests for figure_detection
cv-analysis — Visual (CV-Based) Document Parsing
This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.
Installation
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides functions for the individual tasks. Each function returns a collection of points whose exact form depends on the task.
Redaction Detection (API)
The snippet below shows how to find the outlines of previous redactions.
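import numpy as np
import pdf2image
from cv_analysis.redaction_detection import find_redactions
pdf_path = ...
page_index = ...  # pdf2image page numbers are 1-based
# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to the detector.
page = np.array(page)
redaction_contours = find_redactions(page)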
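The return type of find_redactions is described above only as a collection of points. As a minimal sketch, assuming each contour is a sequence of (x, y) pairs, the hypothetical helper bounding_box below (not part of cv_analysis) reduces one contour to an axis-aligned box. If the contours turn out to be OpenCV-style arrays of shape (N, 1, 2), they would need contour.reshape(-1, 2) first.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]


def bounding_box(contour: Iterable[Point]) -> Box:
    """Return (x_min, y_min, x_max, y_max) for one contour.

    Assumes the contour is a non-empty iterable of (x, y) pairs. This is an
    assumption about the output format, not documented behaviour.
    """
    points = [(int(x), int(y)) for x, y in contour]
    if not points:
        raise ValueError("contour must contain at least one point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up example contour: the four corners of a rectangle.
example_contour = [(40, 100), (220, 100), (220, 130), (40, 130)]
example_box = bounding_box(example_contour)
```

A box in this form is convenient for cropping or for comparing detections against expected regions in tests.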
As a CLI Tool
Core API functionality is also available through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The image below shows a parsed table, where each table cell has been detected individually.
Redaction Detection (CLI)
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The image below shows the detected redactions with green outlines.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The image below shows the detected layout elements on a page.
Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
The image below shows the detected figure on a page.
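The four invocations above share one shape: a PDF path, a page number, and a --type value. The following is a minimal sketch of a wrapper that assembles this argument list for use with subprocess.run. The function name build_annotate_command and the constant ANNOTATION_TYPES are hypothetical, and only the four --type values shown above are assumed to exist.

```python
from typing import List

# The --type values that appear in the commands above; the script may accept others.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_annotate_command(
    pdf_path: str,
    page: int,
    annotation_type: str,
    python: str = "python",
) -> List[str]:
    """Build the argv list for scripts/annotate.py (hypothetical helper)."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return [python, "scripts/annotate.py", pdf_path, str(page), "--type", annotation_type]


command = build_annotate_command("data/test_pdf.pdf", 3, "figure")
# To run it from the repository root:
#   import subprocess; subprocess.run(command, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with PDF paths that contain spaces.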
Running as a service
Building
Build base image
bash setup/docker.sh
Build head image
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
Usage (service)
Shell 1
docker run --rm --net=host cv-analysis
Shell 2
python scripts/client_mock.py --pdf_path /path/to/a/pdf



