Merge in RR/cv-analysis from integrate-new-pyinfra to master
Squashed commit of the following:
commit f27b7eb342838b7a235a062a04363dc417f859ad
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 14:24:03 2022 +0200
refactor table test
commit 9f57cc7d72bffc106c852041666b2f11eb6eacc3
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 14:07:37 2022 +0200
debug bamboo
commit 30911cc5a34559a8b622634ddf974a9860481d17
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 13:22:04 2022 +0200
track test data with dvc
commit 501460c3c99482879ae585872bd67fd67693c47a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 13:19:39 2022 +0200
untrack test data
commit f65ade167802901a6f402618c062df0120279df3
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 12:02:43 2022 +0200
refactor&extend tests
commit 8c9dc41ddeda5b0f630a267e328d1c09f69bdb04
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 09:36:26 2022 +0200
debug bamboo
commit f0b38130502475cf9bfa8632d3b0eb3a84b32b7d
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 09:27:42 2022 +0200
debug bamboo
commit 0f188b4eb5293cf2bc4024fb397f161ad3b867bd
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 09:23:38 2022 +0200
update build script
commit 281e13d822790deefa3d1a4f2519d300d84cded3
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 09:21:31 2022 +0200
refactor tests
commit e90e84cb3b13b2903611985cc9eb3b5b7bf0262e
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 08:54:29 2022 +0200
parametrize analysis_fn for server logic, refactor tests
commit 20734bcd14fec489e80ea6900dba64de4b190398
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 23 08:53:16 2022 +0200
outsource tests from module
commit cd2c41762df1a231f2ed1d43c3b71d2443530ffa
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 14:26:36 2022 +0200
add tests for analyse server logic
commit 16497ac4ec8b0d7064f6d8dd887c189f0d955a1d
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 11:36:34 2022 +0200
debug build script
commit 45688c1c6d9b738cce519edcdc044aae3b800cd1
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 11:33:13 2022 +0200
debug build script
commit 0576140916c0cd9d290dd02225621e5360665d71
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 10:51:51 2022 +0200
update tests
commit fcbecdde95cef46bce46545af65d040cc918447b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 10:04:30 2022 +0200
rename operations, update requirements
commit 7b40f6d643bb332fd7dd0867d64f17db16ede5bb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jun 22 10:03:48 2022 +0200
adjust deployment scripts
commit b66f937d2e0abc79e68bce6ee058bc0bd5cb86e5
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jun 21 13:32:44 2022 +0200
refactor server logic, use operation2function logic for pyinfra server
commit 5e7247f85cacaa6c0643796a98f13642db3e59e1
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Jun 20 17:23:11 2022 +0200
add server logic for pyinfra 2
commit eecb985fed76af9404bd99f0104508efe7d75e35
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Jun 20 16:24:05 2022 +0200
add server logic for pyinfra 2.0.0
... and 3 more commits
cv-analysis — Visual (CV-Based) Document Parsing
This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.
Installation
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides a function for each task. Each function returns a collection of points whose exact form depends on the task.
Redaction Detection (API)
The snippet below shows how to find the outlines of previous redactions.
from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np
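pdf_path = ...
page_index = ...  # pdf2image counts pages from 1
# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before running the detection.
page = np.array(page)
redaction_contours = find_redactions(page)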
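The exact return type of `find_redactions` is not specified here. Assuming it yields OpenCV-style contours (arrays of x, y points shaped `(N, 1, 2)` or `(N, 2)`), a small helper can reduce each contour to an axis-aligned bounding box. `contour_to_bbox` is a hypothetical name used for illustration and is not part of `cv_analysis`.

```python
import numpy as np


def contour_to_bbox(contour):
    """Return (x_min, y_min, x_max, y_max) for one contour.

    Assumes the contour is an array-like of (x, y) points, shaped
    (N, 2) or OpenCV-style (N, 1, 2). Hypothetical helper, not part
    of cv_analysis.
    """
    # Flatten either layout to a plain (N, 2) array of points.
    points = np.asarray(contour).reshape(-1, 2)
    x_min, y_min = points.min(axis=0)
    x_max, y_max = points.max(axis=0)
    return int(x_min), int(y_min), int(x_max), int(y_max)


# Stand-in for one entry of redaction_contours: a rectangle outline.
example_contour = np.array([[[10, 20]], [[110, 20]], [[110, 60]], [[10, 60]]])
print(contour_to_bbox(example_contour))
```

If the assumption holds, applying the helper to every entry of `redaction_contours` gives one box per detected redaction, which is often easier to store or compare than the raw outlines.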
As a CLI Tool
Core API functionalities can be used through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The image below shows a parsed table in which each cell has been detected individually.
Redaction Detection (CLI)
The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The image below shows the detected redactions outlined in green.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The image below shows the detected layout elements on a page.
Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
The image below shows the detected figure on a page.
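All four commands above share one shape: a PDF path, a page number, and a `--type` selector. The following is a minimal sketch of that interface using `argparse`. The argument names `pdf_path` and `page_index`, and the choice to make `--type` required, are assumptions; the actual `scripts/annotate.py` may be implemented differently.

```python
import argparse

# The four values of --type used in the commands above.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_parser():
    """Build a parser mirroring the invocations shown in this README.

    Argument names are illustrative assumptions, not taken from
    scripts/annotate.py.
    """
    parser = argparse.ArgumentParser(description="Annotate one page of a PDF.")
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument("page_index", type=int, help="page to annotate")
    # Assumed to be required here; the real script may define a default.
    parser.add_argument("--type", choices=ANNOTATION_TYPES, required=True,
                        help="which detector to run")
    return parser


args = build_parser().parse_args(["data/test_pdf.pdf", "7", "--type", "table"])
print(args.pdf_path, args.page_index, args.type)
```

Restricting `--type` with `choices` means a mistyped detector name fails at parse time with a usage message instead of reaching the analysis code.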
Running as a service
Building
Build base image
bash setup/docker.sh
Build head image
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
Usage (service)
Shell 1
docker run --rm --net=host cv-analysis
Shell 2
python scripts/client_mock.py --pdf_path /path/to/a/pdf



