Merge in RR/cv-analysis from add-pdf-coord-conversion to master
Squashed commit of the following:
commit f56b7b45feb78142b032ef0faae2ca8dd020e6c5
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jul 7 11:26:46 2022 +0200
update pyinfra
commit 9086ef0a2059688fb8dd5559cda831bbbd36362b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jul 7 11:21:53 2022 +0200
update input metadata keys
commit 55f147a5848e22ea62242ea883a0ce53ef1c04a5
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jul 7 09:16:16 2022 +0200
update to new input metadata signature
commit df4652fb027f734f2613e4adb7bc5b17edee62e9
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jul 6 16:55:36 2022 +0200
refactor
commit e52c674085a9c7411c55a2e0993aa34622284317
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Jul 6 16:15:21 2022 +0200
update build script, refactor
commit 1f874aea591f25544aaa3f39a4e38fa50a24615e
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jul 5 17:01:15 2022 +0200
add rotation formatter
commit b78a69741287a4cd38a90ace98f67e8f1b803737
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Jul 5 09:26:27 2022 +0200
refactor
commit b3155b8e072530f99114f3ee9135e73afc8f85cb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jul 1 15:06:45 2022 +0200
made assertion robust to floating point precision
commit 4169102a6b5053500a3db2d789d265c2c77d56a4
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jul 1 15:06:01 2022 +0200
improve banner
commit dea74593d925c802489e5400297b48a9729038f0
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jul 1 14:28:08 2022 +0200
introduce derotation logic for rectangles from rotated pdfs, introduce continuous option for coordinates in Rectangle class
commit d07e1dc2731ea7ae9887cc02bb98155bf1565a0d
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jul 1 10:39:38 2022 +0200
introduce table parsing formatter to convert pixel values to inches
commit 67ff6730dd7073a0fc9e9698904325dea9537c5b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Jul 1 08:06:42 2022 +0200
fixed duplicate logging
commit 6c025409415329028f697bb99986cd0912c7ed54
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Jun 30 17:10:32 2022 +0200
add pyinfra mock script
cv-analysis — Visual (CV-Based) Document Parsing
This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.
Installation
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis
python -m venv env
source env/bin/activate
pip install -e .
pip install -r requirements.txt
dvc pull
Usage
As an API
The module provides functions for the individual tasks. Each returns a collection of points whose exact form depends on the task.
Redaction Detection (API)
The snippet below shows how to find the outlines of previous redactions.
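from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np
pdf_path = ...
page_index = ...  # pdf2image counts pages from 1, not 0
# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to the detector.
page = np.array(page)
redaction_contours = find_redactions(page)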
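Because find_redactions receives a rendered image, its points are presumably in pixel coordinates of that image rather than in page units. The following is a minimal sketch of converting such points to inches. It assumes the points are (x, y) pixel pairs and that the page was rendered at pdf2image's default of 200 dpi. The helper name pixels_to_inches is hypothetical and is not part of cv_analysis.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def pixels_to_inches(points: Iterable[Point], dpi: float = 200.0) -> List[Point]:
    """Convert (x, y) pixel coordinates to inches.

    Hypothetical helper, not part of cv_analysis. Assumes the image was
    rendered at `dpi` dots per inch (200 is the pdf2image default).
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # One inch spans `dpi` pixels, so dividing by dpi yields inches.
    return [(x / dpi, y / dpi) for x, y in points]
```

If the page is rendered with a different dpi argument, the same value has to be passed here, otherwise the converted coordinates will be scaled incorrectly.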
As a CLI Tool
Core API functionalities can be used through a CLI.
Table Parsing
The table parsing utility detects tables and segments them into individual cells.
python scripts/annotate.py data/test_pdf.pdf 7 --type table
The below image shows a parsed table, where each table cell has been detected individually.
Redaction Detection (CLI)
The redaction detection utility detects previous redactions in PDFs (filled black rectangles).
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
The below image shows the detected redactions with green outlines.
Layout Parsing
The layout parsing utility detects elements such as paragraphs, tables and figures.
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
The below image shows the detected layout elements on a page.
Figure Detection
The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
The below image shows the detected figure on a page.
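The four invocations above share one shape: a PDF path, a number that is presumably the page to annotate, and a --type value. The following is a minimal sketch of assembling such a command programmatically, for example to annotate several pages in a loop. The helper name build_annotate_command is hypothetical, and the list of types contains only the values shown above; the script may accept others.

```python
from typing import List

# Only the --type values that appear in the examples above.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_annotate_command(pdf_path: str, page: int, annotation_type: str) -> List[str]:
    """Build the argument list for scripts/annotate.py (hypothetical helper).

    Assumes the second positional argument is a page number, as the
    examples above suggest.
    """
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return ["python", "scripts/annotate.py", pdf_path, str(page), "--type", annotation_type]


# The resulting list can be passed to subprocess.run(cmd, check=True)
# from the repository root.
```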
Running as a service
Building
Build base image
bash setup/docker.sh
Build head image
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
Usage (service)
Shell 1
docker run --rm --net=host cv-analysis
Shell 2
python scripts/client_mock.py --pdf_path /path/to/a/pdf



