cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides functions for the individual tasks. Each returns a collection of points whose exact form depends on the task.

Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

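import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...
page_index = ...  # pdf2image counts pages from 1

# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before analysis.
page = np.array(page)

redaction_contours = find_redactions(page)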
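
The exact return type of find_redactions is not documented here beyond "a collection of points". As a minimal sketch, assuming each result is an OpenCV-style contour (a sequence of points, each either `(x, y)` or wrapped as `[[x, y]]`), the hypothetical helper below reduces contours to axis-aligned bounding boxes. The name `bounding_boxes` and the box layout are illustrative and not part of cv-analysis.

```python
from typing import Iterable, Iterator, List, Tuple

# (x_min, y_min, x_max, y_max); this layout is an assumption for illustration.
Box = Tuple[int, int, int, int]


def _iter_points(contour: Iterable) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) pairs, unwrapping OpenCV-style [[x, y]] nesting if present."""
    for point in contour:
        while len(point) == 1:
            point = point[0]
        x, y = point
        yield int(x), int(y)


def bounding_boxes(contours: Iterable) -> List[Box]:
    """Return one bounding box per non-empty contour."""
    boxes: List[Box] = []
    for contour in contours:
        points = list(_iter_points(contour))
        if not points:
            continue  # skip empty contours rather than failing
        xs, ys = zip(*points)
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

If the assumption holds, `bounding_boxes(redaction_contours)` would give one rectangle per detected redaction, which is convenient for cropping or for comparing detections across pages.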

As a CLI Tool

Core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The image below shows a parsed table, where each table cell has been detected individually.

Table Parsing Demonstration

Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The image below shows the detected redactions with green outlines.

Redaction Detection Demonstration

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The image below shows the detected layout elements on a page.

Layout Parsing Demonstration

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The image below shows the detected figure on a page.

Figure Detection Demonstration
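
All four CLI invocations above share one shape: a PDF path, a page number, and a `--type` value. As a rough sketch, the hypothetical helper below builds those command lines so they can be scripted over several pages. Only the script path, the positional arguments, and the four `--type` values are taken from this README; the function name and the validation are illustrative assumptions.

```python
import shlex
from typing import List

# The --type values shown in this README; other values may exist but are not documented here.
KNOWN_TYPES = ("table", "redaction", "layout", "figure")


def annotate_command(pdf_path: str, page: int, kind: str) -> List[str]:
    """Build the argument list for scripts/annotate.py as used in the examples above."""
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown annotation type: {kind!r}")
    return ["python", "scripts/annotate.py", str(pdf_path), str(page), "--type", kind]


if __name__ == "__main__":
    # Print the commands instead of running them; pass a list to subprocess.run to execute.
    for page, kind in [(7, "table"), (2, "redaction"), (7, "layout"), (3, "figure")]:
        print(shlex.join(annotate_command("data/test_pdf.pdf", page, kind)))
```

Building the command as a list rather than a single string avoids shell quoting problems when a PDF path contains spaces.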

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf