Matthias Bisping 3113d5cb5d Refactoring
Squashed commit of the following:

commit e5832a17356cebd43846c0542ce595bba5a8cdda
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 13 14:08:17 2023 +0100

    reduce pytest parameter combinations

commit a1e6c9e553545ed1fc4c017e67dddaa98fc2a1c9
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:56:16 2023 +0100

    clear color map cache per pytest parameter combination

commit 21a9db25cdb55b967c664f5d129a9ac35aa1da0f
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:38:52 2023 +0100

    Remove obsolete line

commit 90c367cc325dd3a4d3b8f7f37e06a79c30207867
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:38:05 2023 +0100

    Refactoring: Move

commit 42d285e35b82ba0f36835eff6ff70c50bd80d20c
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:33:44 2023 +0100

    Refactoring: Move

    Move content generator into its own module

commit ddc92461d7442e08921408707ada6963f555f708
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:29:59 2023 +0100

    Refactoring: Move

    Move remaining segment generation functions into segments module

commit d2cb78d38f47a8c705a82dd725e24c0540a29710
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:27:26 2023 +0100

    Refactoring: Move

    Move zipmap and evert_nth into utils module

commit 9c401a977ce0749463cb2af509f412007f37a084
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:26:01 2023 +0100

    Refactoring: Move

    Move rectangle shrinking logic into new morphing module

commit b77951d4feb1e5dacdb32f0d36a399f6f94b2293
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:22:15 2023 +0100

    Refactoring: Move

    Move segment generation functions into their own module

commit c7b224a98a355f93653a0d576a10fbd2507ed1d8
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:14:54 2023 +0100

    Refactoring: Move

    Move cell class into its own module

commit f0072b0852f34f0448d467fc4993eee3a23a6c5b
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:12:18 2023 +0100

    Refactoring: Move

    Move table generation related code into new table module

commit 9fd87aff8ea69404959056b3d58c7f8856527c83
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 18:07:36 2023 +0100

    Refactoring: Move

    - Move random plot into its own module
    - Move geometric predicates into their own module

commit 6728642a4fc07ec9c47db99efe12981c18f95ee5
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:59:54 2023 +0100

    Refactoring: Move

    Move random helper functions

commit cc86a79ac7bc47e5ddb68e5c95327eebc97041d9
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:33:51 2023 +0100

    Refactoring: Move

    Move text block generator module into text module

commit 160d5b3473d7e4f6f6dbb8fcf51cf554d6b54543
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:29:29 2023 +0100

    Remove unused code

commit 7b2f921472bb47b5c5d7848393ae471664eab583
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:28:17 2023 +0100

    Refactoring: Move

    Move text block generators into their own module

commit e258df899f4be39beec4a0bfc01eaea105218adb
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:24:54 2023 +0100

    Refactoring: Move

    Move text block into its own module

commit cef97b33f920488857c308e6ebcbc5a309de4b20
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:20:30 2023 +0100

    Refactoring: Move

    Move page partitioners into partitioner module

commit a54ccb2fdf44595720718fef44d5d3b1b8cbfe0a
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:15:40 2023 +0100

    Refactoring: Move

    Move text generation functions into their own module

commit 1de938f2faa50cb805d7ebea3075c1d6d969d254
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:07:33 2023 +0100

    Refactoring: Move

    Move font related functions into font module

commit de9b3bad93d91b2d1820b59403fc357e243238e6
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 17:05:47 2023 +0100

    Refactoring: Move

    Move font picker into new font module

commit 9480d58a8a77b3feb7206cb1b7ac5c8a25516b39
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:59:31 2023 +0100

    Refactoring: Move

    Move line formatters into their own module

commit cc0094d3f73b258a0b89353981529e7fa6978b53
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:54:08 2023 +0100

    Refactoring: Move

    Move random content rectangle into its own module

commit 93a52080df8f5aa39b3b29f2c9a8dcbc8d72ad9d
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:52:57 2023 +0100

    Remove unused code

commit 4ec3429dec932cadd828376610950b8ad84a51f4
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:51:03 2023 +0100

    Refactoring: Move

    Move page partitioner into its own module

commit bdcb2f1bef36357ea048c4f00b9dccfa25b13bd9
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:42:55 2023 +0100

    Refactoring: Move

commit 845d1691949dcba049737af29fcee735825ecb8f
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:39:39 2023 +0100

    Refactoring

commit 56c10490b965ccf3ca81aa9ba0403d9068871688
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:36:21 2023 +0100

    Refactoring

commit 740a9cb3c25710a46452fa28dbef011daa03d6ed
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:33:32 2023 +0100

    Refactoring

commit b3cf3e44548c71e7eff90e94ce8ce671a0d8f343
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:29:03 2023 +0100

    Refactoring

    Add fixture for page partitioner

commit 2fb450943e74d0a2a49ca0e20c9507d0230e4373
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:25:50 2023 +0100

    Refactoring: Move

commit fd76933b5ac1fbab1b508ef1f3f199d04189cf81
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:16:16 2023 +0100

    Refactoring: Move

    Move image operations such as blurring into their own module.

commit 809590054315266286c75fb0ef2f81b506aaf20c
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 16:10:48 2023 +0100

    Fix effectless bug

commit d42f053c81105e3144fcc54a7c6e924c777b3665
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 13:22:16 2023 +0100

    Refactoring: Re-order

commit 04a617b9df0ee62e73f87508c8b09c4d3817a6e3
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Wed Feb 1 13:19:25 2023 +0100

    Refactoring

    Move content rectangle base class

cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides functions for the individual tasks. Each function returns a collection of points whose structure depends on the specific task.

Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

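import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...    # path to the PDF file
page_index = ...  # page to analyse; pdf2image counts pages from 1

# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to the detector.
page = np.array(page)

redaction_contours = find_redactions(page)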
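
For downstream use, the detected contours can be reduced to axis-aligned bounding boxes. The following is a minimal sketch and not part of cv-analysis: `contour_to_bbox` is a hypothetical helper, and it assumes each contour is an iterable of (x, y) points. If the contours turn out to be OpenCV-style arrays of shape (N, 1, 2), flatten them with `contour.reshape(-1, 2)` first.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def contour_to_bbox(contour: Iterable[Point]) -> BBox:
    """Return (x_min, y_min, x_max, y_max) for a contour given as (x, y) points.

    Hypothetical helper; the (x, y) point layout is an assumption about
    the data returned by the detection functions.
    """
    points = [(x, y) for x, y in contour]
    if not points:
        raise ValueError("contour contains no points")
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)


# Made-up example contour: the four corners of a rectangle.
example_contour = [(10, 20), (110, 20), (110, 45), (10, 45)]
print(contour_to_bbox(example_contour))  # prints (10, 20, 110, 45)
```

Applied to real results, this would look like `[contour_to_bbox(c) for c in redaction_contours]`, provided the point-layout assumption holds.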

As a CLI Tool

Core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The image below shows a parsed table in which each table cell has been detected individually.

Table Parsing Demonstration

Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The image below shows the detected redactions with green outlines.

Redaction Detection Demonstration

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The image below shows the detected layout elements on a page.

Layout Parsing Demonstration

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The image below shows the detected figure on a page.

Figure Detection Demonstration
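
To run several of the CLI commands above in one go, a small wrapper script can help. The following is a sketch under assumptions, not something shipped with the repository: `build_annotate_command` and `annotate_page` are hypothetical names, the script is assumed to be run from the repository root, and only the `--type` values shown in the commands above are used.

```python
import subprocess
import sys
from typing import List

# Values of --type that appear in the CLI commands in this README.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_annotate_command(pdf_path: str, page: int, annotation_type: str) -> List[str]:
    """Build the argument list for one scripts/annotate.py call (hypothetical helper)."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    # sys.executable is used instead of a bare "python" so that the call
    # runs in the same interpreter (e.g. the activated virtual environment).
    return [
        sys.executable,
        "scripts/annotate.py",
        pdf_path,
        str(page),
        "--type",
        annotation_type,
    ]


def annotate_page(pdf_path: str, page: int) -> None:
    """Run every annotation type on one page, stopping at the first failure."""
    for annotation_type in ANNOTATION_TYPES:
        command = build_annotate_command(pdf_path, page, annotation_type)
        subprocess.run(command, check=True)
```

For example, `annotate_page("data/test_pdf.pdf", 7)` would issue the four commands one after another for page 7.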

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf
Description
Analysis container service for visual (CV-based) document parsing