cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
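
After the steps above, a quick way to confirm that the environment is usable is to check that the modules imported in the usage snippet further down can be found. The following is a minimal sketch; `missing_modules` is a hypothetical helper, not part of this repository, and the module list is taken from that snippet rather than from `requirements.txt`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the usage snippet in this README (an assumption
# about what a working installation needs, not a complete dependency list).
required = ["cv_analysis", "pdf2image", "numpy"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Run it inside the activated virtual environment; any name it reports points at an installation step that did not complete.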

Usage

As an API

The module provides functions for the individual tasks. Each function returns a collection of points whose exact form depends on the task.

Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

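import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...    # path to the PDF file
page_index = ...  # page number; pdf2image counts pages from 1

# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to find_redactions.
page = np.array(page)

redaction_contours = find_redactions(page)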
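
A common next step is to reduce each outline to an axis-aligned bounding box. The following is a minimal sketch under a stated assumption: each contour is an iterable of `(x, y)` points, optionally wrapped one level deep as in OpenCV-style contours. The README only states that the functions return a collection of points, so check the actual return type of `find_redactions` first. `contour_to_bbox` is a hypothetical helper, not part of `cv_analysis`.

```python
def contour_to_bbox(contour):
    """Return (x_min, y_min, x_max, y_max) for one contour.

    Hypothetical helper: assumes each point is (x, y), or [(x, y)] as in
    OpenCV-style contours of shape (N, 1, 2).
    """
    xs, ys = [], []
    for point in contour:
        # Unwrap the extra nesting level used by OpenCV-style contours.
        if len(point) == 1:
            point = point[0]
        x, y = point
        xs.append(int(x))
        ys.append(int(y))
    if not xs:
        raise ValueError("contour has no points")
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative data only; real input would be the result of find_redactions(page).
example_contours = [
    [(10, 20), (110, 20), (110, 45), (10, 45)],
    [[(5, 5)], [(8, 5)], [(8, 9)], [(5, 9)]],
]
boxes = [contour_to_bbox(c) for c in example_contours]
```

Bounding boxes are convenient for filled rectangles such as redactions; for irregular shapes the full outline carries more information.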

As a CLI Tool

Core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The image below shows a parsed table in which each cell has been detected individually.

Table Parsing Demonstration

Redaction Detection (CLI)

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The image below shows the detected redactions with green outlines.

Redaction Detection Demonstration

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The image below shows the detected layout elements on a page.

Layout Parsing Demonstration

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The image below shows the detected figure on a page.

Figure Detection Demonstration
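
The four CLI invocations in this section differ only in the page number and the value of `--type`, so they are easy to drive from a script. The following is a minimal sketch that builds and runs those commands with `subprocess`. `annotate_command` and `annotate` are hypothetical helpers, and the list of types contains only the four values shown in this README; the script may accept others.

```python
import subprocess
import sys

# The four values of --type shown in this README (possibly not exhaustive).
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def annotate_command(pdf_path, page, annotation_type):
    """Build the argument list for one scripts/annotate.py call."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return [
        sys.executable,
        "scripts/annotate.py",
        str(pdf_path),
        str(page),
        "--type",
        annotation_type,
    ]


def annotate(pdf_path, page, annotation_type):
    """Run the CLI from the repository root; raises if it exits non-zero."""
    subprocess.run(annotate_command(pdf_path, page, annotation_type), check=True)


# Building the command has no side effects; call annotate(...) to execute it.
command = annotate_command("data/test_pdf.pdf", 7, "table")
```

Using `sys.executable` keeps the call inside the active virtual environment, and `check=True` turns a failing run into an exception instead of a silent error.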

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf