Julius Unverfehrt 016abe46de Pull request #23: Add pdf2image module
Merge in RR/cv-analysis from add-pdf2image-module to master

Squashed commit of the following:

commit 13355e2dd006fae9ee05c2d00acbbc8b38fd1e8e
Merge: eaf4627 edbda58
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 13:35:27 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/cv-analysis into add-pdf2image-module

commit eaf462768787642889d496203034d017c4ec959b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 13:26:58 2022 +0200

    update build scripts

commit d429c713f4e5e74afca81c2354e8125bf389b865
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 13:11:07 2022 +0200

    purge target

commit 349b81c5db724bf70d6f31b58ded2b5414216bfe
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 13:07:58 2022 +0200

    Revert "extinguish target"

    This reverts commit d2bd4cefde0648d2487839b0344509b984435273.

commit d2bd4cefde0648d2487839b0344509b984435273
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 12:57:50 2022 +0200

    extinguish target

commit 5f6cc713db31e3e16c8e7f13a59804c86b5d77d7
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 11:58:52 2022 +0200

    refactor

commit 576019378a39b580b816d9eb7957774f1faf48b9
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 11:52:04 2022 +0200

    add test for adjustesd server analysis pipeline logic

commit bdf0121929d6941cbba565055f37df7970925c79
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 11:30:17 2022 +0200

    update analysis pipline logic to use imported pdf2image

commit f7cef98d5e6d7b95517bbd047dd3e958acebb3d8
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 2 11:04:34 2022 +0200

    add pdf2image as git submodule
2022-08-02 13:36:50 +02:00
2022-07-27 10:50:10 +02:00
2022-07-27 10:50:10 +02:00
2022-07-27 10:50:10 +02:00
2022-03-23 14:35:00 +01:00

cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer vision based approaches for detecting and parsing visual features such as tables or previous redactions in documents.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provided functions for the individual tasks that all return some kind of collection of points, depending on the specific task.

Redaction Detection (API)

The below snippet shows hot to find the outlines of previous redactions.

from cv_analysis.redaction_detection import find_redactions
import pdf2image 
import numpy as np


pdf_path = ...
page_index = ...

page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

redaction_contours = find_redactions(page)

As a CLI Tool

Core API functionalities can be used through a CLI.

Table Parsing

The tables parsing utility detects and segments tables into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The below image shows a parsed table, where each table cell has been detected individually.

Table Parsing Demonstration

Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The below image shows the detected redactions with green outlines.

Redaction Detection Demonstration

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The below image shows the detected layout elements on a page.

Layout Parsing Demonstration

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The below image shows the detected figure on a page.

Figure Detection Demonstration

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host --rm cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf
Description
Analysis container service for visual (CV-based) document parsing
Readme 58 MiB
2025-01-16 09:31:10 +01:00
Languages
Python 91.1%
Shell 3%
Makefile 2.4%
Dockerfile 2.3%
Nix 1.2%