lillian locarnini 95cab33f19 Pull request #29 : Evaluate layout detection

Merge in RR/cv-analysis from evaluate_layout_detection to master

Squashed commit of the following:

commit 8ec2f69fc61d1e15bd502b0a2c1f720cbec2b34e
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Aug 23 15:07:21 2022 +0200

    repaired is_not_included() logic (did drop the outer rectangle, not the included)

commit 97be081d1e60989313924ceac0bfb3062229411e
Merge: 2c28fa2 2b5c4f1
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Aug 23 14:28:14 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/cv-analysis into evaluate_layout_detection

commit 2c28fa280b7eff922c715245fffe69702c7e6742
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Aug 23 13:50:17 2022 +0200

    del print statements

commit c60121fc4faebc5de556ec0ab7a3af4f815f7ce1
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Mon Aug 22 10:51:52 2022 +0200

    few changes to connect_rects.py

commit a99719905d58cbe856fa020177abd7e317c1d072
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Thu Aug 18 08:37:12 2022 +0200

    layout parsing improved with connect_rects.py

commit d693688a0f0d63395cfd36645de7b3417f64de30
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Aug 2 09:31:19 2022 +0200

    removed vizlogger instances

2022-08-23 15:09:51 +02:00

cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides one function per task. Each function returns a collection of points whose exact form depends on the task.

Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

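import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...  # path to the PDF file
page_index = ...  # page to analyse; pdf2image counts pages starting at 1

# Render only the requested page, then convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

# Outlines of the previous redactions found on the page.
redaction_contours = find_redactions(page)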
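
Since the functions return collections of points, a common next step is to reduce each outline to an axis-aligned bounding box. The following is a minimal sketch under assumptions: `bounding_box` is a hypothetical helper that is not part of `cv_analysis`, and it assumes an outline can be iterated as `(x, y)` pairs. If the outlines turn out to be OpenCV-style arrays of shape `(N, 1, 2)`, they would need to be reshaped to `(N, 2)` first.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def bounding_box(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) for a non-empty collection of (x, y) points.

    Hypothetical helper for illustration; not part of cv_analysis.
    """
    pts = list(points)
    if not pts:
        raise ValueError("cannot compute the bounding box of an empty outline")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


# Made-up outline of a rectangular redaction, given as its four corners.
outline = [(10, 20), (110, 20), (110, 45), (10, 45)]
print(bounding_box(outline))  # (10, 20, 110, 45)
```

Working with plain `(x, y)` pairs keeps the helper independent of the image library used to produce the outlines.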

As a CLI Tool

The core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The image below shows a parsed table in which each cell has been detected individually.

Redaction Detection (CLI)

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The image below shows the detected redactions outlined in green.

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The image below shows the detected layout elements on a page.
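
Layout detection often yields nested candidate boxes, for example a paragraph box inside a larger table box. One way to post-process such results is to keep the outer box and drop every box that is fully included in another one. The sketch below illustrates that idea only; it is not the repository's implementation. `Rect`, `contains`, and `drop_included` are hypothetical names, and the sketch assumes layout elements are axis-aligned boxes given by their corner coordinates.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned box with (x1, y1) as top-left and (x2, y2) as bottom-right corner."""
    x1: int
    y1: int
    x2: int
    y2: int


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies completely within `outer` (shared borders allowed)."""
    return (
        outer.x1 <= inner.x1
        and outer.y1 <= inner.y1
        and outer.x2 >= inner.x2
        and outer.y2 >= inner.y2
    )


def drop_included(rects: List[Rect]) -> List[Rect]:
    """Keep outer boxes; drop every box that is included in a different box."""
    # Remove exact duplicates first (order preserved) so identical boxes
    # do not eliminate each other.
    unique = list(dict.fromkeys(rects))
    return [
        r for r in unique
        if not any(other != r and contains(other, r) for other in unique)
    ]


boxes = [Rect(0, 0, 10, 10), Rect(2, 2, 5, 5), Rect(20, 20, 30, 30)]
print(drop_included(boxes))
```

In this example the box `(2, 2, 5, 5)` is dropped because it lies inside `(0, 0, 10, 10)`, while the two non-nested boxes are kept.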

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The image below shows the detected figure on a page.
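
All four CLI examples follow the same pattern: `scripts/annotate.py <pdf> <page> --type <type>`. The sketch below shows one way to run several of them from Python. It assumes that only the four `--type` values shown in this README are valid and that it is run from the repository root; `build_command` and `annotate_pages` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from typing import Dict, List

# Annotation types that appear in this README; the script may accept others.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def build_command(pdf_path: str, page: int, annotation_type: str) -> List[str]:
    """Build the argument list for one scripts/annotate.py invocation."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return [
        sys.executable,
        "scripts/annotate.py",
        pdf_path,
        str(page),
        "--type",
        annotation_type,
    ]


def annotate_pages(pdf_path: str, pages: Dict[str, int], dry_run: bool = True) -> List[List[str]]:
    """Build, and unless dry_run is set also run, one command per annotation type."""
    commands = [build_command(pdf_path, page, kind) for kind, page in pages.items()]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if the script exits with an error.
            subprocess.run(command, check=True)
    return commands


# Dry run: print the commands for the page/type pairs used in the examples above.
for command in annotate_pages(
    "data/test_pdf.pdf", {"table": 7, "redaction": 2, "layout": 7, "figure": 3}
):
    print(" ".join(command[1:]))
```

Passing `dry_run=False` executes the commands; the default only builds and returns them, which is convenient for checking the arguments first.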

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf
