Julius Unverfehrt ce9e92876c Pull request #16 : Add table parsing fixtures

Merge in RR/cv-analysis from add_table_parsing_fixtures to master

Squashed commit of the following:

commit cfc89b421b61082c8e92e1971c9d0bf4490fa07e
Merge: a7ecb05 73c66a8
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Jul 11 12:19:01 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/cv-analysis into add_table_parsing_fixtures

commit a7ecb05b7d8327f0c7429180f63a380b61b06bc3
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Jul 11 12:02:07 2022 +0200

    refactor

commit 466f217e5a9ee5c54fd38c6acd28d54fc38ff9bb
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Mon Jul 11 10:24:14 2022 +0200

    deleted unused imports and unused lines of code

commit c58955c8658d0631cdd1c24c8556d399e3fd9990
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Mon Jul 11 10:16:01 2022 +0200

    black reformatted files

commit f8bcb10a00ff7f0da49b80c1609b17997411985a
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Jul 5 15:15:00 2022 +0200

    reformat files

commit 432e8a569fd70bd0745ce0549c2bfd2f2e907763
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Tue Jul 5 15:08:22 2022 +0200

    added better test for generic pages with table WIP as thicker lines create inconsistent results.
    added test for patchy tables which does not work yet

commit 2aac9ebf5c76bd963f8c136fe5dd4c2d7681b469
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Mon Jul 4 16:56:29 2022 +0200

    added new fixtures for table_parsing_test.py

commit 37606cac0301b13e99be2c16d95867477f29e7c4
Author: llocarnini <lillian.locarnini@iqser.com>
Date:   Fri Jul 1 16:02:44 2022 +0200

    added separate file for table parsing fixtures, where fixtures for generic tables were added. WIP tests for generic table fixtures

2022-07-11 12:25:16 +02:00

README.md

cv-analysis — Visual (CV-Based) Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

Installation

git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull

Usage

As an API

The module provides one function per task. Each function returns a collection of points whose exact form depends on the task.
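Because the results are collections of points, a common follow-up step is to reduce each collection to an axis-aligned bounding box. The sketch below shows one way to do that in plain Python. The helper name `bounding_box` and the `(x, y)` tuple layout are assumptions made for illustration; the actual return types of the `cv_analysis` functions may differ (for example, NumPy arrays).

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # assumed (x, y) layout, not confirmed by the repository


def bounding_box(points: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, width, height) for a non-empty collection of points."""
    xs, ys = zip(*points)
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


# Hypothetical outline of one detected region, given as four corner points.
outline = [(10, 20), (110, 20), (110, 60), (10, 60)]
print(bounding_box(outline))  # (10, 20, 100, 40)
```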

Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

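from cv_analysis.redaction_detection import find_redactions
import pdf2image
import numpy as np


pdf_path = ...  # path to the PDF file
page_index = ...  # page to analyse; pdf2image counts pages from 1

# Render only the requested page; convert_from_path returns a list of PIL images.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to find_redactions.
page = np.array(page)

redaction_contours = find_redactions(page)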

As a CLI Tool

The core API functionality is also available through a CLI.

Table Parsing

The table parsing utility detects tables and segments them into individual cells.

python scripts/annotate.py data/test_pdf.pdf 7 --type table

The below image shows a parsed table, where each table cell has been detected individually.
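To make the idea of segmenting a table into cells concrete, the sketch below builds one box per cell from the positions of a table's ruling lines. It is a simplified illustration, not the repository's algorithm: it assumes a complete grid with no merged cells, and the function name `grid_cells` and the `(x0, y0, x1, y1)` box layout are made up for this example.

```python
from itertools import product
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), an assumed layout


def grid_cells(xs: Sequence[int], ys: Sequence[int]) -> List[Box]:
    """Build one box per cell from the ruling-line positions of a complete grid."""
    xs, ys = sorted(set(xs)), sorted(set(ys))
    columns = list(zip(xs, xs[1:]))  # adjacent vertical lines bound a column
    rows = list(zip(ys, ys[1:]))     # adjacent horizontal lines bound a row
    # Emit cells in row-major order: left to right, then top to bottom.
    return [(x0, y0, x1, y1) for (y0, y1), (x0, x1) in product(rows, columns)]


# Three vertical and three horizontal lines give a 2 x 2 table.
for cell in grid_cells(xs=[0, 50, 100], ys=[0, 20, 40]):
    print(cell)
```

Real pages rarely yield such a clean grid. The commit log above notes that thicker lines and patchy tables still produce inconsistent results.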

Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

python scripts/annotate.py data/test_pdf.pdf 2 --type redaction

The below image shows the detected redactions with green outlines.
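As a rough illustration of what detecting filled black rectangles involves, the sketch below finds connected dark regions in a small binary mask and keeps those that completely fill their bounding box. This is a generic connected-component approach written in plain Python for clarity. It is not taken from `cv_analysis`, whose implementation may work differently, and the name `filled_rectangles` and the `min_area` parameter are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), inclusive


def filled_rectangles(mask: List[List[int]], min_area: int = 4) -> List[Box]:
    """Return boxes of 4-connected dark regions that completely fill their bounding box.

    `mask` is a non-empty grid in which 1 marks a dark pixel and 0 a light one.
    """
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Flood-fill the connected component that starts at (r, c).
            stack, pixels = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
            area = (r1 - r0 + 1) * (c1 - c0 + 1)
            # A filled rectangle has exactly as many dark pixels as its box has cells.
            if area >= min_area and len(pixels) == area:
                boxes.append((r0, c0, r1, c1))
    return boxes


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
print(filled_rectangles(mask))  # [(1, 1, 2, 3)]
```

The isolated pixel is rejected by `min_area`, and the L-shaped region is rejected because it does not fill its bounding box.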

Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

python scripts/annotate.py data/test_pdf.pdf 7 --type layout

The below image shows the detected layout elements on a page.

Figure Detection

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

python scripts/annotate.py data/test_pdf.pdf 3 --type figure

The below image shows the detected figure on a page.
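All four commands above share one shape: a PDF path, a page number and a `--type` selector. The sketch below mirrors that interface with `argparse` to show how such arguments are parsed. It is only an illustration inferred from the invocations shown here, not the actual content of `scripts/annotate.py`. The destination name `annotation_type` and the choice to make `--type` required are assumptions, and whether the page number counts from 0 or 1 is not stated in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the same arguments as the commands shown above."""
    parser = argparse.ArgumentParser(description="Annotate one page of a PDF (illustrative sketch).")
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument("page_index", type=int, help="page to analyse")
    parser.add_argument(
        "--type",
        dest="annotation_type",  # hypothetical name; avoids shadowing the built-in `type`
        choices=["table", "redaction", "layout", "figure"],
        required=True,  # assumption: the real script may define a default instead
        help="which detector to run",
    )
    return parser


# Parse the same arguments as the table parsing example.
args = build_parser().parse_args(["data/test_pdf.pdf", "7", "--type", "table"])
print(args.pdf_path, args.page_index, args.annotation_type)
```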

Running as a service

Building

Build base image

bash setup/docker.sh

Build head image

docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""

Usage (service)

Shell 1

docker run --rm --net=host cv-analysis

Shell 2

python scripts/client_mock.py --pdf_path /path/to/a/pdf
