{ "body": "
\n

cv-analysis - Visual (CV-Based) Document Parsing#

\n

parse_pdf()\nThis repository implements computer-vision-based approaches for detecting and parsing visual features, such as tables or\nprevious redactions, in documents.

\n
\n

API#

\n

Input message:

\n
{\n  "targetFilePath": {\n    "pdf": "absolute file path",\n    "vlp_output": "absolute file path"\n  },\n  "responseFilePath": "absolute file path",\n  "operation": "table_image_inference"\n}\n
\n
\n
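As an illustration, a message of this shape could be assembled in Python roughly as follows. This is only a sketch: the helper name `build_request` and the example paths are hypothetical, and how the message is delivered to the service is not covered here.

```python
import json


def build_request(pdf_path, vlp_output_path, response_path,
                  operation="table_image_inference"):
    """Assemble an input message with the fields shown above.

    `build_request` is a hypothetical helper, not part of cv-analysis.
    """
    return {
        "targetFilePath": {
            "pdf": pdf_path,
            "vlp_output": vlp_output_path,
        },
        "responseFilePath": response_path,
        "operation": operation,
    }


# The paths below are placeholders, not real storage locations.
message = build_request(
    "/data/example.pdf",
    "/data/example.vlp.json",
    "/data/example.response.json",
)
print(json.dumps(message, indent=2))
```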

Response is uploaded to the storage as specified in the responseFilePath field. The structure is as follows:

\n
{\n  ...,\n  "data": [\n    {\n      "pageNum": 0,\n      "bbox": {\n        "x1": 55.3407,\n        "y1": 247.0246,\n        "x2": 558.5602,\n        "y2": 598.0585\n      },\n      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",\n      "label": "table",\n      "tableLines": [\n        {\n          "x1": 0,\n          "y1": 16,\n          "x2": 1399,\n          "y2": 16\n        },\n        ...\n      ],\n      "imageInfo": {\n        "height": 693,\n        "width": 1414\n      }\n    },\n    ...\n  ]\n}\n
\n
\n
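Once a response of this shape has been loaded into a Python dict, its entries could be summarized along the following lines. This is a sketch under assumptions: `summarize_tables` is a hypothetical helper, the sample data is copied from the example above with the elided parts left out, and table lines are assumed to be axis-aligned, so a line counts as horizontal when its two y values are equal.

```python
def summarize_tables(response):
    """Return one summary dict per entry labelled "table" in response["data"]."""
    summaries = []
    for entry in response.get("data", []):
        if entry.get("label") != "table":
            continue
        bbox = entry["bbox"]
        lines = entry.get("tableLines", [])
        summaries.append({
            "page": entry["pageNum"],
            "width": bbox["x2"] - bbox["x1"],
            "height": bbox["y2"] - bbox["y1"],
            # Assumption: lines are axis-aligned, so equal y values mean horizontal.
            "horizontal_lines": sum(1 for line in lines if line["y1"] == line["y2"]),
            "vertical_lines": sum(1 for line in lines if line["x1"] == line["x2"]),
        })
    return summaries


# Sample values copied from the response example; elided parts are omitted.
response = {
    "data": [
        {
            "pageNum": 0,
            "bbox": {"x1": 55.3407, "y1": 247.0246, "x2": 558.5602, "y2": 598.0585},
            "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
            "label": "table",
            "tableLines": [{"x1": 0, "y1": 16, "x2": 1399, "y2": 16}],
            "imageInfo": {"height": 693, "width": 1414},
        }
    ]
}

for summary in summarize_tables(response):
    print(summary)
```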
\n
\n

Installation#

\n
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git\ncd cv-analysis\n\npython -m venv env\nsource env/bin/activate\n\npip install -e .\npip install -r requirements.txt\n\ndvc pull\n
\n
\n
\n
\n

Usage#

\n
\n

As an API#

\n

The module provides functions for the individual tasks. Each returns a collection of points whose form depends on\nthe specific task.

\n
\n

Redaction Detection (API)#

\n

The snippet below shows how to find the outlines of previous redactions.

\n
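from cv_analysis.redaction_detection import find_redactions\nimport pdf2image\nimport numpy as np\n\npdf_path = ...\npage_index = ...  # pdf2image page numbers are 1-based\n\n# Render only the requested page and take the single resulting PIL image\npage = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]\n# Convert the PIL image to a NumPy array before passing it to find_redactions\npage = np.array(page)\n\nredaction_contours = find_redactions(page)\n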
\n
\n
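The exact return type of `find_redactions` is not specified in this README. As a sketch, assuming each contour can be flattened into a sequence of `(x, y)` pairs (for an OpenCV-style array, something like `contour.reshape(-1, 2).tolist()`), an axis-aligned bounding box per redaction could be derived as follows. The helper `bounding_box` and the sample contour are hypothetical.

```python
def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) for a sequence of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical contour: the four corners of one redaction rectangle.
contour = [(10, 20), (110, 20), (110, 45), (10, 45)]
print(bounding_box(contour))
```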
\n
\n
\n
\n

As a CLI Tool#

\n

Core API functionalities can be used through a CLI.

\n
\n

Table Parsing#

\n

The table parsing utility detects tables and segments them into individual cells.

\n
python scripts/annotate.py data/test_pdf.pdf 7 --type table\n
\n
\n

The image below shows a parsed table, where each table cell has been detected individually.

\n

[Image: parsed table]

\n
\n
\n

Redaction Detection (CLI)#

\n

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

\n
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction\n
\n
\n

The image below shows the detected redactions with green outlines.

\n

[Image: detected redactions]

\n
\n
\n

Layout Parsing#

\n

The layout parsing utility detects elements such as paragraphs, tables and figures.

\n
python scripts/annotate.py data/test_pdf.pdf 7 --type layout\n
\n
\n

The image below shows the detected layout elements on a page.

\n

[Image: detected layout elements]

\n
\n
\n

Figure Detection#

\n

The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility.

\n
python scripts/annotate.py data/test_pdf.pdf 3 --type figure\n
\n
\n

The image below shows the detected figure on a page.

\n

[Image: detected figure]

\n
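The four invocations above share one shape: a PDF path, a number that appears to select the page, and a `--type` value. When scripting several runs, the command line could be built in Python as sketched below. The helper `annotate_command` is hypothetical, and the list of types contains only the values shown in this README.

```python
import shlex

# The --type values that appear in the examples above.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def annotate_command(pdf_path, page, annotation_type):
    """Build the argument list for one scripts/annotate.py invocation."""
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return [
        "python", "scripts/annotate.py",
        pdf_path, str(page),
        "--type", annotation_type,
    ]


# Print one command per annotation type for the same page.
for annotation_type in ANNOTATION_TYPES:
    print(shlex.join(annotate_command("data/test_pdf.pdf", 7, annotation_type)))
```

Each returned list can be passed to `subprocess.run` from the repository root.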
\n
\n
\n

Running as a service#

\n
\n

Building#

\n

Build base image

\n
bash setup/docker.sh\n
\n
\n

Build head image

\n
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""\n
\n
\n
\n
\n

Usage (service)#

\n

Shell 1

\n
docker run --rm --net=host cv-analysis\n
\n
\n

Shell 2

\n
python scripts/client_mock.py --pdf_path /path/to/a/pdf\n
\n
\n
\n
\n
\n", "title": "cv-analysis - Visual (CV-Based) Document Parsing", "sourcename": "README.md.txt", "current_page_name": "README", "toc": "\n", "page_source_suffix": ".md" }