# cv-analysis - Visual (CV-Based) Document Parsing

parse_pdf()

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

## API

Input message:

```json
{
  "targetFilePath": {
    "pdf": "absolute file path",
    "vlp_output": "absolute file path"
  },
  "responseFilePath": "absolute file path",
  "operation": "table_image_inference"
}
```

The response is uploaded to storage at the location given in the `responseFilePath` field. It has the following structure:

```json
{
  ...,
  "data": [
    {
      "pageNum": 0,
      "bbox": {
        "x1": 55.3407,
        "y1": 247.0246,
        "x2": 558.5602,
        "y2": 598.0585
      },
      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
      "label": "table",
      "tableLines": [
        {
          "x1": 0,
          "y1": 16,
          "x2": 1399,
          "y2": 16
        },
        ...
      ],
      "imageInfo": {
        "height": 693,
        "width": 1414
      }
    },
    ...
  ]
}
```

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git
cd cv-analysis

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides functions for the individual tasks. Each returns a collection of points whose form depends on the task.

#### Redaction Detection (API)

The snippet below shows how to find the outlines of previous redactions.

```python
import numpy as np
import pdf2image

from cv_analysis.redaction_detection import find_redactions

pdf_path = ...
page_index = ...  # pdf2image counts pages from 1

# Render only the requested page and convert the PIL image to a NumPy array.
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
page = np.array(page)

redaction_contours = find_redactions(page)
```

## As a CLI Tool

Core API functionality can be used through a CLI.

### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table in which each cell has been detected individually.

![Table Parsing Demonstration](data/table_parsing.png)

### Redaction Detection (CLI)

The redaction detection utility detects previous redactions in PDFs (filled black rectangles).

```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```

The image below shows the detected redactions with green outlines.

![Redaction Detection Demonstration](data/redaction_detection.png)

### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the detected layout elements on a page.

![Layout Parsing Demonstration](data/layout_parsing.png)

### Figure Detection

The figure detection utility detects figures specifically, which the generic layout parsing utility can miss.

```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```

The image below shows the detected figure on a page.

![Figure Detection Demonstration](data/figure_detection.png)

## Running as a service

### Building

Build the base image:

```bash
bash setup/docker.sh
```

Build the head image:

```bash
docker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""
```

### Usage (service)

Shell 1:

```bash
docker run --rm --net=host cv-analysis
```

Shell 2:

```bash
python scripts/client_mock.py --pdf_path /path/to/a/pdf
```
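
Once a response has been retrieved from storage, it can be handy to group its entries by page. The following is a minimal sketch, assuming the response has already been decoded into a Python dict with the `data` list described in the API section. The helper name `tables_by_page` and the sample values are hypothetical and not part of cv-analysis.

```python
from collections import defaultdict


def tables_by_page(response: dict) -> dict:
    """Group entries labelled 'table' by their page number (hypothetical helper)."""
    grouped = defaultdict(list)
    for entry in response.get("data", []):
        if entry.get("label") == "table":
            grouped[entry["pageNum"]].append(entry)
    return dict(grouped)


# Hypothetical response, trimmed to the fields documented in the API section.
sample_response = {
    "data": [
        {
            "pageNum": 0,
            "bbox": {"x1": 55.3407, "y1": 247.0246, "x2": 558.5602, "y2": 598.0585},
            "label": "table",
            "tableLines": [{"x1": 0, "y1": 16, "x2": 1399, "y2": 16}],
            "imageInfo": {"height": 693, "width": 1414},
        }
    ]
}

for page_num, tables in tables_by_page(sample_response).items():
    for table in tables:
        box = table["bbox"]
        width = box["x2"] - box["x1"]
        height = box["y2"] - box["y1"]
        print(
            f"page {page_num}: table {width:.1f} x {height:.1f}, "
            f"{len(table['tableLines'])} line(s)"
        )
```

Filtering on `label` keeps the helper usable if a response also carries other element types; entries without a `data` key simply yield an empty result.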