# Vidocp — Visual Document Parsing

This repository implements computer-vision-based approaches for detecting and parsing visual features in documents, such as tables or previous redactions.

## Installation

```bash
git clone ssh://git@git.iqser.com:2222/rr/vidocp.git
cd vidocp

python -m venv env
source env/bin/activate

pip install -e .
pip install -r requirements.txt

dvc pull
```

## Usage

### As an API

The module provides functions for the individual tasks. Each function returns some kind of collection of points, depending on the specific task.

#### Redaction Detection

The snippet below shows how to find the outlines of previous redactions.

```python
import numpy as np
import pdf2image

from vidocp.redaction_detection import find_redactions

pdf_path = ...
page_index = ...  # pdf2image page numbers are 1-based

# convert_from_path returns a list of PIL images; take the single requested page
page = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]
# Convert the PIL image to a NumPy array before passing it to the detector
page = np.array(page)

redaction_contours = find_redactions(page)
```

### As a CLI Tool

Core API functionality can be used through a CLI.

#### Table Parsing

The table parsing utility detects tables and segments them into individual cells.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type table
```

The image below shows a parsed table, where each table cell has been detected individually.

![](data/table_parsing.png)

#### Redaction Detection

The redaction detection utility detects previous redactions (filled black rectangles) in PDFs.

```bash
python scripts/annotate.py data/test_pdf.pdf 2 --type redaction
```

The image below shows the detected redactions with green outlines.

![](data/redaction_detection.png)

#### Layout Parsing

The layout parsing utility detects elements such as paragraphs, tables and figures.

```bash
python scripts/annotate.py data/test_pdf.pdf 7 --type layout
```

The image below shows the detected layout elements on a page.

![](data/layout_parsing.png)

#### Figure Detection

The figure detection utility detects figures specifically, which the generic layout parsing utility can miss.

```bash
python scripts/annotate.py data/test_pdf.pdf 3 --type figure
```

The image below shows the detected figure on a page.

![](data/figure_detection.png)
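
### Post-processing Results

The API functions return collections of points, so callers will often want to reduce each result to something simpler, such as an axis-aligned bounding box. The following is a minimal sketch of that step, not part of vidocp itself. It assumes each contour is a flat sequence of `(x, y)` pairs; the actual return format is not specified here, and OpenCV-style arrays of shape `(N, 1, 2)` would need a `reshape(-1, 2)` first. The names `bounding_box`, `bounding_boxes` and `example_contours` are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_box(contour: Iterable[Point]) -> Box:
    """Return the axis-aligned bounding box of a single contour."""
    points = list(contour)
    if not points:
        raise ValueError("contour has no points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def bounding_boxes(contours: Iterable[Iterable[Point]]) -> List[Box]:
    """Return one bounding box per contour, in input order."""
    return [bounding_box(contour) for contour in contours]


# Hypothetical stand-in for detector output: two rectangular outlines.
# Real results would come from a call such as find_redactions(page).
example_contours = [
    [(10, 20), (110, 20), (110, 40), (10, 40)],
    [(5, 100), (50, 100), (50, 130), (5, 130)],
]

boxes = bounding_boxes(example_contours)
```

Each box can then be used to crop the page array or to compare detections across pages, independently of how many points the original outline contained.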