{
"body": " parse_pdf()\nThis repository implements computer vision based approaches for detecting and parsing visual features such as tables or\nprevious redactions in documents. Input message: Response is uploaded to the storage as specified in the The module provided functions for the individual tasks that all return some kind of collection of points, depending on\nthe specific task. The below snippet shows hot to find the outlines of previous redactions. Core API functionalities can be used through a CLI. The tables parsing utility detects and segments tables into individual cells. The below image shows a parsed table, where each table cell has been detected individually. The redaction detection utility detects previous redactions in PDFs (filled black rectangles). The below image shows the detected redactions with green outlines. The layout parsing utility detects elements such as paragraphs, tables and figures. The below image shows the detected layout elements on a page. The figure detection utility detects figures specifically, which can be missed by the generic layout parsing utility. The below image shows the detected figure on a page. Build base image Build head image Shell 1 Shell 2cv-analysis - Visual (CV-Based) Document Parsing#
\nAPI#
\n{\n "targetFilePath": {\n "pdf": "absolute file path",\n "vlp_output": "absolute file path"\n },\n "responseFilePath": "absolute file path",\n "operation": "table_image_inference"\n}\n
responseFilePath field. The structure is as follows:{\n ...,\n "data": [\n {\n 'pageNum': 0,\n 'bbox': {\n 'x1': 55.3407,\n 'y1': 247.0246,\n 'x2': 558.5602,\n 'y2': 598.0585\n },\n 'uuid': '2b10c1a2-393c-4fca-b9e3-0ad5b774ac84',\n 'label': 'table',\n 'tableLines': [\n {\n 'x1': 0,\n 'y1': 16,\n 'x2': 1399,\n 'y2': 16\n },\n ...\n ],\n 'imageInfo': {\n 'height': 693,\n 'width': 1414\n }\n },\n ...\n ]\n}\n
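The request and response shapes above can be handled with a few lines of Python. The following is only a sketch: `build_request` and `tables_on_page` are hypothetical helpers (not part of cv-analysis), the file paths are placeholders, and it assumes the response is stored as valid JSON with the field names shown above.

```python
import json


def build_request(pdf_path, vlp_output_path, response_path,
                  operation="table_image_inference"):
    """Assemble a request message with the fields shown above.

    Hypothetical helper for illustration; not part of cv-analysis.
    """
    return {
        "targetFilePath": {
            "pdf": pdf_path,
            "vlp_output": vlp_output_path,
        },
        "responseFilePath": response_path,
        "operation": operation,
    }


def tables_on_page(response, page_num):
    """Return all entries labelled 'table' for the given page number."""
    return [
        entry
        for entry in response.get("data", [])
        if entry.get("label") == "table" and entry.get("pageNum") == page_num
    ]


# Placeholder paths; the message format expects absolute file paths.
request = build_request("/data/in/doc.pdf", "/data/in/doc.vlp.json", "/data/out/doc.json")
print(json.dumps(request, indent=2))

# Abbreviated sample response, assumed to be valid JSON.
sample_response = json.loads("""
{
  "data": [
    {
      "pageNum": 0,
      "bbox": {"x1": 55.3407, "y1": 247.0246, "x2": 558.5602, "y2": 598.0585},
      "uuid": "2b10c1a2-393c-4fca-b9e3-0ad5b774ac84",
      "label": "table",
      "tableLines": [{"x1": 0, "y1": 16, "x2": 1399, "y2": 16}],
      "imageInfo": {"height": 693, "width": 1414}
    }
  ]
}
""")

tables = tables_on_page(sample_response, 0)
for table in tables:
    bbox = table["bbox"]
    width = bbox["x2"] - bbox["x1"]
    height = bbox["y2"] - bbox["y1"]
    print(table["uuid"], f"{width:.1f} x {height:.1f}", len(table["tableLines"]), "line(s)")
```

Filtering on `label` keeps the sketch usable if the response also contains entries of other kinds, which the elided parts of the structure leave open.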
Installation#
\ngit clone ssh://git@git.iqser.com:2222/rr/cv-analysis.git\ncd cv-analysis\n\npython -m venv env\nsource env/bin/activate\n\npip install -e .\npip install -r requirements.txt\n\ndvc pull\n
Usage#
\nAs an API#
\nRedaction Detection (API)#
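\nfrom cv_analysis.redaction_detection import find_redactions\nimport pdf2image\nimport numpy as np\n\npdf_path = ...  # path to the PDF file\npage_index = ...  # page number; pdf2image counts pages from 1\n\n# Render the selected page and convert the PIL image to a NumPy array.\npage = pdf2image.convert_from_path(pdf_path, first_page=page_index, last_page=page_index)[0]\npage = np.array(page)\n\n# Find the outlines of previous redactions on the page.\nredaction_contours = find_redactions(page)\n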
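Outlines are often easier to work with as axis-aligned bounding boxes. The sketch below reduces each contour to one box. It assumes a contour is a sequence of points, each either `(x, y)` or OpenCV-style `[[x, y]]`; the actual return type of `find_redactions` is not documented here and may differ. `bounding_box` is a hypothetical helper, and the contours are stand-in data.

```python
def bounding_box(contour):
    """Return (x_min, y_min, x_max, y_max) for a contour.

    Assumption: the contour is a sequence of points, each either (x, y)
    or OpenCV-style [[x, y]]. Hypothetical helper, not part of cv-analysis.
    """
    points = []
    for point in contour:
        # Unwrap OpenCV-style single-element nesting: [[x, y]] -> [x, y].
        if len(point) == 1:
            point = point[0]
        x, y = point
        points.append((x, y))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Stand-in data; in practice this would be the output of find_redactions(page).
example_contours = [
    [(10, 20), (110, 20), (110, 40), (10, 40)],
    [[[200, 300]], [[260, 300]], [[260, 315]], [[200, 315]]],
]

boxes = [bounding_box(contour) for contour in example_contours]
for box in boxes:
    print(box)
```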
As a CLI Tool#
\nTable Parsing#
\npython scripts/annotate.py data/test_pdf.pdf 7 --type table\n

Redaction Detection (CLI)#
\npython scripts/annotate.py data/test_pdf.pdf 2 --type redaction\n

Layout Parsing#
\npython scripts/annotate.py data/test_pdf.pdf 7 --type layout\n

Figure Detection#
\npython scripts/annotate.py data/test_pdf.pdf 3 --type figure\n
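The four invocations above share one pattern: PDF path, page number, then `--type`. As a sketch, the commands can be generated from Python, for example to annotate the same page with every utility. `annotate_command` is a hypothetical helper; it uses only the arguments shown above and assumes it is run from the repository root.

```python
import shlex

# The annotation types used in the CLI examples above.
ANNOTATION_TYPES = ("table", "redaction", "layout", "figure")


def annotate_command(pdf_path, page, annotation_type):
    """Build the argument list for scripts/annotate.py as invoked above.

    Hypothetical helper for illustration; not part of cv-analysis.
    """
    if annotation_type not in ANNOTATION_TYPES:
        raise ValueError(f"unknown annotation type: {annotation_type!r}")
    return ["python", "scripts/annotate.py", pdf_path, str(page), "--type", annotation_type]


commands = [annotate_command("data/test_pdf.pdf", 7, t) for t in ANNOTATION_TYPES]
for command in commands:
    # Print the shell-quoted command line.
    print(shlex.join(command))
    # To actually execute it from the repository root:
    # import subprocess; subprocess.run(command, check=True)
```

Building the arguments as a list, not a single string, avoids quoting problems when a PDF path contains spaces.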

Running as a service#
\nBuilding#
\nbash setup/docker.sh\ndocker build -f Dockerfile -t cv-analysis . --build-arg BASE_ROOT=""\n
Usage (service)#
\ndocker run --rm --net=host cv-analysis\n
python scripts/client_mock.py --pdf_path /path/to/a/pdf\n