Pull request #9: Tdd refactoring

Merge in RR/image-prediction from tdd_refactoring to master

Squashed commit of the following:

commit f6c64430007590f5d2b234a7f784e26025d06484
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 12:18:47 2022 +0200

    renaming

commit 8f40b51282191edf3e2a5edcd6d6acb388ada453
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 12:07:18 2022 +0200

    adjusted expected output for alpha channel in response

commit 7e666302d5eadb1e84b70cae27e8ec6108d7a135
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:52:51 2022 +0200

    added alpha channel check result to response

commit a6b9f64b51cd888fc0c427a38bd43ae2ae2cb051
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:27:57 2022 +0200

    readme updated

commit 0d06ad657e3c21dcef361c53df37b05aba64528b
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:19:35 2022 +0200

    readme updated and config

commit 75748a1d82f0ebdf3ad7d348c6d820c8858aa3cb
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:19:26 2022 +0200

    refactoring

commit 60101337828d11f5ee5fed0d8c4ec80cde536d8a
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:18:23 2022 +0200

    multiple routes for prediction

commit c8476cb5f55e470b831ae4557a031a2c1294eb86
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:17:49 2022 +0200

    add banner.txt to container

commit 26ef5fce8a9bc015f1c35f32d40e8bea50a96454
Author: Matthias Bisping <Matthias.Bisping@iqser.com>
Date:   Mon Apr 25 10:08:49 2022 +0200

    Pull request #8: Pipeline refactoring

    Merge in RR/image-prediction from pipeline_refactoring to tdd_refactoring

    Squashed commit of the following:

    commit 6989fcb3313007b7eecf4bba39077fcde6924a9a
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 09:49:49 2022 +0200

        removed obsolete module

    commit 7428aeee37b11c31cffa597c85b018ba71e79a1d
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 09:45:45 2022 +0200

        refactoring

    commit 0dcd3894154fdf34bd3ba4ef816362434474f472
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 08:57:21 2022 +0200

        refactoring; removed obsolete extractor-classifier

    commit 1078aa81144f4219149b3fcacdae8b09c4b905c0
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:18:10 2022 +0200

        removed obsolete imports

    commit 71f61fc5fc915da3941cf5ed5d9cc90fccc49031
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:16:25 2022 +0200

        comment changed

    commit b582726cd1de233edb55c5a76c91e99f9dd3bd13
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:12:11 2022 +0200

        refactoring

    commit 8abc9010048078868b235d6793ac6c8b20abb985
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:25:47 2022 +0200

        formatting

    commit 2c87c419fe3185a25c27139e7fcf79f60971ad24
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:24:05 2022 +0200

        formatting

    commit 50b161192db43a84464125c6d79650225e1010d6
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:20:18 2022 +0200

        refactoring

    commit 9a1446cccfa070852a5d9c0bdbc36037b82541fc
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:04:57 2022 +0200

        refactoring

    commit 6c10b55ff8e61412cb2fe5a5625e660ecaf1d7d1
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 19:48:05 2022 +0200

        refactoring

    commit 72e785e3e31c132ab352119e9921725f91fac9e2
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 19:43:39 2022 +0200

        refactoring

    commit f036ee55e6747daf31e3929bdc2d93dc5f2a56ca
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Wed Apr 20 18:30:41 2022 +0200

        refactoring pipeline WIP

commit 120721f5f1a7e910c0c2ebc79dc87c2908794c80
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:39:58 2022 +0200

    rm debug ls

commit 81226d4f8599af0db0e9718fbb1789cfad91a855
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:28:27 2022 +0200

    no compose down

commit 943f7799d49b6a6b0fed985a76ed4fe725dfaeef
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:22:17 2022 +0200

    coverage combine

commit d4cd96607157ea414db417cfd7133f56cb56afe1
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 14:43:09 2022 +0200

    model builder path in mlruns adjusted

commit 5b90bb47c3421feb6123c179eb68d1125d58ff1e
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:56:58 2022 +0200

    dvc pull in test running script

commit a935cacf2305a4a78a15ff571f368962f4538369
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:50:36 2022 +0200

    no clean working dir

commit ba09df7884485b8ab8efbf42a8058de9af60c75c
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:43:22 2022 +0200

    debug ls

commit 71263a9983dbfe2060ef5b74de7cc2cbbad43416
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 09:11:03 2022 +0200

    debug ls

commit 41fbadc331e65e4ffe6d053e2d925e5e0543d8b7
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 20:08:08 2022 +0200

    debug echo

commit bb19698d640b3a99ea404e5b4b06d719a9bfe9e9
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 20:01:59 2022 +0200

    skip server predict test

commit 5094015a87fc0976c9d3ff5d1f4c6fdbd96b7eae
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 19:05:50 2022 +0200

    sonar stage after build stage

... and 253 more commits
Matthias Bisping, 2022-04-25 12:25:41 +02:00
parent eb18ae8719
commit ddd8d4685e
144 changed files with 3735 additions and 459 deletions


@ -1,6 +1,9 @@
 # .coveragerc to control coverage.py
 [run]
 branch = True
+parallel = True
+command_line = -m pytest
+concurrency = multiprocessing
 omit =
     */site-packages/*
     */distutils/*
@ -11,9 +14,11 @@ omit =
     */env/*
     */build_venv/*
     */build_env/*
+    */utils/banner.py
+    */utils/logger.py
+    */src/*
 source =
     image_prediction
+    src
 relative_files = True
 data_file = .coverage
@ -44,6 +49,10 @@ omit =
     */env/*
     */build_venv/*
     */build_env/*
+    */utils/banner.py
+    */utils/logger.py
+    */src/*
+    */pdf_annotation.py
 ignore_errors = True
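The three `[run]` options added above are what make coverage work with the service's sub-process workaround: with `parallel = True` and `concurrency = multiprocessing`, every process writes its own `.coverage.*` data file, and these have to be merged before a report can be produced. A minimal sketch of the resulting workflow, mirroring the CMD of `Dockerfile_tests` below:

```bash
coverage run -m pytest test/ --tb=native -q   # each process writes its own .coverage.* file
coverage combine                              # merge them into a single .coverage file
coverage report -m
```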


@ -1,5 +1,6 @@
 [core]
     remote = vector
+    autostage = true
 ['remote "vector"']
-    url = ssh://vector.iqser.com/research/image_service/
+    url = ssh://vector.iqser.com/research/image-prediction/
     port = 22

.gitignore

@ -32,6 +32,9 @@
 **/classpath-data.json
 **/dependencies-and-licenses-overview.txt
+.coverage
+.coverage\.*\.*
 *__pycache__
 *.egg-info*
@ -44,7 +47,6 @@
 *misc
 /coverage_html_report/
-.coverage
 # Created by https://www.toptal.com/developers/gitignore/api/linux,pycharm
 # Edit at https://www.toptal.com/developers/gitignore?templates=linux,pycharm
@ -171,5 +173,3 @@ fabric.properties
 .idea/codestream.xml
 # End of https://www.toptal.com/developers/gitignore/api/linux,pycharm
-/image_prediction/data/mlruns/
-/data/mlruns/

.gitmodules

@ -1,3 +0,0 @@
[submodule "incl/redai_image"]
path = incl/redai_image
url = ssh://git@git.iqser.com:2222/rr/redai_image.git


@ -1,23 +1,19 @@
-ARG BASE_ROOT="nexus.iqser.com:5001/red/"
-ARG VERSION_TAG="latest"
-FROM ${BASE_ROOT}image-prediction-base:${VERSION_TAG}
+FROM image-prediction-base
 WORKDIR /app/service
 COPY src src
 COPY data data
 COPY image_prediction image_prediction
-COPY incl/redai_image/redai incl/redai_image/redai
 COPY setup.py setup.py
 COPY requirements.txt requirements.txt
 COPY config.yaml config.yaml
+COPY banner.txt banner.txt
 # Install dependencies differing from base image.
 RUN python3 -m pip install -r requirements.txt
 RUN python3 -m pip install -e .
-RUN python3 -m pip install -e incl/redai_image/redai
 EXPOSE 5000
 EXPOSE 8080

Dockerfile_tests (new file)

@ -0,0 +1,23 @@
ARG BASE_ROOT="nexus.iqser.com:5001/red/"
ARG VERSION_TAG="dev"
FROM ${BASE_ROOT}image-prediction:${VERSION_TAG}
WORKDIR /app/service
COPY src src
COPY data data
COPY image_prediction image_prediction
COPY setup.py setup.py
COPY requirements.txt requirements.txt
COPY config.yaml config.yaml
# Install module & dependencies
RUN python3 -m pip install -e .
RUN python3 -m pip install -r requirements.txt
RUN apt update --yes
RUN apt install vim --yes
RUN apt install poppler-utils --yes
CMD coverage run -m pytest test/ --tb=native -q -s -vvv -x && coverage combine && coverage report -m && coverage xml

README.md

@ -1,25 +1,140 @@
-### Building
+### Setup
 Build base image
 ```bash
-setup/docker.sh
-```
-Build head image
-```bash
-docker build -f Dockerfile -t image-prediction . --build-arg BASE_ROOT=""
+docker build -f Dockerfile_base -t image-prediction-base .
+docker build -f Dockerfile -t image-prediction .
 ```
 ### Usage
+#### Without Docker
+```bash
+py scripts/run_pipeline.py /path/to/a/pdf
+```
+#### With Docker
 Shell 1
 ```bash
-docker run --rm --net=host --rm image-prediction
+docker run --rm --net=host image-prediction
 ```
 Shell 2
 ```bash
-python scripts/pyinfra_mock.py --pdf_path /path/to/a/pdf
+python scripts/pyinfra_mock.py /path/to/a/pdf
 ```
+### Tests
+Run for example this command to execute all tests and get a coverage report:
+```bash
+coverage run -m pytest test --tb=native -q -s -vvv -x && coverage combine && coverage report -m
+```
+After having built the service container as specified above, you can also run tests in a container as follows:
+```bash
+./run_tests.sh
+```
+### Message Body Formats
+#### Request Format
+The request messages need to provide the fields `"dossierId"` and `"fileId"`. A request should look like this:
+```json
+{
+    "dossierId": "<string identifier>",
+    "fileId": "<string identifier>"
+}
+```
+Any additional keys are ignored.
+#### Response Format
+Response bodies contain information about the identified class of the image, the confidence of the classification, the
+position and size of the image as well as the results of additional convenience filters which can be configured through
+environment variables. A response body looks like this:
+```json
+{
+    "dossierId": "debug",
+    "fileId": "13ffa9851740c8d20c4c7d1706d72f2a",
+    "data": [...]
+}
+```
+An image metadata record (entry in `"data"` field of a response body) looks like this:
+```json
+{
+    "classification": {
+        "label": "logo",
+        "probabilities": {
+            "logo": 1.0,
+            "signature": 1.1599173226749333e-17,
+            "other": 2.994595513398207e-23,
+            "formula": 4.352109377281029e-31
+        }
+    },
+    "position": {
+        "x1": 475.95,
+        "x2": 533.4,
+        "y1": 796.47,
+        "y2": 827.62,
+        "pageNumber": 6
+    },
+    "geometry": {
+        "width": 57.44999999999999,
+        "height": 31.149999999999977
+    },
+    "alpha": false,
+    "filters": {
+        "geometry": {
+            "imageSize": {
+                "quotient": 0.05975350599135938,
+                "tooLarge": false,
+                "tooSmall": false
+            },
+            "imageFormat": {
+                "quotient": 1.8443017656500813,
+                "tooTall": false,
+                "tooWide": false
+            }
+        },
+        "probability": {
+            "unconfident": false
+        },
+        "allPassed": true
+    }
+}
+```
+## Configuration
+A configuration file is located under `config.yaml`. All relevant variables can be configured via
+exporting environment variables.
+| __Environment Variable__ | Default | Description |
+|--------------------------|---------|-------------|
+| __LOGGING_LEVEL_ROOT__ | "INFO" | Logging level for log file messages |
+| __VERBOSE__ | *true* | Service prints document processing progress to stdout |
+| __BATCH_SIZE__ | 16 | Number of images in memory simultaneously per service instance |
+| __RUN_ID__ | "fabfb1f192c745369b88cab34471aba7" | The ID of the mlflow run to load the image classifier from |
+| __MIN_REL_IMAGE_SIZE__ | 0.05 | Minimally permissible image size to page size ratio |
+| __MAX_REL_IMAGE_SIZE__ | 0.75 | Maximally permissible image size to page size ratio |
+| __MIN_IMAGE_FORMAT__ | 0.1 | Minimally permissible image width to height ratio |
+| __MAX_IMAGE_FORMAT__ | 10 | Maximally permissible image width to height ratio |
+See also: https://git.iqser.com/projects/RED/repos/helm/browse/redaction/templates/image-service-v2
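As a sketch of how the `$VARIABLE|default` convention in `config.yaml` interacts with this table: any variable can be overridden by exporting it before starting the service, or by passing it into the container with `-e`. The values below are arbitrary examples, not recommendations:

```bash
export BATCH_SIZE=8
export LOGGING_LEVEL_ROOT=DEBUG
py scripts/run_pipeline.py /path/to/a/pdf

# or, containerized:
docker run --rm --net=host -e BATCH_SIZE=8 -e LOGGING_LEVEL_ROOT=DEBUG image-prediction
```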


@ -73,8 +73,8 @@ public class PlanSpec {
 project(),
 SERVICE_NAME, new BambooKey(SERVICE_KEY))
 .description("Docker build for image-prediction.")
-// .variables()
-.stages(new Stage("Build Stage")
+.stages(
+new Stage("Build Stage")
 .jobs(
 new Job("Build Job", new BambooKey("BUILD"))
 .tasks(
@ -84,9 +84,6 @@
 new VcsCheckoutTask()
 .description("Checkout default repository.")
 .checkoutItems(new CheckoutItem().defaultRepository()),
-new VcsCheckoutTask()
-.description("Checkout redai_image research repository.")
-.checkoutItems(new CheckoutItem().repository("RR / redai_image").path("redai_image")),
 new ScriptTask()
 .description("Set config and keys.")
 .inlineBody("mkdir -p ~/.ssh\n" +
@ -102,7 +99,9 @@
 .dockerConfiguration(
 new DockerConfiguration()
 .image("nexus.iqser.com:5001/infra/release_build:4.2.0")
-.volume("/var/run/docker.sock", "/var/run/docker.sock")),
+.volume("/var/run/docker.sock", "/var/run/docker.sock"))),
+new Stage("Sonar Stage")
+.jobs(
 new Job("Sonar Job", new BambooKey("SONAR"))
 .tasks(
 new CleanWorkingDirectoryTask()
@ -111,9 +110,6 @@
 new VcsCheckoutTask()
 .description("Checkout default repository.")
 .checkoutItems(new CheckoutItem().defaultRepository()),
-new VcsCheckoutTask()
-.description("Checkout redai_image repository.")
-.checkoutItems(new CheckoutItem().repository("RR / redai_image").path("redai_image")),
 new ScriptTask()
 .description("Set config and keys.")
 .inlineBody("mkdir -p ~/.ssh\n" +


@ -10,10 +10,11 @@ python3 -m pip install --upgrade pip
 pip install dvc
 pip install 'dvc[ssh]'
+echo "Pulling dvc data"
 dvc pull
 echo "index-url = https://${bamboo_nexus_user}:${bamboo_nexus_password}@nexus.iqser.com/repository/python-combind/simple" >> pip.conf
-docker build -f Dockerfile_base -t nexus.iqser.com:5001/red/$SERVICE_NAME_BASE:${bamboo_version_tag} .
-docker build -f Dockerfile -t nexus.iqser.com:5001/red/$SERVICE_NAME:${bamboo_version_tag} --build-arg VERSION_TAG=${bamboo_version_tag} .
+docker build -f Dockerfile_base -t $SERVICE_NAME_BASE .
+docker build -f Dockerfile -t nexus.iqser.com:5001/red/$SERVICE_NAME:${bamboo_version_tag} .
 echo "${bamboo_nexus_password}" | docker login --username "${bamboo_nexus_user}" --password-stdin nexus.iqser.com:5001
 docker push nexus.iqser.com:5001/red/$SERVICE_NAME:${bamboo_version_tag}


@ -6,11 +6,17 @@ export JAVA_HOME=/usr/bin/sonar-scanner/jre
 python3 -m venv build_venv
 source build_venv/bin/activate
 python3 -m pip install --upgrade pip
+python3 -m pip install dependency-check
+python3 -m pip install coverage
-echo "dev setup for unit test and coverage 💖"
-pip install -e .
-pip install -r requirements.txt
+echo "coverage report generation"
+bash run_tests.sh
+if [ ! -f reports/coverage.xml ]
+then
+exit 1
+fi
 SERVICE_NAME=$1

banner.txt (new file)

@ -0,0 +1,11 @@
+----------------------------------------------------+
| ___ |
| __/_ `. .-"""-. |
|_._ _,-'""`-._ \_,` | \-' / )`-')|
|(,-.`._,'( |\`-/| "") `"` \ ((`"` |
| `-.-' \ )-`( , o o) ___Y , .'7 /| |
| `- \`_`"'- (_,___/...-` (_/_/ |
| |
+----------------------------------------------------+
| Image Classification Service |
+----------------------------------------------------+


@ -1,20 +1,18 @@
 webserver:
     host: $SERVER_HOST|"127.0.0.1"  # webserver address
     port: $SERVER_PORT|5000  # webserver port
-    mode: $SERVER_MODE|production  # webserver mode: {development, production}
 service:
-    logging_level: $LOGGING_LEVEL_ROOT|DEBUG  # Logging level for service logger
-    progressbar: True  # Whether a progress bar over the pages of a document is displayed while processing
-    batch_size: $BATCH_SIZE|32  # Number of images in memory simultaneously
+    logging_level: $LOGGING_LEVEL_ROOT|INFO  # Logging level for service logger
     verbose: $VERBOSE|True  # Service prints document processing progress to stdout
-    run_id: $RUN_ID|fabfb1f192c745369b88cab34471aba7  # The ID of the mlflow run to load the model from
+    batch_size: $BATCH_SIZE|16  # Number of images in memory simultaneously
+    mlflow_run_id: $MLFLOW_RUN_ID|fabfb1f192c745369b88cab34471aba7  # The ID of the mlflow run to load the service_estimator from
-# These variables control filters that are applied to either images, image metadata or model predictions. The filter
-# result values are reported in the service responses. For convenience the response to a request contains a
-# "filters.allPassed" field, which is set to false if any of the filters returned values did not meet its specified
-# required value.
+# These variables control filters that are applied to either images, image metadata or service_estimator predictions.
+# The filter result values are reported in the service responses. For convenience the response to a request contains a
+# "filters.allPassed" field, which is set to false if any of the values returned by the filters did not meet its
+# specified required value.
 filters:
     image_to_page_quotient:  # Image size to page size ratio (ratio of geometric means of areas)

data/.gitignore (new file)

@ -0,0 +1 @@
/mlruns


@ -1,4 +0,0 @@
outs:
- md5: 6d0186c1f25e889d531788f168fa6cf0
size: 16727296
path: base_weights.h5


@ -1,5 +1,5 @@
 outs:
-- md5: d1c708270bab6fcd344d4a8b05d1103d.dir
-  size: 150225383
-  nfiles: 178
+- md5: ad061d607f615afc149643f62dbf37cc.dir
+  size: 166952700
+  nfiles: 179
   path: mlruns

doc/tests.drawio (new file)

@ -0,0 +1 @@
<mxfile host="app.diagrams.net" modified="2022-03-17T15:35:10.371Z" agent="5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36" etag="b-CbBXg6FXQ9T3Px-oLc" version="17.1.1" type="device"><diagram id="tS3WR_Pr6QhNVK3FqSUP" name="Page-1">1ZZRT6QwEMc/DY8mQHdRX93z9JLbmNzGmNxbQ0daLQzpDrL46a/IsCzinneJcd0XaP+dtsN/fkADscg3V06WeokKbBCHahOIb0Ecnydzf22FphPmyXknZM6oTooGYWWegcWQ1cooWI8CCdGSKcdiikUBKY006RzW47B7tONdS5nBRFil0k7VO6NId+rZPBz0azCZ7neOQh7JZR/MwlpLhfWOJC4DsXCI1LXyzQJs613vSzfv+57RbWIOCvqXCZqW9PBref27aZ7xsQ5vTn/cnvAqT9JW/MCwJuNzR8dZU9Nb4bAqFLSrhYG4qLUhWJUybUdrX3uvacqt70W+yeuCI9jsTTja2uDxAcyBXONDeILonWN04hn366EQUR+jd4qQsCa59tl26cEe32CH/sOt+TueoCONGRbS/kQs2YkHIGoYbFkRvuUTqAmFr1zyu2LlUvhLdjG/HtJlQO/VfOq6AyvJPI3z+HAL4wlwpbp/2V0qODxzUTJmLjo4c8nEkxaWFXcLLPzt4ithKI4BQzHBMOc/l8UvAeLrj9/hQTw9NhBnxwDibB+IB+ZvdvZ5/PnucAx6Gds5S4rLPw==</diagram></mxfile>


@ -0,0 +1,35 @@
from typing import List, Union, Tuple
import numpy as np
from PIL.Image import Image
from funcy import rcompose
from image_prediction.estimator.adapter.adapter import EstimatorAdapter
from image_prediction.label_mapper.mapper import LabelMapper
from image_prediction.utils import get_logger
logger = get_logger()
class Classifier:
def __init__(self, estimator_adapter: EstimatorAdapter, label_mapper: LabelMapper):
"""Abstraction layer over different estimator backends (e.g. keras or scikit-learn). For each backend to be used
an EstimatorAdapter must be implemented.
Args:
estimator_adapter: adapter for a given estimator backend
"""
self.__estimator_adapter = estimator_adapter
self.__label_mapper = label_mapper
self.__pipe = rcompose(self.__estimator_adapter, self.__label_mapper)
def predict(self, batch: Union[np.array, Tuple[Image]]) -> List[str]:
if isinstance(batch, np.ndarray) and batch.shape[0] == 0:
return []
return self.__pipe(batch)
def __call__(self, batch: np.array) -> List[str]:
logger.debug("Classifier.predict")
return self.predict(batch)
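For orientation, a minimal sketch of how these pieces compose; the `model` callable is a hypothetical stand-in for a real estimator backend:

```python
import numpy as np

from image_prediction.classifier.classifier import Classifier
from image_prediction.estimator.adapter.adapter import EstimatorAdapter
from image_prediction.label_mapper.mappers.probability import ProbabilityMapper

def model(batch):
    # hypothetical backend: uniform probabilities over four classes
    return np.full((len(batch), 4), 0.25)

classifier = Classifier(EstimatorAdapter(model), ProbabilityMapper(["logo", "signature", "other", "formula"]))
predictions = list(classifier(np.zeros((2, 224, 224, 3))))
# -> two records keyed by ProbabilityMapperKeys.LABEL / .PROBABILITIES;
#    the EnumFormatter later replaces these enum keys with plain strings
```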


@ -0,0 +1,32 @@
from itertools import chain
from typing import Iterable
from PIL.Image import Image
from funcy import rcompose, chunks
from image_prediction.classifier.classifier import Classifier
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
from image_prediction.estimator.preprocessor.preprocessors.identity import IdentityPreprocessor
from image_prediction.utils import get_logger
logger = get_logger()
class ImageClassifier:
"""Combines a classifier with a preprocessing pipeline: Receives images, chunks into batches, converts to tensors,
applies transformations and finally sends to internal classifier.
"""
def __init__(self, classifier: Classifier, preprocessor: Preprocessor = None):
self.estimator = classifier
self.preprocessor = preprocessor if preprocessor else IdentityPreprocessor()
self.pipe = rcompose(self.preprocessor, self.estimator)
def predict(self, images: Iterable[Image], batch_size=16):
batches = chunks(batch_size, images)
predictions = chain.from_iterable(map(self.pipe, batches))
return predictions
def __call__(self, images: Iterable[Image], batch_size=16):
logger.debug("ImageClassifier.predict")
yield from self.predict(images, batch_size=batch_size)


@ -0,0 +1,16 @@
from funcy import rcompose
from image_prediction.transformer.transformer import Transformer
from image_prediction.utils import get_logger
logger = get_logger()
class TransformerCompositor(Transformer):
def __init__(self, formatter: Transformer, *formatters: Transformer):
formatters = (formatter, *formatters)
self.pipe = rcompose(*formatters)
def transform(self, obj):
logger.debug("TransformerCompositor.transform")
return self.pipe(obj)


@ -18,12 +18,12 @@ class DotIndexable:
     def __getattr__(self, item):
         return _get_item_and_maybe_make_dotindexable(self.x, item)
-    def __setitem__(self, key, value):
-        self.x[key] = value
     def __repr__(self):
         return self.x.__repr__()
+    def __getitem__(self, item):
+        return self.__getattr__(item)
 class Config:
     def __init__(self, config_path):


@ -0,0 +1,38 @@
from funcy import juxt
from image_prediction.classifier.classifier import Classifier
from image_prediction.classifier.image_classifier import ImageClassifier
from image_prediction.compositor.compositor import TransformerCompositor
from image_prediction.estimator.adapter.adapter import EstimatorAdapter
from image_prediction.formatter.formatters.camel_case import Snake2CamelCaseKeyFormatter
from image_prediction.formatter.formatters.enum import EnumFormatter
from image_prediction.image_extractor.extractors.parsable import ParsablePDFImageExtractor
from image_prediction.label_mapper.mappers.probability import ProbabilityMapper
from image_prediction.model_loader.loader import ModelLoader
from image_prediction.model_loader.loaders.mlflow import MlflowConnector
from image_prediction.redai_adapter.mlflow import MlflowModelReader
from image_prediction.transformer.transformers.coordinate.pdfnet import PDFNetCoordinateTransformer
from image_prediction.transformer.transformers.response import ResponseTransformer
def get_mlflow_model_loader(mlruns_dir):
model_loader = ModelLoader(MlflowConnector(MlflowModelReader(mlruns_dir)))
return model_loader
def get_image_classifier(model_loader, model_identifier):
model, classes = juxt(model_loader.load_model, model_loader.load_classes)(model_identifier)
return ImageClassifier(Classifier(EstimatorAdapter(model), ProbabilityMapper(classes)))
def get_extractor(**kwargs):
image_extractor = ParsablePDFImageExtractor(**kwargs)
return image_extractor
def get_formatter():
formatter = TransformerCompositor(
PDFNetCoordinateTransformer(), EnumFormatter(), ResponseTransformer(), Snake2CamelCaseKeyFormatter()
)
return formatter


@ -0,0 +1,15 @@
from image_prediction.utils import get_logger
logger = get_logger()
class EstimatorAdapter:
def __init__(self, estimator):
self.estimator = estimator
def predict(self, batch):
return self.estimator(batch)
def __call__(self, batch):
logger.debug("EstimatorAdapter.predict")
return self.predict(batch)


@ -0,0 +1,10 @@
import abc
class Preprocessor(abc.ABC):
@abc.abstractmethod
def preprocess(self, batch):
raise NotImplementedError
def __call__(self, batch):
return self.preprocess(batch)


@ -0,0 +1,10 @@
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
from image_prediction.estimator.preprocessor.utils import images_to_batch_tensor
class BasicPreprocessor(Preprocessor):
"""Converts images to tensors"""
@staticmethod
def preprocess(images):
return images_to_batch_tensor(images)


@ -0,0 +1,10 @@
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
class IdentityPreprocessor(Preprocessor):
@staticmethod
def preprocess(images):
return images
def __call__(self, images):
return self.preprocess(images)


@ -0,0 +1,10 @@
import numpy as np
from PIL.Image import Image
def image_to_normalized_tensor(image: Image) -> np.ndarray:
return np.array(image) / 255
def images_to_batch_tensor(images) -> np.ndarray:
return np.array(list(map(image_to_normalized_tensor, images)))


@ -0,0 +1,34 @@
class UnknownEstimatorAdapter(ValueError):
pass
class UnknownImageExtractor(ValueError):
pass
class UnknownModelLoader(ValueError):
pass
class UnknownDatabaseType(ValueError):
pass
class UnknownLabelFormat(ValueError):
pass
class UnexpectedLabelFormat(ValueError):
pass
class IncorrectInstantiation(RuntimeError):
pass
class IntentionalTestException(RuntimeError):
pass
class InvalidBox(Exception):
pass


@ -0,0 +1,13 @@
from image_prediction.image_extractor.extractors.parsable import ParsablePDFImageExtractor
def extract_images_from_pdf(pdf, extractor=None):
if not extractor:
extractor = ParsablePDFImageExtractor()
try:
images_extracted, metadata_extracted = zip(*extractor(pdf))
return images_extracted, metadata_extracted
except ValueError:
return [], []


@ -1,4 +1,5 @@
 import multiprocessing
+import traceback
 from typing import Callable
 from flask import Flask, request, jsonify
@ -8,8 +9,30 @@ from image_prediction.utils import get_logger
 logger = get_logger()
-def make_prediction_server(predict_fn: Callable):
+def run_in_process(func):
+    p = multiprocessing.Process(target=func)
+    p.start()
+    p.join()
+def wrap_in_process(func_to_wrap):
+    def build_function_and_run_in_process(*args, **kwargs):
+        def func():
+            try:
+                result = func_to_wrap(*args, **kwargs)
+                return_dict["result"] = result
+            except:
+                logger.error(traceback.format_exc())
+        manager = multiprocessing.Manager()
+        return_dict = manager.dict()
+        run_in_process(func)
+        return return_dict.get("result", None)
+    return build_function_and_run_in_process
+def make_prediction_server(predict_fn: Callable):
     app = Flask(__name__)
     @app.route("/ready", methods=["GET"])
@ -24,42 +47,28 @@ def make_prediction_server(predict_fn: Callable):
         resp.status_code = 200
         return resp
+    def __failure():
+        response = jsonify("Analysis failed")
+        response.status_code = 500
+        return response
+    @app.route("/predict", methods=["POST"])
     @app.route("/", methods=["POST"])
     def predict():
-        def predict_fn_wrapper(pdf, return_dict):
-            return_dict["result"] = predict_fn(pdf)
-        def process():
-            # Tensorflow does not free RAM. Workaround is running model in process.
-            # https://stackoverflow.com/questions/39758094/clearing-tensorflow-gpu-memory-after-model-execution
-            pdf = request.data
-            manager = multiprocessing.Manager()
-            return_dict = manager.dict()
-            p = multiprocessing.Process(
-                target=predict_fn_wrapper,
-                args=(
-                    pdf,
-                    return_dict,
-                ),
-            )
-            p.start()
-            p.join()
-            try:
-                return dict(return_dict)["result"]
-            except KeyError:
-                raise
-        logger.debug("Running predictor on document...")
-        try:
-            predictions = process()
+        # Tensorflow does not free RAM. Workaround: Run prediction function (which instantiates a model) in sub-process.
+        # See: https://stackoverflow.com/questions/39758094/clearing-tensorflow-gpu-memory-after-model-execution
+        predict_fn_wrapped = wrap_in_process(predict_fn)
+        logger.info("Analysing...")
+        predictions = predict_fn_wrapped(request.data)
+        if predictions:
             response = jsonify(predictions)
             logger.info("Analysis completed.")
             return response
-        except Exception as err:
+        else:
             logger.error("Analysis failed.")
-            logger.exception(err)
-            response = jsonify("Analysis failed.")
-            response.status_code = 500
-            return response
+            return __failure()
     return app
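A sketch of the new helpers in isolation; the module path in the import is an assumption, since the diff does not show the file name. Running the prediction in a disposable child process lets the OS reclaim the memory TensorFlow holds on to after inference:

```python
from image_prediction.server import wrap_in_process  # module path assumed

def heavy_prediction(pdf: bytes) -> dict:
    # hypothetical stand-in for the real predict_fn: load model, run inference
    return {"data": []}

predict = wrap_in_process(heavy_prediction)
result = predict(b"%PDF-...")  # runs in a child process; returns None if the call raised
```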


@ -0,0 +1,15 @@
import abc
from image_prediction.transformer.transformer import Transformer
class Formatter(Transformer):
@abc.abstractmethod
def format(self, obj):
raise NotImplementedError
def transform(self, obj):
raise NotImplementedError()
def __call__(self, obj):
return self.format(obj)


@ -0,0 +1,11 @@
from image_prediction.formatter.formatters.key_formatter import KeyFormatter
class Snake2CamelCaseKeyFormatter(KeyFormatter):
def format_key(self, key):
if isinstance(key, str):
head, *tail = key.split("_")
return head + "".join(map(str.title, tail))
else:
return key
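Illustrative usage: this formatter is what produces the camelCase keys (e.g. `pageNumber`) seen in the README's response examples, recursing through nested containers via `KeyFormatter`:

```python
from image_prediction.formatter.formatters.camel_case import Snake2CamelCaseKeyFormatter

formatter = Snake2CamelCaseKeyFormatter()
formatter.format({"page_number": 6, "position": {"x1": 475.95}})
# -> {"pageNumber": 6, "position": {"x1": 475.95}}
```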


@ -0,0 +1,23 @@
from enum import Enum
from image_prediction.formatter.formatters.key_formatter import KeyFormatter
class EnumFormatter(KeyFormatter):
def format_key(self, key):
return key.value if isinstance(key, Enum) else key
def transform(self, obj):
raise NotImplementedError
class ReverseEnumFormatter(KeyFormatter):
def __init__(self, enum):
self.enum = enum
self.reverse_enum = {e.value: e for e in enum}
def format_key(self, key):
return self.reverse_enum.get(key, key)
def transform(self, obj):
raise NotImplementedError


@ -0,0 +1,6 @@
from image_prediction.formatter.formatter import Formatter
class IdentityFormatter(Formatter):
def format(self, obj):
return obj


@ -0,0 +1,28 @@
import abc
from typing import Iterable
from image_prediction.formatter.formatter import Formatter
class KeyFormatter(Formatter):
@abc.abstractmethod
def format_key(self, key):
raise NotImplementedError
def __format(self, data):
# If we wanted to do this properly, we would need handlers for all expected types and dispatch based
# on a type comparison. This is too much engineering for the limited use-case of this class though.
if isinstance(data, Iterable) and not isinstance(data, dict) and not isinstance(data, str):
f = map(self.__format, data)
return type(data)(f) if not isinstance(data, map) else f
if not isinstance(data, dict):
return data
keys_formatted = list(map(self.format_key, data))
return dict(zip(keys_formatted, map(self.__format, data.values())))
def format(self, data):
return self.__format(data)


@ -0,0 +1,19 @@
import abc
from collections import namedtuple
from typing import Iterable
from image_prediction.utils import get_logger
ImageMetadataPair = namedtuple("ImageMetadataPair", ["image", "metadata"])
logger = get_logger()
class ImageExtractor(abc.ABC):
@abc.abstractmethod
def extract(self, obj) -> Iterable[ImageMetadataPair]:
raise NotImplementedError
def __call__(self, obj, **kwargs):
logger.debug("ImageExtractor.extract")
return self.extract(obj, **kwargs)


@ -0,0 +1,7 @@
from image_prediction.image_extractor.extractor import ImageExtractor, ImageMetadataPair
class ImageExtractorMock(ImageExtractor):
def extract(self, image_container):
for i, image in enumerate(image_container):
yield ImageMetadataPair(image, {"image_id": i})


@ -0,0 +1,179 @@
import atexit
import io
from functools import partial, lru_cache
from itertools import chain, starmap, filterfalse
from operator import itemgetter
from typing import List
import fitz
from PIL import Image
from funcy import rcompose, merge, pluck, curry, compose
from image_prediction.image_extractor.extractor import ImageExtractor, ImageMetadataPair
from image_prediction.info import Info
from image_prediction.stitching.stitching import stitch_pairs
from image_prediction.stitching.utils import validate_box_coords, validate_box_size
from image_prediction.utils.generic import lift
class ParsablePDFImageExtractor(ImageExtractor):
def __init__(self, verbose=False, tolerance=0):
"""
Args:
verbose: Whether to show progressbar
            tolerance: The distance in pixels between images beyond which they will not be stitched together
"""
self.doc: fitz.fitz.Document = None
self.verbose = verbose
self.tolerance = tolerance
def extract(self, pdf: bytes, page_range: range = None):
self.doc = fitz.Document(stream=pdf)
pages = extract_pages(self.doc, page_range) if page_range else self.doc
image_metadata_pairs = chain.from_iterable(map(self.__process_images_on_page, pages))
yield from image_metadata_pairs
def __process_images_on_page(self, page: fitz.fitz.Page):
images = get_images_on_page(self.doc, page)
metadata = get_metadata_for_images_on_page(self.doc, page)
clear_caches()
image_metadata_pairs = starmap(ImageMetadataPair, filter(all, zip(images, metadata)))
image_metadata_pairs = stitch_pairs(list(image_metadata_pairs), tolerance=self.tolerance)
yield from image_metadata_pairs
def extract_pages(doc, page_range):
page_range = range(page_range.start + 1, page_range.stop + 1)
pages = map(doc.load_page, page_range)
yield from pages
@lru_cache(maxsize=None)
def get_images_on_page(doc, page: fitz.Page):
image_infos = get_image_infos(page)
xrefs = map(itemgetter("xref"), image_infos)
images = map(partial(xref_to_image, doc), xrefs)
yield from images
def get_metadata_for_images_on_page(doc, page: fitz.Page):
metadata = map(get_image_metadata, get_image_infos(page))
metadata = validate_coords_and_passthrough(metadata)
metadata = filter_out_tiny_images(metadata)
metadata = validate_size_and_passthrough(metadata)
metadata = add_page_metadata(page, metadata)
metadata = add_alpha_channel_info(doc, page, metadata)
yield from metadata
@lru_cache(maxsize=None)
def get_image_infos(page: fitz.Page) -> List[dict]:
return page.get_image_info(xrefs=True)
@lru_cache(maxsize=None)
def xref_to_image(doc, xref) -> Image:
maybe_image = load_image_handle_from_xref(doc, xref)
return Image.open(io.BytesIO(maybe_image["image"])) if maybe_image else None
def get_image_metadata(image_info):
x1, y1, x2, y2 = map(rounder, image_info["bbox"])
width = abs(x2 - x1)
height = abs(y2 - y1)
return {
Info.WIDTH: width,
Info.HEIGHT: height,
Info.X1: x1,
Info.X2: x2,
Info.Y1: y1,
Info.Y2: y2,
}
def validate_coords_and_passthrough(metadata):
yield from map(validate_box_coords, metadata)
def filter_out_tiny_images(metadata):
yield from filterfalse(tiny, metadata)
def validate_size_and_passthrough(metadata):
yield from map(validate_box_size, metadata)
def add_page_metadata(page, metadata):
yield from map(partial(merge, get_page_metadata(page)), metadata)
def add_alpha_channel_info(doc, page, metadata):
page_to_xrefs = compose(curry(pluck)("xref"), get_image_infos)
xref_to_alpha = partial(has_alpha_channel, doc)
page_to_alpha_value_per_image = compose(lift(xref_to_alpha), page_to_xrefs)
alpha_to_dict = compose(dict, lambda a: [(Info.ALPHA, a)])
page_to_alpha_mapping_per_image = compose(lift(alpha_to_dict), page_to_alpha_value_per_image)
metadata = starmap(merge, zip(page_to_alpha_mapping_per_image(page), metadata))
yield from metadata
@lru_cache(maxsize=None)
def load_image_handle_from_xref(doc, xref):
return doc.extract_image(xref)
rounder = rcompose(round, int)
def get_page_metadata(page):
page_width, page_height = map(rounder, page.mediabox_size)
return {
Info.PAGE_WIDTH: page_width,
Info.PAGE_HEIGHT: page_height,
Info.PAGE_IDX: page.number,
}
def has_alpha_channel(doc, xref):
maybe_image = load_image_handle_from_xref(doc, xref)
maybe_smask = maybe_image["smask"] if maybe_image else None
if maybe_smask:
return any([doc.extract_image(maybe_smask) is not None, bool(fitz.Pixmap(doc, maybe_smask).alpha)])
else:
return bool(fitz.Pixmap(doc, xref).alpha)
def tiny(metadata):
return metadata[Info.WIDTH] * metadata[Info.HEIGHT] <= 4
def clear_caches():
get_image_infos.cache_clear()
load_image_handle_from_xref.cache_clear()
get_images_on_page.cache_clear()
xref_to_image.cache_clear()
atexit.register(clear_caches)
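A short usage sketch for the extractor; the input path is a placeholder. The metadata dicts are keyed by the `Info` enum defined in the next file:

```python
from image_prediction.image_extractor.extractors.parsable import ParsablePDFImageExtractor
from image_prediction.info import Info

with open("/path/to/a/pdf", "rb") as fh:  # placeholder path
    pdf_bytes = fh.read()

extractor = ParsablePDFImageExtractor(tolerance=2)  # stitch images up to 2 px apart
for image, metadata in extractor(pdf_bytes):
    print(image.size, metadata[Info.PAGE_IDX], metadata[Info.ALPHA])
```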

image_prediction/info.py (new file)

@ -0,0 +1,14 @@
from enum import Enum
class Info(Enum):
PAGE_WIDTH = "page_width"
PAGE_HEIGHT = "page_height"
PAGE_IDX = "page_idx"
WIDTH = "width"
HEIGHT = "height"
X1 = "x1"
X2 = "x2"
Y1 = "y1"
Y2 = "y2"
ALPHA = "alpha"


@ -0,0 +1,10 @@
import abc
class LabelMapper(abc.ABC):
@abc.abstractmethod
def map_labels(self, items):
raise NotImplementedError
def __call__(self, items):
return self.map_labels(items)


@ -0,0 +1,20 @@
from typing import Mapping, Iterable
from image_prediction.exceptions import UnexpectedLabelFormat
from image_prediction.label_mapper.mapper import LabelMapper
class IndexMapper(LabelMapper):
def __init__(self, labels: Mapping[int, str]):
self.__labels = labels
def __validate_index_label_format(self, index_label: int) -> None:
if not 0 <= index_label < len(self.__labels):
raise UnexpectedLabelFormat(f"Received index label '{index_label}' that has no associated string label.")
def __map_label(self, index_label: int) -> str:
self.__validate_index_label_format(index_label)
return self.__labels[index_label]
def map_labels(self, index_labels: Iterable[int]) -> Iterable[str]:
return map(self.__map_label, index_labels)
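Illustrative only (the import path is assumed): a plain list satisfies the `Mapping[int, str]` contract here, since only `len` and integer indexing are used:

```python
from image_prediction.label_mapper.mappers.index import IndexMapper  # path assumed

mapper = IndexMapper(["logo", "signature", "other", "formula"])
list(mapper([0, 2]))  # -> ["logo", "other"]
# an out-of-range index such as 7 would raise UnexpectedLabelFormat
```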


@ -0,0 +1,39 @@
from enum import Enum
from operator import itemgetter
from typing import Mapping, Iterable
import numpy as np
from funcy import rcompose, rpartial
from image_prediction.exceptions import UnexpectedLabelFormat
from image_prediction.label_mapper.mapper import LabelMapper
class ProbabilityMapperKeys(Enum):
LABEL = "label"
PROBABILITIES = "probabilities"
class ProbabilityMapper(LabelMapper):
def __init__(self, labels: Mapping[int, str]):
self.__labels = labels
# String conversion in the middle due to floating point precision issues.
# See: https://stackoverflow.com/questions/56820/round-doesnt-seem-to-be-rounding-properly
self.__rounder = rcompose(rpartial(round, 4), str, float)
def __validate_array_label_format(self, probabilities: np.ndarray) -> None:
if not len(probabilities) == len(self.__labels):
        raise UnexpectedLabelFormat(
            f"Received {len(probabilities)} probabilities for {len(self.__labels)} labels."
        )
def __map_array(self, probabilities: np.ndarray) -> dict:
self.__validate_array_label_format(probabilities)
cls2prob = dict(
sorted(zip(self.__labels, list(map(self.__rounder, probabilities))), key=itemgetter(1), reverse=True)
)
most_likely = [*cls2prob][0]
return {ProbabilityMapperKeys.LABEL: most_likely, ProbabilityMapperKeys.PROBABILITIES: cls2prob}
def map_labels(self, probabilities: Iterable[np.ndarray]) -> Iterable[dict]:
return map(self.__map_array, probabilities)


@ -1,10 +1,17 @@
-from os import path
+"""Defines constant paths relative to the module root path."""
+from pathlib import Path
-MODULE_DIR = path.dirname(path.abspath(__file__))
-PACKAGE_ROOT_DIR = path.dirname(MODULE_DIR)
-CONFIG_FILE = path.join(PACKAGE_ROOT_DIR, "config.yaml")
-DATA_DIR = path.join(PACKAGE_ROOT_DIR, "data")
-MLRUNS_DIR = path.join(DATA_DIR, "mlruns")
-BASE_WEIGHTS = path.join(DATA_DIR, "base_weights.h5")
+MODULE_DIR = Path(__file__).resolve().parents[0]
+PACKAGE_ROOT_DIR = MODULE_DIR.parents[0]
+CONFIG_FILE = PACKAGE_ROOT_DIR / "config.yaml"
+BANNER_FILE = PACKAGE_ROOT_DIR / "banner.txt"
+DATA_DIR = PACKAGE_ROOT_DIR / "data"
+MLRUNS_DIR = str(DATA_DIR / "mlruns")
+TEST_DATA_DIR = PACKAGE_ROOT_DIR / "test" / "data"


@ -0,0 +1,7 @@
import abc
class DatabaseConnector(abc.ABC):
@abc.abstractmethod
def get_object(self, identifier):
raise NotImplementedError


@ -0,0 +1,9 @@
from image_prediction.model_loader.database.connector import DatabaseConnector
class DatabaseConnectorMock(DatabaseConnector):
def __init__(self, store: dict):
self.store = store
def get_object(self, identifier):
return self.store[identifier]


@ -0,0 +1,18 @@
from functools import lru_cache
from image_prediction.model_loader.database.connector import DatabaseConnector
class ModelLoader:
def __init__(self, database_connector: DatabaseConnector):
self.database_connector = database_connector
@lru_cache(maxsize=None)
def __get_object(self, identifier):
return self.database_connector.get_object(identifier)
def load_model(self, identifier):
return self.__get_object(identifier)["model"]
def load_classes(self, identifier):
return self.__get_object(identifier)["classes"]
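Since `__get_object` is wrapped in `lru_cache`, loading the model and the classes for the same identifier hits the connector only once. A sketch using the mock connector from this PR (its import path is assumed):

```python
from image_prediction.model_loader.loader import ModelLoader
from image_prediction.model_loader.database.connectors.mock import DatabaseConnectorMock  # path assumed

store = {"run-1": {"model": "dummy-model", "classes": ["logo", "other"]}}
loader = ModelLoader(DatabaseConnectorMock(store))
loader.load_model("run-1")    # -> "dummy-model"
loader.load_classes("run-1")  # -> ["logo", "other"], served from the cached lookup
```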


@ -0,0 +1,10 @@
from image_prediction.model_loader.database.connector import DatabaseConnector
from image_prediction.redai_adapter.mlflow import MlflowModelReader
class MlflowConnector(DatabaseConnector):
def __init__(self, mlflow_reader: MlflowModelReader):
self.mlflow_reader = mlflow_reader
def get_object(self, run_id):
return self.mlflow_reader[run_id]


@ -0,0 +1,64 @@
import os
from functools import partial
from itertools import chain, tee
from funcy import rcompose, first, compose, second, chunks, identity
from tqdm import tqdm
from image_prediction.config import CONFIG
from image_prediction.default_objects import get_formatter, get_mlflow_model_loader, get_image_classifier, get_extractor
from image_prediction.locations import MLRUNS_DIR
from image_prediction.utils.generic import lift, starlift
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
def load_pipeline(**kwargs):
model_loader = get_mlflow_model_loader(MLRUNS_DIR)
model_identifier = CONFIG.service.mlflow_run_id
pipeline = Pipeline(model_loader, model_identifier, **kwargs)
return pipeline
def parallel(*fs):
return lambda *args: (f(a) for f, a in zip(fs, args))
def star(f):
return lambda x: f(*x)
class Pipeline:
def __init__(self, model_loader, model_identifier, batch_size=16, verbose=True, **kwargs):
self.verbose = verbose
extract = get_extractor(**kwargs)
classifier = get_image_classifier(model_loader, model_identifier)
reformat = get_formatter()
split = compose(star(parallel(*map(lift, (first, second)))), tee)
classify = compose(chain.from_iterable, lift(classifier), partial(chunks, batch_size))
pairwise_apply = compose(star, parallel)
join = compose(starlift(lambda prd, mdt: {"classification": prd, **mdt}), star(zip))
# +>--classify--v
# --extract-->--split--| |--join-->reformat
# +>--identity--^
self.pipe = rcompose(
extract, # ... image-metadata-pairs as a stream
split, # ... into an image stream and a metadata stream
pairwise_apply(classify, identity), # ... apply functions to the streams pairwise
join, # ... the streams by zipping
reformat, # ... the items
)
def __call__(self, pdf: bytes, page_range: range = None):
yield from tqdm(
self.pipe(pdf, page_range=page_range),
desc="Processing images from document",
unit=" images",
disable=not self.verbose,
)
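An end-to-end usage sketch for the pipeline; the module path and the input path are assumptions. Keyword arguments are forwarded to the extractor:

```python
from image_prediction.pipeline import load_pipeline  # module path assumed

with open("/path/to/a/pdf", "rb") as fh:  # placeholder path
    pdf = fh.read()

pipeline = load_pipeline(tolerance=2)  # forwarded to the image extractor
for record in pipeline(pdf):           # lazy stream of formatted prediction records
    print(record["classification"]["label"])
```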


@ -1,122 +0,0 @@
from itertools import chain
from operator import itemgetter
from typing import List, Dict, Iterable
import numpy as np
from image_prediction.config import CONFIG
from image_prediction.locations import MLRUNS_DIR, BASE_WEIGHTS
from image_prediction.utils import temporary_pdf_file, get_logger
from incl.redai_image.redai.redai.backend.model.model_handle import ModelHandle
from incl.redai_image.redai.redai.backend.pdf.image_extraction import extract_and_stitch
from incl.redai_image.redai.redai.utils.mlflow_reader import MlflowModelReader
from incl.redai_image.redai.redai.utils.shared import chunk_iterable
logger = get_logger()
class Predictor:
"""`ModelHandle` wrapper. Forwards to wrapped model handle for prediction and produces structured output that is
interpretable independently of the wrapped model (e.g. with regard to a .classes_ attribute).
"""
def __init__(self, model_handle: ModelHandle = None):
"""Initializes a ServiceEstimator.
Args:
model_handle: ModelHandle object to forward to for prediction. By default, a model handle is loaded from the
mlflow database via CONFIG.service.run_id.
"""
try:
if model_handle is None:
reader = MlflowModelReader(run_id=CONFIG.service.run_id, mlruns_dir=MLRUNS_DIR)
self.model_handle = reader.get_model_handle(BASE_WEIGHTS)
else:
self.model_handle = model_handle
self.classes = self.model_handle.model.classes_
self.classes_readable = np.array(self.model_handle.classes)
self.classes_readable_aligned = self.classes_readable[self.classes[list(range(len(self.classes)))]]
except Exception as e:
logger.info(f"Service estimator initialization failed: {e}")
def __make_predictions_human_readable(self, probs: np.ndarray) -> List[Dict[str, float]]:
"""Translates an n x m matrix of probabilities over classes into an n-element list of mappings from classes to
probabilities.
Args:
probs: probability matrix (items x classes)
Returns:
list of mappings from classes to probabilities.
"""
classes = np.argmax(probs, axis=1)
classes = self.classes[classes]
classes_readable = [self.model_handle.classes[c] for c in classes]
return classes_readable
def predict(self, images: List, probabilities: bool = False, **kwargs):
"""Gathers predictions for list of images. Assigns each image a class and optionally a probability distribution
over all classes.
Args:
images (List[PIL.Image]) : Images to gather predictions for.
probabilities: Whether to return dictionaries of the following form instead of strings:
{
"class": predicted class,
"probabilities": {
"class 1" : class 1 probability,
"class 2" : class 2 probability,
...
}
}
Returns:
By default the return value is a list of classes (meaningful class name strings). Alternatively a list of
dictionaries with an additional probability field for estimated class probabilities per image can be
returned.
"""
X = self.model_handle.prep_images(list(images))
probs_per_item = self.model_handle.model.predict_proba(X, **kwargs).astype(float)
classes = self.__make_predictions_human_readable(probs_per_item)
class2prob_per_item = [dict(zip(self.classes_readable_aligned, probs)) for probs in probs_per_item]
class2prob_per_item = [
dict(sorted(c2p.items(), key=itemgetter(1), reverse=True)) for c2p in class2prob_per_item
]
predictions = [{"class": c, "probabilities": c2p} for c, c2p in zip(classes, class2prob_per_item)]
return predictions if probabilities else classes
def predict_pdf(self, pdf, verbose=False):
with temporary_pdf_file(pdf) as pdf_path:
image_metadata_pairs = self.__extract_image_metadata_pairs(pdf_path, verbose=verbose)
return self.__predict_images(image_metadata_pairs)
def __predict_images(self, image_metadata_pairs: Iterable, batch_size: int = CONFIG.service.batch_size):
def process_chunk(chunk):
images, metadata = zip(*chunk)
predictions = self.predict(images, probabilities=True)
return predictions, metadata
def predict(image_metadata_pair_generator):
chunks = chunk_iterable(image_metadata_pair_generator, n=batch_size)
return map(chain.from_iterable, zip(*map(process_chunk, chunks)))
try:
predictions, metadata = predict(image_metadata_pairs)
return predictions, metadata
except ValueError:
return [], []
@staticmethod
def __extract_image_metadata_pairs(pdf_path: str, **kwargs):
def image_is_large_enough(metadata: dict):
x1, x2, y1, y2 = itemgetter("x1", "x2", "y1", "y2")(metadata)
return abs(x1 - x2) > 2 and abs(y1 - y2) > 2
yield from extract_and_stitch(pdf_path, convert_to_rgb=True, filter_fn=image_is_large_enough, **kwargs)


@ -0,0 +1,45 @@
import tensorflow as tf
from image_prediction.redai_adapter.model_wrapper import ModelWrapper
class EfficientNetWrapper(ModelWrapper):
def __init__(self, classes, base_weights_path=None, weights_path=None):
self.__input_shape = (224, 224, 3)
super().__init__(classes=classes, base_weights_path=base_weights_path, weights_path=weights_path)
@property
def input_shape(self):
return self.__input_shape
def _ModelWrapper__preprocess_tensor(self, tensor):
return tf.keras.applications.efficientnet.preprocess_input(tensor)
def _ModelWrapper__build(self, base_weights=None) -> tf.keras.models.Model:
input_img = tf.keras.layers.Input(shape=self.input_shape)
pretrained = tf.keras.applications.efficientnet.EfficientNetB0(
include_top=False, input_tensor=tf.keras.layers.Input(shape=self.input_shape), weights=base_weights
)
pretrained.trainable = False
for layer in pretrained.layers:
layer.trainable = False
pretrained = pretrained(input_img)
finetuned = tf.keras.layers.Flatten()(pretrained)
finetuned = tf.keras.layers.Dense(512, activation="relu")(finetuned)
finetuned = tf.keras.layers.Dropout(0.2)(finetuned)
finetuned = tf.keras.layers.Dense(128, activation="relu")(finetuned)
finetuned = tf.keras.layers.Dropout(0.2)(finetuned)
finetuned = tf.keras.layers.Dense(32, activation="relu")(finetuned)
finetuned = tf.keras.layers.Dropout(0.2)(finetuned)
finetuned = tf.keras.layers.Dense(len(self.classes), activation="softmax")(finetuned)
model = tf.keras.models.Model(inputs=input_img, outputs=finetuned)
model.compile()
return model


@ -0,0 +1,72 @@
import importlib
import json
import os
from functools import lru_cache
import mlflow
from image_prediction.redai_adapter.model import PredictionModelHandle
class MlflowModelReader:
def __init__(self, mlruns_dir=None):
self.mlruns_dir = mlruns_dir
mlflow.set_tracking_uri(self.mlruns_dir)
@staticmethod
def __correct_artifact_uri(run_artifact_uri, base_path):
_, suffix = run_artifact_uri.split("mlruns/")
return os.path.join(base_path, suffix)
def __get_weights_path(self, run_id, prefix="tt"):
run = self.__get_run(run_id)
artifact_uri = self.__correct_artifact_uri(run.info.to_proto().artifact_uri, self.mlruns_dir)
path = os.path.join(artifact_uri, prefix, "train_dev", "estimator")
base_path = os.path.join(path, "base_weights.h5")
weights_path = os.path.join(path, "weights.h5")
return base_path, weights_path
@lru_cache(maxsize=None)
def __get_run(self, run_id):
return mlflow.get_run(run_id)
def __get_classes(self, run_id, prefix="tt"):
run = self.__get_run(run_id)
classes = json.loads(run.data.params[os.path.join(prefix, "train_dev/estimator/classes")].replace("'", '"'))
return classes
def __get_model_handle(self, run_id):
run = self.__get_run(run_id)
model_handle_builder = load_object(run.data.params["model_handle_builder"].strip())
base_weights_path, weights_path = self.__get_weights_path(run_id)
model_handle = model_handle_builder(
self.__get_classes(run_id), base_weights_path=base_weights_path, weights_path=weights_path
)
return model_handle
def __get_model(self, run_id) -> PredictionModelHandle:
model_handle = self.__get_model_handle(run_id)
model = PredictionModelHandle(model_handle)
return model
def __getitem__(self, run_id):
return {"model": self.__get_model(run_id), "classes": self.__get_classes(run_id)}
def load_object(object_path):
path_fragments = object_path.split(".")
module_path = ".".join(path_fragments[:-1])
object_name = path_fragments[-1]
module = importlib.import_module(module_path)
return getattr(module, object_name)


@ -0,0 +1,19 @@
from funcy import rcompose
from image_prediction.utils import get_logger
logger = get_logger()
class PredictionModelHandle:
"""Simplifies usage of ModelHandle instances for prediction purposes."""
def __init__(self, model_handle):
self.__predict = rcompose(model_handle.prep_images, model_handle.model.predict)
def predict(self, *args, **kwargs):
return self.__predict(*args, **kwargs)
def __call__(self, *args, **kwargs):
logger.debug("PredictionModelHandle.predict")
return self.predict(*args, **kwargs)


@ -0,0 +1,42 @@
import abc
import numpy as np
import tensorflow as tf
class ModelWrapper(abc.ABC):
def __init__(self, classes, base_weights_path=None, weights_path=None):
self.__classes = classes
self.model = self.__build(base_weights_path)
self.model.load_weights(weights_path)
@property
@abc.abstractmethod
def input_shape(self):
raise NotImplementedError
@property
def classes(self):
return self.__classes
@abc.abstractmethod
def __preprocess_tensor(self, tensor):
raise NotImplementedError
@staticmethod
def __images_to_tensor(images):
return np.array(list(map(tf.keras.preprocessing.image.img_to_array, images)))
def __resize_and_convert(self, image):
return image.resize(self.input_shape[:-1]).convert("RGB")
def prep_images(self, images):
images = map(self.__resize_and_convert, images)
tensor = self.__images_to_tensor(images)
tensor = self.__preprocess_tensor(tensor)
return tensor
@abc.abstractmethod
def __build(self, base_weights=None) -> tf.keras.models.Model:
raise NotImplementedError


@ -0,0 +1,63 @@
from functools import lru_cache
from itertools import groupby
import numpy as np
from funcy import compose, second
from image_prediction.stitching.utils import make_coord_getter
class CoordGrouper:
def __init__(self, axis, tolerance=0):
self.c1_getter = make_coord_getter(f"{other_axis(axis)}1")
self.c2_getter = make_coord_getter(f"{other_axis(axis)}2")
self.tolerance = tolerance
def group_pairs_by_lesser_coordinate(self, pairs):
return group_by_coordinate(pairs, self.c1_getter, self.tolerance)
def group_pairs_by_greater_coordinate(self, pairs):
return group_by_coordinate(pairs, self.c2_getter, self.tolerance)
def other_axis(axis):
return "y" if axis == "x" else "x"
def fuzzify(func, tolerance):
def inner(item):
nonlocal mid_points
nonlocal lower_bounds
nonlocal upper_bounds
value = func(item)
fits = (array(lower_bounds_array()) <= value) & (value <= array(upper_bounds_array()))
if any(fits):
return mid_points[np.argmax(fits)]
else:
mid_points = [*mid_points, value]
lower_bounds = [*lower_bounds, value - tolerance]
upper_bounds = [*upper_bounds, value + tolerance]
return value
def lower_bounds_array():
return tuple(lower_bounds)
def upper_bounds_array():
return tuple(upper_bounds)
@lru_cache(maxsize=None)
def array(tpl):
return np.array(tpl)
lower_bounds = []
upper_bounds = []
mid_points = []
return inner
def group_by_coordinate(pairs, coord_getter, tolerance=0):
coord_getter = fuzzify(coord_getter, tolerance)
pairs = sorted(pairs, key=coord_getter)
return map(compose(list, second), groupby(pairs, coord_getter))
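To make the tolerance semantics concrete, a small sketch; the plain `lambda` getter stands in for `make_coord_getter`. With `tolerance=5`, 103 is absorbed into the ±5 band around the first seen value 100, while 120 opens a new group:

```python
from image_prediction.stitching.grouping import group_by_coordinate

boxes = [{"x1": 100}, {"x1": 103}, {"x1": 120}]
groups = group_by_coordinate(boxes, lambda box: box["x1"], tolerance=5)
[[box["x1"] for box in group] for group in groups]  # -> [[100, 103], [120]]
```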

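How the fuzzy grouping behaves: fuzzify snaps any coordinate within tolerance of an already-seen value to that value's mid point, so group_by_coordinate puts almost-aligned boxes into the same group. A simplified, numpy-free sketch of the same idea (Pair is a hypothetical stand-in for ImageMetadataPair):

from collections import namedtuple
from itertools import groupby

Pair = namedtuple("Pair", "name x1")

def fuzzy_key(tolerance):
    seen = []  # representative coordinates seen so far

    def key(pair):
        for mid in seen:
            if abs(pair.x1 - mid) <= tolerance:
                return mid  # snap to the existing representative
        seen.append(pair.x1)
        return pair.x1

    return key

pairs = [Pair("a", 100), Pair("b", 102), Pair("c", 250)]
key = fuzzy_key(tolerance=3)
groups = [list(group) for _, group in groupby(sorted(pairs, key=key), key)]
print(groups)  # two groups: "a" and "b" fall together, "c" stays alone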
View File

@ -0,0 +1,189 @@
from copy import deepcopy
from functools import reduce
from typing import Iterable, Callable, List
from PIL import Image
from funcy import juxt, first, rest, rcompose, rpartial, complement, ilen
from image_prediction.image_extractor.extractor import ImageMetadataPair
from image_prediction.info import Info
from image_prediction.stitching.grouping import CoordGrouper
from image_prediction.stitching.split_mapper import HorizontalSplitMapper, VerticalSplitMapper
from image_prediction.stitching.utils import make_coord_getter, flatten_groups_once, validate_box
from image_prediction.utils.generic import until
def make_merger_sentinel():
def no_new_mergers(pairs):
nonlocal number_of_pairs_so_far
number_of_pairs_now = len(pairs)
if number_of_pairs_now == number_of_pairs_so_far:
return True
else:
number_of_pairs_so_far = number_of_pairs_now
return False
number_of_pairs_so_far = -1
return no_new_mergers
def merge_along_both_axes(pairs: Iterable[ImageMetadataPair], tolerance=0) -> List[ImageMetadataPair]:
pairs = merge_along_axis(pairs, "x", tolerance=tolerance)
pairs = list(merge_along_axis(pairs, "y", tolerance=tolerance))
return pairs
def merge_along_axis(pairs: Iterable[ImageMetadataPair], axis, tolerance=0) -> Iterable[ImageMetadataPair]:
"""Partially merges image-metadata pairs of adjacent images along a given axis. Needs to be iterated with
alternating axes until no more merges happen to merge all adjacent images.
Explanation:
Merging algorithm works as follows:
A dot represents a pair, a bracket a group and a colon a merged pair.
1) Start with pairs: (........)
2) Align on lesser: ([....] [....])
3) Align on greater: ([[..] [..]] [[....]])
4) Flatten once: ([..] [..] [....])
5) Merge orthogonally: ([:] [..] [:..])
6) Flatten once: (:..:..)
"""
def group_pairs_within_groups_by_greater_coordinate(groups):
return map(CoordGrouper(axis, tolerance=tolerance).group_pairs_by_greater_coordinate, groups)
def merge_groups_along_orthogonal_axis(groups):
return map(rpartial(make_group_merger(axis), tolerance), groups)
def group_pairs_by_lesser_coordinate(pairs):
return CoordGrouper(axis, tolerance=tolerance).group_pairs_by_lesser_coordinate(pairs)
return rcompose(
group_pairs_by_lesser_coordinate,
group_pairs_within_groups_by_greater_coordinate,
flatten_groups_once,
merge_groups_along_orthogonal_axis,
flatten_groups_once,
)(pairs)
def make_group_merger(axis):
return {"y": merge_group_vertically, "x": merge_group_horizontally}[axis]
def merge_group_vertically(group: Iterable[ImageMetadataPair], tolerance=0):
return merge_group(group, "y", tolerance=tolerance)
def merge_group_horizontally(group: Iterable[ImageMetadataPair], tolerance=0):
return merge_group(group, "x", tolerance=tolerance)
def merge_group(group: Iterable[ImageMetadataPair], direction, tolerance=0):
reduce_group = make_merger_aggregator(direction, tolerance=tolerance)
no_new_mergers = make_merger_sentinel()
return until(no_new_mergers, reduce_group, group)
def make_merger_aggregator(axis, tolerance=0) -> Callable[[Iterable[ImageMetadataPair]], Iterable[ImageMetadataPair]]:
"""Produces a function f : [H, T1, ... Tn] -> [HTi...Tj, Tk ... Tl] that merges adjacent image-metadata pairs on the
head H and aggregates non-adjacent in the tail T.
Note:
When tolerance > 0, the bounding box of the merged image no longer matches the bounding box of the mereged
metadata. This is intended behaviour, but might be not be expected by the caller.
"""
def merger_aggregator(pairs: Iterable[ImageMetadataPair]):
def merge_on_head_and_aggregate_in_tail(pairs_aggr: Iterable[ImageMetadataPair], pair: ImageMetadataPair):
"""Keeps the image that is being merged with as the head and aggregates non-mergables in the tail."""
aggr, non_aggr = juxt(first, rest)(pairs_aggr)
if abs(c2_getter(aggr) - c1_getter(pair)) <= tolerance:
aggr = pair_merger(aggr, pair)
return aggr, *non_aggr
else:
return aggr, pair, *non_aggr
# Requires H to be the least element in image-concatenation direction by c1, since the concatenation happens
# only in c1 -> c2 direction.
pairs = sorted(pairs, key=c1_getter)
head_pair, pairs = juxt(first, rest)(pairs)
return list(reduce(merge_on_head_and_aggregate_in_tail, pairs, [head_pair]))
assert tolerance >= 0
c1_getter = make_coord_getter(f"{axis}1")
c2_getter = make_coord_getter(f"{axis}2")
pair_merger = make_pair_merger(axis)
return merger_aggregator
def make_pair_merger(axis):
return {"y": merge_pair_vertically, "x": merge_pair_horizontally}[axis]
def merge_pair_vertically(p1: ImageMetadataPair, p2: ImageMetadataPair):
metadata_merged = merge_metadata_vertically(p1.metadata, p2.metadata)
image_concatenated = concat_images_vertically(p1.image, p2.image, metadata_merged)
return ImageMetadataPair(image_concatenated, metadata_merged)
def merge_pair_horizontally(p1: ImageMetadataPair, p2: ImageMetadataPair):
metadata_merged = merge_metadata_horizontally(p1.metadata, p2.metadata)
image_concatenated = concat_images_horizontally(p1.image, p2.image, metadata_merged)
return ImageMetadataPair(image_concatenated, metadata_merged)
def merge_metadata_vertically(m1: dict, m2: dict):
m1, m2 = map(VerticalSplitMapper, [m1, m2])
return merge_metadata(m1, m2)
def merge_metadata_horizontally(m1: dict, m2: dict):
m1, m2 = map(HorizontalSplitMapper, [m1, m2])
return merge_metadata(m1, m2)
def merge_metadata(m1: dict, m2: dict):
c1 = min(m1.c1, m2.c1)
c2 = max(m1.c2, m2.c2)
dim = abs(c2 - c1)
merged = deepcopy(m1)
merged.dim = dim
merged.c1 = c1
merged.c2 = c2
validate_box(merged.wrapped)
return merged.wrapped
def concat_images_vertically(im1: Image.Image, im2: Image.Image, metadata: dict):
return concat_images(im1, im2, metadata, 1)
def concat_images_horizontally(im1: Image.Image, im2: Image.Image, metadata: dict):
return concat_images(im1, im2, metadata, 0)
def concat_images(im1: Image.Image, im2: Image.Image, metadata: dict, axis):
im_aggr = Image.new(im1.mode, (metadata[Info.WIDTH], metadata[Info.HEIGHT]))
images = [im1, im2]
offsets = 0, im1.size[axis], im_aggr.size[axis] - im2.size[axis]
for im, offset in zip(images, offsets):
box = (offset, 0) if not axis else (0, offset)
im_aggr.paste(im, box=box)
return im_aggr

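The merging is driven to a fixed point: merge_group keeps re-applying the aggregator until the sentinel observes that the number of pairs has stopped shrinking. A self-contained sketch of that loop, with a plain-Python until and a toy merge step that fuses equal neighbours instead of images:

def until(cond, func, seed):  # plain-Python stand-in for image_prediction.utils.generic.until
    value = seed
    while not cond(value):
        value = func(value)
    return value

def make_merger_sentinel():
    count_so_far = -1

    def no_new_mergers(items):
        nonlocal count_so_far
        if len(items) == count_so_far:
            return True
        count_so_far = len(items)
        return False

    return no_new_mergers

def fuse_equal_neighbours(xs):  # toy merge step; the real one is geometric
    out = []
    for x in xs:
        if not (out and out[-1] == x):
            out.append(x)
    return out

print(until(make_merger_sentinel(), fuse_equal_neighbours, [1, 1, 2, 2, 3]))  # [1, 2, 3]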
View File

@ -0,0 +1,40 @@
from copy import deepcopy
from dataclasses import field, dataclass
from operator import attrgetter
from image_prediction.info import Info
@dataclass
class SplitMapper:
"""Manages access into a mapping M by indirection through a specified access mapping to achieve a common
interface between various M_i.
"""
__access_mapping: dict
wrapped: dict
__wrapped: dict = field(init=False)
def __post_init__(self):
for k, v in self.__access_mapping.items():
setattr(self, k, self.__wrapped[v])
@property
def wrapped(self):
ret = deepcopy(self.__wrapped)
ret.update(dict(zip(self.__access_mapping.values(), attrgetter(*self.__access_mapping.keys())(self))))
return ret
@wrapped.setter
def wrapped(self, wrapped):
self.__wrapped = wrapped
class HorizontalSplitMapper(SplitMapper):
def __init__(self, wrapped: dict):
super().__init__({"dim": Info.WIDTH, "c1": Info.X1, "c2": Info.X2}, wrapped)
class VerticalSplitMapper(SplitMapper):
def __init__(self, wrapped: dict):
super().__init__({"dim": Info.HEIGHT, "c1": Info.Y1, "c2": Info.Y2}, wrapped)

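What the indirection buys: merge_metadata above can treat horizontal and vertical merges identically, because both mappers expose the same c1/c2/dim attributes over different underlying keys. A minimal sketch of the idea, with plain string keys standing in for the Info enum:

class AxisView:
    def __init__(self, mapping, wrapped):
        self.mapping = mapping  # attribute name -> dict key
        self.wrapped = wrapped

    def __getattr__(self, name):
        try:
            return self.wrapped[self.mapping[name]]
        except KeyError:
            raise AttributeError(name)

box = {"x1": 10, "x2": 50, "width": 40, "y1": 0, "y2": 20, "height": 20}
horizontal = AxisView({"c1": "x1", "c2": "x2", "dim": "width"}, box)
vertical = AxisView({"c1": "y1", "c2": "y2", "dim": "height"}, box)
print(horizontal.c1, horizontal.c2, horizontal.dim)  # 10 50 40
print(vertical.c1, vertical.c2, vertical.dim)        # 0 20 20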
View File

@ -0,0 +1,15 @@
from typing import Iterable, List
from funcy import rpartial
from image_prediction.image_extractor.extractor import ImageMetadataPair
from image_prediction.stitching.merging import merge_along_both_axes, make_merger_sentinel
from image_prediction.utils.generic import until
def stitch_pairs(pairs: Iterable[ImageMetadataPair], tolerance=0) -> List[ImageMetadataPair]:
"""Given a collection of image-metadata pairs from the same pages, combines all pairs that constitute adjacent
images."""
no_new_mergers = make_merger_sentinel()
merge = rpartial(merge_along_both_axes, tolerance)
return until(no_new_mergers, merge, pairs)

View File

@ -0,0 +1,67 @@
import json
from itertools import chain
from image_prediction.exceptions import InvalidBox
from image_prediction.formatter.formatters.enum import EnumFormatter
from image_prediction.info import Info
def flatten_groups_once(groups):
return chain.from_iterable(groups)
def make_coord_getter(c):
return {
"x1": make_getter(Info.X1),
"x2": make_getter(Info.X2),
"y1": make_getter(Info.Y1),
"y2": make_getter(Info.Y2),
}[c]
def make_getter(key):
def getter(pair):
return pair.metadata[key]
return getter
def make_length_getter(dim):
return {
"width": make_getter(Info.WIDTH),
"height": make_getter(Info.HEIGHT),
}[dim]
def validate_box(box):
validate_box_coords(box)
validate_box_size(box)
return box
def validate_box_coords(box):
x_diff = box[Info.WIDTH] - (box[Info.X2] - box[Info.X1])
y_diff = box[Info.HEIGHT] - (box[Info.Y2] - box[Info.Y1])
if x_diff:
raise InvalidBox(f"Width and x-coordinates differ by {x_diff} units: {format_box(box)}")
if y_diff:
raise InvalidBox(f"Width and y-coordinates differ by {y_diff} units: {format_box(box)}")
return box
def validate_box_size(box):
if not box[Info.WIDTH]:
raise InvalidBox(f"Zero width box: {format_box(box)}")
if not box[Info.HEIGHT]:
raise InvalidBox(f"Zero height box: {format_box(box)}")
return box
def format_box(box):
return json.dumps(EnumFormatter()(box), indent=2)

View File

View File

@ -0,0 +1,20 @@
import abc
from typing import Iterable
from funcy import curry, identity
class Transformer(abc.ABC):
@abc.abstractmethod
def transform(self, obj):
raise NotImplementedError
def __call__(self, obj):
return self._apply(self.transform, obj)
@staticmethod
def _must_be_mapped_over(obj):
return isinstance(obj, Iterable) and not isinstance(obj, dict)
def _apply(self, func, obj):
return (curry(map) if self._must_be_mapped_over(obj) else identity)(func)(obj)

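A small subclass makes the _apply dispatch visible: a dict is transformed as a single object, while any other iterable is mapped over lazily (hence the explicit list below). Assumes the Transformer class above; UpperKeys is a hypothetical example transformer:

class UpperKeys(Transformer):
    def transform(self, obj):
        return {k.upper(): v for k, v in obj.items()}

transformer = UpperKeys()
print(transformer({"a": 1}))                    # {'A': 1}
print(list(transformer([{"a": 1}, {"b": 2}])))  # [{'A': 1}, {'B': 2}]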
View File

@ -0,0 +1,22 @@
import abc
from image_prediction.transformer.transformer import Transformer
class CoordinateTransformer(Transformer):
@abc.abstractmethod
def _forward(self, metadata):
raise NotImplementedError
@abc.abstractmethod
def _backward(self, metadata):
raise NotImplementedError
def forward(self, metadata):
return self._apply(self._forward, metadata)
def backward(self, metadata):
return self._apply(self._backward, metadata)
def transform(self, metadata):
return self.forward(metadata)

View File

@ -0,0 +1,10 @@
from image_prediction.transformer.transformers.coordinate.coordinate_transformer import CoordinateTransformer
class FitzCoordinateTransformer(CoordinateTransformer):
def _forward(self, metadata: dict):
"""Fitz uses top left corner as origin; we take this as the reference coordinate system."""
return metadata
def _backward(self, metadata: dict):
return self.forward(metadata)

View File

@ -0,0 +1,10 @@
from image_prediction.transformer.transformers.coordinate.coordinate_transformer import CoordinateTransformer
class FPDFCoordinateTransformer(CoordinateTransformer):
def _forward(self, metadata: dict):
"""FPDF uses top left corner as origin; we take this as the reference coordinate system."""
return metadata
def _backward(self, metadata: dict):
return self.forward(metadata)

View File

@ -0,0 +1,18 @@
from operator import itemgetter
from funcy import omit
from image_prediction.info import Info
from image_prediction.transformer.transformers.coordinate.coordinate_transformer import CoordinateTransformer
class PDFNetCoordinateTransformer(CoordinateTransformer):
def _forward(self, metadata: dict):
"""PDFNet coordinate system origin is in the bottom left corner."""
y1, y2, page_height = itemgetter(Info.Y1, Info.Y2, Info.PAGE_HEIGHT)(metadata)
y1_t = page_height - y2
y2_t = page_height - y1
return {**omit(metadata, [Info.Y1, Info.Y2]), **{Info.Y1: y1_t, Info.Y2: y2_t}}
def _backward(self, metadata: dict):
return self.forward(metadata)

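The y-flip in _forward is an involution: applying it twice restores the original coordinates, which is why _backward can simply delegate to forward. A plain-number check, assuming a page height of 100:

page_height = 100
y1, y2 = 20, 30                                  # origin at the bottom-left (PDFNet)
y1_t, y2_t = page_height - y2, page_height - y1  # -> 70, 80, origin at the top-left
assert (page_height - y2_t, page_height - y1_t) == (y1, y2)  # flipping twice is the identity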
View File

@ -1,28 +1,30 @@
"""Defines functions for constructing service responses."""
import math import math
from itertools import starmap
from operator import itemgetter from operator import itemgetter
from image_prediction.config import CONFIG from image_prediction.config import CONFIG
from image_prediction.transformer.transformer import Transformer
from image_prediction.utils import get_logger
logger = get_logger()
def build_response(predictions: list, metadata: list) -> list: class ResponseTransformer(Transformer):
return list(starmap(build_image_info, zip(predictions, metadata))) def transform(self, data):
logger.debug("ResponseTransformer.transform")
return build_image_info(data)
def build_image_info(prediction: dict, metadata: dict) -> dict: def build_image_info(data: dict) -> dict:
def compute_geometric_quotient(): def compute_geometric_quotient():
page_area_sqrt = math.sqrt(abs(page_width * page_height)) page_area_sqrt = math.sqrt(abs(page_width * page_height))
image_area_sqrt = math.sqrt(abs(x2 - x1) * abs(y2 - y1)) image_area_sqrt = math.sqrt(abs(x2 - x1) * abs(y2 - y1))
return image_area_sqrt / page_area_sqrt return image_area_sqrt / page_area_sqrt
page_width, page_height, x1, x2, y1, y2, width, height = itemgetter( page_width, page_height, x1, x2, y1, y2, width, height, alpha = itemgetter(
"page_width", "page_height", "x1", "x2", "y1", "y2", "width", "height" "page_width", "page_height", "x1", "x2", "y1", "y2", "width", "height", "alpha"
)(metadata) )(data)
quotient = compute_geometric_quotient() quotient = round(compute_geometric_quotient(), 4)
min_image_to_page_quotient_breached = bool(quotient < CONFIG.filters.image_to_page_quotient.min) min_image_to_page_quotient_breached = bool(quotient < CONFIG.filters.image_to_page_quotient.min)
max_image_to_page_quotient_breached = bool(quotient > CONFIG.filters.image_to_page_quotient.max) max_image_to_page_quotient_breached = bool(quotient > CONFIG.filters.image_to_page_quotient.max)
@ -33,14 +35,15 @@ def build_image_info(prediction: dict, metadata: dict) -> dict:
width / height > CONFIG.filters.image_width_to_height_quotient.max width / height > CONFIG.filters.image_width_to_height_quotient.max
) )
min_confidence_breached = bool(max(prediction["probabilities"].values()) < CONFIG.filters.min_confidence) classification = data["classification"]
prediction["label"] = prediction.pop("class") # "class" as field name causes problem for Java objectmapper
prediction["probabilities"] = {klass: round(prob, 6) for klass, prob in prediction["probabilities"].items()} min_confidence_breached = bool(max(classification["probabilities"].values()) < CONFIG.filters.min_confidence)
image_info = { image_info = {
"classification": prediction, "classification": classification,
"position": {"x1": x1, "x2": x2, "y1": y1, "y2": y2, "pageNumber": metadata["page_idx"] + 1}, "position": {"x1": x1, "x2": x2, "y1": y1, "y2": y2, "pageNumber": data["page_idx"] + 1},
"geometry": {"width": width, "height": height}, "geometry": {"width": width, "height": height},
"alpha": alpha,
"filters": { "filters": {
"geometry": { "geometry": {
"imageSize": { "imageSize": {
@ -49,7 +52,7 @@ def build_image_info(prediction: dict, metadata: dict) -> dict:
"tooSmall": min_image_to_page_quotient_breached, "tooSmall": min_image_to_page_quotient_breached,
}, },
"imageFormat": { "imageFormat": {
"quotient": width / height, "quotient": round(width / height, 4),
"tooTall": min_image_width_to_height_quotient_breached, "tooTall": min_image_width_to_height_quotient_breached,
"tooWide": max_image_width_to_height_quotient_breached, "tooWide": max_image_width_to_height_quotient_breached,
}, },

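For intuition on the geometric quotient: it is the square root of the image-to-page area ratio, so a value of 0.5 means the image covers about a quarter of the page's area. A worked example with illustrative numbers:

import math
page_width, page_height = 595, 842   # an A4 page in PDF points
x1, y1, x2, y2 = 100, 100, 397, 521  # image bounding box
image_area_sqrt = math.sqrt(abs(x2 - x1) * abs(y2 - y1))
page_area_sqrt = math.sqrt(abs(page_width * page_height))
print(round(image_area_sqrt / page_area_sqrt, 4))  # ~0.4996: roughly half the page per side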
View File

@ -1,68 +1,3 @@
import logging
import tempfile
from contextlib import contextmanager
from image_prediction.config import CONFIG
@contextmanager
def temporary_pdf_file(pdf: bytes):
with tempfile.NamedTemporaryFile() as f:
f.write(pdf)
yield f.name
def make_logger_getter():
logger = logging.getLogger("imclf")
logger.propagate = False
handler = logging.StreamHandler()
handler.setLevel(CONFIG.service.logging_level)
log_format = "[%(levelname)s]: %(message)s"
formatter = logging.Formatter(log_format)
handler.setFormatter(formatter)
logger.addHandler(handler)
def get_logger():
return logger
return get_logger
get_logger = make_logger_getter()
def show_banner():
banner = '''
..... . ... ..
.d88888Neu. 'L xH88"`~ .x8X x .d88" oec :
F""""*8888888F .. . : :8888 .f"8888Hf 5888R @88888
* `"*88*" .888: x888 x888. :8888> X8L ^""` '888R 8"*88%
-.... ue=:. ~`8888~'888X`?888f` X8888 X888h 888R 8b.
:88N ` X888 888X '888> 88888 !88888. 888R u888888>
9888L X888 888X '888> 88888 %88888 888R 8888R
uzu. `8888L X888 888X '888> 88888 '> `8888> 888R 8888P
,""888i ?8888 X888 888X '888> `8888L % ?888 ! 888R *888>
4 9888L %888> "*88%""*88" '888!` `8888 `-*"" / .888B . 4888
' '8888 '88% `~ " `"` "888. :" ^*888% '888
"*8Nu.z*" `""***~"` "% 88R
88>
48
'8
'''
logger = logging.getLogger(__name__)
logger.propagate = False
handler = logging.StreamHandler()
handler.setLevel(logging.INFO)
formatter = logging.Formatter("")
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.info(banner)

View File

@ -0,0 +1 @@
from .logger import get_logger

View File

@ -0,0 +1,21 @@
import logging
from image_prediction.locations import BANNER_FILE
def show_banner():
with open(BANNER_FILE) as f:
banner = "\n" + "".join(f.readlines()) + "\n"
logger = logging.getLogger(__name__)
logger.propagate = False
handler = logging.StreamHandler()
handler.setLevel(logging.INFO)
formatter = logging.Formatter("")
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.info(banner)

View File

@ -0,0 +1,15 @@
from itertools import starmap
from funcy import iterate, first, curry, map
def until(cond, func, *args, **kwargs):
return first(filter(cond, iterate(func, *args, **kwargs)))
def lift(fn):
return curry(map)(fn)
def starlift(fn):
return curry(starmap)(fn)

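Quick checks of these helpers, assuming funcy is available: until returns the first element of the orbit seed, func(seed), func(func(seed)), ... that satisfies cond, and lift turns an element-wise function into a lazy sequence function:

print(until(lambda x: x > 100, lambda x: x * 2, 3))  # 192, the first doubling of 3 past 100
doubled = lift(lambda x: x * 2)
print(list(doubled([1, 2, 3])))                      # [2, 4, 6]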
View File

@ -0,0 +1,27 @@
import logging
from image_prediction.config import CONFIG
def make_logger_getter():
logger = logging.getLogger("imclf")
logger.propagate = False
handler = logging.StreamHandler()
handler.setLevel(CONFIG.service.logging_level)
log_format = "%(asctime)s %(levelname)-8s %(message)s"
formatter = logging.Formatter(log_format, datefmt="%Y-%m-%d %H:%M:%S")
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(CONFIG.service.logging_level)
def get_logger():
return logger
return get_logger
get_logger = make_logger_getter()

View File

@ -0,0 +1,99 @@
"""Defines utilities for PDF processing."""
import json
from operator import itemgetter
from PDFNetPython3.PDFNetPython import (
PDFDoc,
PDFNet,
Square,
Rect,
ColorPt,
BorderStyle,
SDFDoc,
Point,
Text,
)
from image_prediction.utils import get_logger
logger = get_logger()
def annotate_image(doc, image_info):
def draw_box():
sq = Square.Create(doc.GetSDFDoc(), Rect(*coords))
sq.SetColor(ColorPt(*color), 3)
sq.SetBorderStyle(BorderStyle(BorderStyle.e_dashed, 2, 0, 0, [4, 2]))
sq.SetPadding(4)
sq.RefreshAppearance()
page.AnnotPushBack(sq)
def add_note():
txt = Text.Create(doc.GetSDFDoc(), Point(*coords[:2]))
txt.SetContents(json.dumps(image_info, indent=2, ensure_ascii=False))
txt.SetColor(ColorPt(*color))
page.AnnotPushBack(txt)
txt.RefreshAppearance()
red = (1, 0, 0)
green = (0, 1, 0)
blue = (0, 0, 1)
if image_info["filters"]["allPassed"]:
color = green
elif image_info["filters"]["probability"]["unconfident"]:
color = red
else:
color = blue
page = doc.GetPage(image_info["position"]["pageNumber"])
coords = itemgetter("x1", "y1", "x2", "y2")(image_info["position"])
draw_box()
add_note()
def init():
PDFNet.Initialize(
"Knecon AG(en.knecon.swiss):OEM:DDA-R::WL+:AMS(20211029):BECC974307DAB4F34B513BC9B2531B24496F6FCB83CD8AC574358A959730B622FABEF5C7"
)
def draw_metadata_box(pdf_path, metadata, store_path):
init()
doc = PDFDoc(pdf_path)
color = (1, 0, 0)
print(metadata)
coords = itemgetter("x1", "y1", "x2", "y2")(metadata)
page = doc.GetPage(1)
sq = Square.Create(doc.GetSDFDoc(), Rect(*coords))
sq.SetColor(ColorPt(*color), 3)
sq.SetBorderStyle(BorderStyle(BorderStyle.e_dashed, 2, 0, 0, [4, 2]))
sq.SetPadding(4)
sq.RefreshAppearance()
page.AnnotPushBack(sq)
doc.Save(store_path, SDFDoc.e_linearized)
logger.info(f"Saved annotated PDF to {store_path}")
def annotate_pdf(pdf_path, responses, store_path):
init()
doc = PDFDoc(pdf_path)
for image_info in responses:
annotate_image(doc, image_info)
doc.Save(store_path, SDFDoc.e_linearized)
logger.info(f"Saved annotated PDF to {store_path}")

@ -1 +0,0 @@
Subproject commit 4c3b26d7673457aaa99e0663dad6950cd36da967

View File

@ -1,2 +1,5 @@
[pytest]
norecursedirs = incl
filterwarnings =
    ignore:.*:DeprecationWarning

View File

@ -1,23 +1,23 @@
Flask==2.1.1
requests==2.27.1
iteration-utilities==0.11.0
dvc==2.10.0
dvc[ssh]
waitress==2.1.1
envyaml==1.10.211231
dependency-check==0.6.*
mlflow==1.24.0
numpy==1.22.3
tqdm==4.64.0
pandas==1.4.2
tensorflow==2.8.0
PyYAML==6.0
pytest~=7.1.0
funcy==1.17
PyMuPDF==1.19.6
fpdf==1.7.2
coverage==6.3.2
Pillow==9.1.0
PDFNetPython3==9.1.0
pdf2image==1.16.0
frozendict==2.3.0

run_tests.sh Executable file
View File

@ -0,0 +1,15 @@
echo "${bamboo_nexus_password}" | docker login --username "${bamboo_nexus_user}" --password-stdin nexus.iqser.com:5001
pip install dvc
pip install 'dvc[ssh]'
echo "Pulling dvc data"
dvc pull
docker build -f Dockerfile_tests -t image-prediction-tests .
rnd=$(date +"%s")
name=image-prediction-tests-${rnd}
echo "running tests container"
docker run --rm --name $name -v $PWD:$PWD -w $PWD -v /var/run/docker.sock:/var/run/docker.sock image-prediction-tests

View File

@ -40,7 +40,7 @@ def make_predict_fn():
    model = make_model()
    def predict(*args):
        # service_estimator = make_model()
        return model.predict(np.random.random(size=(1, 784)))
    return predict

View File

@ -6,7 +6,7 @@ import requests
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("pdf_path")
    args = parser.parse_args()
    return args

scripts/run_pipeline.py Normal file
View File

@ -0,0 +1,55 @@
import argparse
import json
import os
from glob import glob
from image_prediction.pipeline import load_pipeline
from image_prediction.utils import get_logger
from image_prediction.utils.pdf_annotation import annotate_pdf
logger = get_logger()
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("input", help="pdf file or directory")
parser.add_argument("--print", "-p", help="print output to terminal", action="store_true", default=False)
parser.add_argument("--page_interval", "-i", help="page interval [i, j), min index = 0", nargs=2, type=int)
args = parser.parse_args()
return args
def process_pdf(pipeline, pdf_path, page_range=None):
with open(pdf_path, "rb") as f:
logger.info(f"Processing {pdf_path}")
predictions = list(pipeline(f.read(), page_range=page_range))
annotate_pdf(
pdf_path, predictions, os.path.join("/tmp", os.path.basename(pdf_path.replace(".pdf", "_annotated.pdf")))
)
return predictions
def main(args):
pipeline = load_pipeline(verbose=True, tolerance=3)
if os.path.isfile(args.input):
pdf_paths = [args.input]
else:
pdf_paths = glob(os.path.join(args.input, "*.pdf"))
page_range = range(*args.page_interval) if args.page_interval else None
for pdf_path in pdf_paths:
predictions = process_pdf(pipeline, pdf_path, page_range=page_range)
if args.print:
print(pdf_path)
print(json.dumps(predictions, indent=2))
if __name__ == "__main__":
args = parse_args()
main(args)

View File

@ -1,15 +0,0 @@
#!/bin/bash
set -e
python3 -m venv build_venv
source build_venv/bin/activate
python3 -m pip install --upgrade pip
pip install dvc
pip install 'dvc[ssh]'
dvc pull
git submodule update --init --recursive
docker build -f Dockerfile_base -t image-prediction-base .
docker build -f Dockerfile -t image-prediction .

View File

@ -4,45 +4,29 @@ from waitress import serve
from image_prediction.config import CONFIG
from image_prediction.flask import make_prediction_server
from image_prediction.pipeline import load_pipeline
from image_prediction.utils import get_logger
from image_prediction.utils.banner import show_banner
def main():
    logger = get_logger()
    def predict(pdf):
        # Keras service_estimator.predict stalls when service_estimator was loaded in different process;
        # therefore, we re-load the model (part of the pipeline) every time we process a new document.
        # https://stackoverflow.com/questions/42504669/keras-tensorflow-and-multiprocessing-in-python
        logger.debug("Loading pipeline...")
        pipeline = load_pipeline(verbose=CONFIG.service.verbose, batch_size=CONFIG.service.batch_size)
        logger.debug("Running pipeline...")
        return list(pipeline(pdf))
    prediction_server = make_prediction_server(predict)
    run_prediction_server(prediction_server, mode=CONFIG.webserver.mode)
def run_prediction_server(app, mode="development"):
    if mode == "development":
        app.run(host=CONFIG.webserver.host, port=CONFIG.webserver.port, debug=True)
    elif mode == "production":
        serve(app, host=CONFIG.webserver.host, port=CONFIG.webserver.port)
if __name__ == "__main__":
    logging.basicConfig(level=CONFIG.service.logging_level)
    logging.getLogger("flask").setLevel(logging.ERROR)
    logging.getLogger("urllib3").setLevel(logging.ERROR)
    logging.getLogger("werkzeug").setLevel(logging.ERROR)
    logging.getLogger("waitress").setLevel(logging.ERROR)
    logging.getLogger("PIL").setLevel(logging.ERROR)
    logging.getLogger("h5py").setLevel(logging.ERROR)
    show_banner()

Some files were not shown because too many files have changed in this diff.