Merge in RR/image-prediction from integrate-image-extraction-new-pyinfra to master
Squashed commit of the following:
commit 8470c065c71ea2a985aadfc399fb32c693e3a90f
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Aug 18 09:19:52 2022 +0200
add key script
commit 8f6eb1e79083fb32fb7bedac640c10b6fd411899
Merge: 27fd7de c1b9629
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Thu Aug 18 09:17:50 2022 +0200
Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into integrate-image-extraction-new-pyinfra
commit 27fd7de39a59d0d88fbddb471dd7797b61223ece
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 17 13:15:09 2022 +0200
update pyinfra
commit ca58f85642598dc15e286074982e7cedae9a1355
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Tue Aug 16 16:16:10 2022 +0200
update pdf2image-service
commit f43795cee0e211e14ac5f9296b01d440ae759c55
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Mon Aug 15 10:32:02 2022 +0200
update pipeline script to also work with figure detection metadata
commit 2b2da1b60ce56fb006cf2f6b65aeda9774391b2a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Fri Aug 12 13:37:48 2022 +0200
add new pyinfra, add optional image classification under key dataCV if figure metadata is present on storage
commit bae25bedbd3a262a9d00e18a1b19f4ee6f1eb924
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 10 13:27:41 2022 +0200
tidy-up
commit 287b0ebc8a952e506185d13508eaa386d0420704
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 10 12:57:35 2022 +0200
update server logic for new pyinfra, add extraction from scanned PDF with figure detection logic
commit 3225cefaa25e4559b105397bc06c867a22806ba8
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 10 10:37:31 2022 +0200
integrate new pyinfra logic
commit 46926078342b0680a7416560bb69bec037cf8038
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 3 13:15:27 2022 +0200
add image extraction for scanned PDFs WIP
commit 1b3b11b6f9044d44cb9a822a78197a2ebc6f306a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date: Wed Aug 3 09:41:06 2022 +0200
add pyinfra and pdf2image as git submodule
Setup
Build base image
docker build -f Dockerfile_base -t image-prediction-base .
docker build -f Dockerfile -t image-prediction .
Usage
Without Docker
python scripts/run_pipeline.py /path/to/a/pdf
With Docker
Shell 1
docker run --rm --net=host image-prediction
Shell 2
python scripts/pyinfra_mock.py /path/to/a/pdf
Tests
For example, run the following command to execute all tests and print a coverage report:
coverage run -m pytest test --tb=native -q -s -vvv -x && coverage combine && coverage report -m
After building the service container as described above, you can also run the tests inside a container:
./run_tests.sh
Message Body Formats
Request Format
Request messages must provide the fields "dossierId" and "fileId". A request looks like this:
{
    "dossierId": "<string identifier>",
    "fileId": "<string identifier>"
}
Any additional keys are ignored.
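As a minimal sketch of the request format above, the following Python helpers build a request body and reduce an incoming body to the two required fields. The function names `build_request` and `parse_request` are hypothetical and not part of the service, and raising `ValueError` on a missing field is an assumption; the README does not state how the service reacts to incomplete requests.

```python
import json

# The two fields the README names as required.
REQUIRED_FIELDS = ("dossierId", "fileId")


def build_request(dossier_id, file_id, **extra):
    """Serialize a request body (hypothetical helper).

    Extra keys may be included; according to the README they are ignored.
    """
    body = {"dossierId": dossier_id, "fileId": file_id, **extra}
    return json.dumps(body)


def parse_request(raw):
    """Return only the required fields of a JSON request body.

    Raising ValueError on a missing field is an assumption made for
    this sketch, not documented service behavior.
    """
    body = json.loads(raw)
    missing = [key for key in REQUIRED_FIELDS if key not in body]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))
    # Dropping everything else mirrors "additional keys are ignored".
    return {key: body[key] for key in REQUIRED_FIELDS}
```

A body built with an extra key, for instance `build_request("debug", "abc", note="x")`, parses back to just the "dossierId" and "fileId" entries.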
Response Format
Response bodies describe each image found in the file: its identified class, the confidence of the classification, its position and size, and the results of additional convenience filters, which can be configured through environment variables. A response body looks like this:
{
    "dossierId": "debug",
    "fileId": "13ffa9851740c8d20c4c7d1706d72f2a",
    "data": [...]
}
An image metadata record (an entry in the "data" field of a response body) looks like this:
{
    "classification": {
        "label": "logo",
        "probabilities": {
            "logo": 1.0,
            "signature": 1.1599173226749333e-17,
            "other": 2.994595513398207e-23,
            "formula": 4.352109377281029e-31
        }
    },
    "position": {
        "x1": 475.95,
        "x2": 533.4,
        "y1": 796.47,
        "y2": 827.62,
        "pageNumber": 6
    },
    "geometry": {
        "width": 57.44999999999999,
        "height": 31.149999999999977
    },
    "alpha": false,
    "filters": {
        "geometry": {
            "imageSize": {
                "quotient": 0.05975350599135938,
                "tooLarge": false,
                "tooSmall": false
            },
            "imageFormat": {
                "quotient": 1.8443017656500813,
                "tooTall": false,
                "tooWide": false
            }
        },
        "probability": {
            "unconfident": false
        },
        "allPassed": true
    }
}
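To illustrate how a client might read such records, here is a small Python sketch. The helper names are hypothetical and not part of the service. It assumes that "width" and "height" are the differences of the "position" coordinates and that "label" is the entry with the highest probability; both assumptions hold for the sample record but are not stated in this README.

```python
def box_size(position):
    """Return (width, height) of a bounding box from its x1/x2/y1/y2 keys."""
    return position["x2"] - position["x1"], position["y2"] - position["y1"]


def predicted_label(record):
    """Return the most probable label and its probability.

    Assumes "label" corresponds to the highest entry in "probabilities".
    """
    probabilities = record["classification"]["probabilities"]
    label = max(probabilities, key=probabilities.get)
    return label, probabilities[label]


def accepted_records(response):
    """Yield the records of a response body whose filters all passed."""
    for record in response.get("data", []):
        if record["filters"]["allPassed"]:
            yield record
```

For the sample record, `box_size` gives roughly 57.45 by 31.15, in line with the "geometry" entry, and dividing width by height gives about 1.844, in line with the "imageFormat" quotient.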
Configuration
The configuration file is config.yaml. All relevant variables can also be set by exporting the corresponding environment variables.
| Environment Variable | Default | Description |
|---|---|---|
| LOGGING_LEVEL_ROOT | "INFO" | Logging level for log file messages |
| VERBOSE | true | Service prints document processing progress to stdout |
| BATCH_SIZE | 16 | Number of images in memory simultaneously per service instance |
| RUN_ID | "fabfb1f192c745369b88cab34471aba7" | The ID of the mlflow run to load the image classifier from |
| MIN_REL_IMAGE_SIZE | 0.05 | Minimum permissible ratio of image size to page size |
| MAX_REL_IMAGE_SIZE | 0.75 | Maximum permissible ratio of image size to page size |
| MIN_IMAGE_FORMAT | 0.1 | Minimum permissible ratio of image width to image height |
| MAX_IMAGE_FORMAT | 10 | Maximum permissible ratio of image width to image height |
See also: https://git.iqser.com/projects/RED/repos/helm/browse/redaction/templates/image-service-v2
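The following Python sketch shows one way the four geometry thresholds from the table could be read from the environment and turned into the flags seen in the "filters" section of a response. It is not the service's implementation: the helper names are hypothetical, and both the strict comparisons and the mapping of "tooTall" to a small width-to-height ratio are assumptions.

```python
import os

# Defaults copied from the configuration table.
DEFAULTS = {
    "MIN_REL_IMAGE_SIZE": 0.05,
    "MAX_REL_IMAGE_SIZE": 0.75,
    "MIN_IMAGE_FORMAT": 0.1,
    "MAX_IMAGE_FORMAT": 10.0,
}


def load_thresholds(env=None):
    """Read thresholds from a mapping (os.environ by default), using the table defaults."""
    env = os.environ if env is None else env
    return {name: float(env.get(name, default)) for name, default in DEFAULTS.items()}


def geometry_flags(size_quotient, format_quotient, thresholds):
    """Build the geometry flags for one image (assumed semantics).

    size_quotient: image size relative to page size.
    format_quotient: image width divided by image height.
    """
    return {
        "imageSize": {
            "quotient": size_quotient,
            "tooLarge": size_quotient > thresholds["MAX_REL_IMAGE_SIZE"],
            "tooSmall": size_quotient < thresholds["MIN_REL_IMAGE_SIZE"],
        },
        "imageFormat": {
            "quotient": format_quotient,
            # A small width-to-height ratio means a tall, narrow image.
            "tooTall": format_quotient < thresholds["MIN_IMAGE_FORMAT"],
            "tooWide": format_quotient > thresholds["MAX_IMAGE_FORMAT"],
        },
    }
```

With the default thresholds, the quotients from the sample record (about 0.0598 and 1.844) produce four false flags, consistent with the sample response. Exporting, say, `MIN_REL_IMAGE_SIZE=0.1` before the call would flag the same image as too small under these assumed semantics.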