Compare commits


178 Commits

Author SHA1 Message Date
Julius Unverfehrt
0027421628 feat: RED-10765: ignore perceptual hash for image deduplication and prefer to keep the ones with allPassed set to True 2025-01-31 12:59:59 +01:00
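The deduplication behavior described in this commit could be sketched as follows (a minimal illustration; the record fields `hash` and `allPassed` and the function name are assumptions, not the service's actual schema):

```python
from typing import Any

def dedupe_images(images: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep one image per exact hash, preferring entries with allPassed=True.

    The perceptual hash is deliberately ignored; only the exact hash is
    used as the deduplication key. Field names are illustrative.
    """
    best: dict[str, dict[str, Any]] = {}
    for img in images:
        key = img["hash"]
        kept = best.get(key)
        # Replace a previously kept duplicate only if the new one passed
        # all checks and the kept one did not.
        if kept is None or (img.get("allPassed") and not kept.get("allPassed")):
            best[key] = img
    return list(best.values())
```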
Julius Unverfehrt
00740c91b8 Merge branch 'feat/RED-10765/filter-duplicate-images' into 'master'
feat: RED-10765: filter out classifications for 'duplicate' images present in the document

Closes RED-10765

See merge request redactmanager/image-classification-service!23
2025-01-30 13:20:19 +01:00
Julius Unverfehrt
a3d79eb9af feat: RED-10765: filter out classifications for 'duplicate' images present in the document 2025-01-30 12:42:41 +01:00
Jonathan Kössler
373f9f2d01 Merge branch 'bugfix/RED-10722' into 'master'
RED-10722: fix dead letter queue

Closes RED-10722

See merge request redactmanager/image-classification-service!22
2025-01-16 09:29:11 +01:00
Jonathan Kössler
2429d90dd5 chore: update pyinfra to v3.4.2 2025-01-15 13:39:16 +01:00
Julius Unverfehrt
2b85999258 Merge branch 'fix/RM-227' into 'master'
fix: RM-227: set minimum permissible value for logos

Closes RM-227 and RED-10686

See merge request redactmanager/image-classification-service!21
2024-12-18 12:39:44 +01:00
Julius Unverfehrt
4b15d2c2ca fix: RED-10686: set minimum permissible value for logos
Reference the Jira ticket for more information. This change can
introduce unwanted behavior.
2024-12-18 11:47:54 +01:00
Jonathan Kössler
bf1ca8d6f9 Merge branch 'feature/RED-10441' into 'master'
RED-10441: fix abandoned queues

Closes RED-10441

See merge request redactmanager/image-classification-service!20
2024-11-13 17:32:27 +01:00
Jonathan Kössler
9a4b8cad2b chore: update pyinfra to v3.3.5 2024-11-13 17:21:58 +01:00
Jonathan Kössler
28adb50330 chore: update pyinfra to v3.3.4 2024-11-13 16:39:49 +01:00
Jonathan Kössler
7a3fdf8fa4 chore: update pyinfra to v3.3.3 2024-11-13 14:54:29 +01:00
Jonathan Kössler
3fbcd65e9b chore: update pyinfra to v3.3.2 2024-11-13 09:56:55 +01:00
Jonathan Kössler
90a60b4b7c Merge branch 'chore/update_pyinfra' into 'master'
RES-858: fix graceful shutdown

See merge request redactmanager/image-classification-service!19
2024-09-30 11:01:24 +02:00
Jonathan Kössler
526de8984c chore: update pyinfra to v3.2.11 2024-09-30 10:12:40 +02:00
Jonathan Kössler
99cbf3c9bf Merge branch 'feature/RED-10017-fix-config' into 'master'
RED-10017: fix pyinfra config

Closes RED-10017

See merge request redactmanager/image-classification-service!18
2024-09-27 08:22:00 +02:00
Jonathan Kössler
986137e729 chore: update pyinfra to v3.2.10 2024-09-26 13:40:49 +02:00
Jonathan Kössler
f950b96cfb fix: pyinfra config 2024-09-24 14:31:10 +02:00
Francisco Schulz
2385d19bc2 Merge branch 'RED-10017-investigate-crashing-py-services-when-upload-large-number-of-files' into 'master'
RED-10017 "Investigate crashing py services when upload large number of files"

See merge request redactmanager/image-classification-service!17
2024-09-23 18:55:01 +02:00
Francisco Schulz
16f2f0d557 RED-10017 "Investigate crashing py services when upload large number of files" 2024-09-23 18:55:01 +02:00
Julius Unverfehrt
afa6fc34cb Merge branch 'improvement/RED-10018' into 'master'
feat: parameterize image stitching tolerance

Closes RED-10018

See merge request redactmanager/image-classification-service!16
2024-09-06 16:27:36 +02:00
Julius Unverfehrt
a192e05be2 feat: parameterize image stitching tolerance
Also sets the image stitching tolerance default to one (pixel) and adds
an informative log of which settings are loaded when initializing the
image classification pipeline.
2024-09-06 15:51:17 +02:00
Francisco Schulz
d23034e38a Merge branch 'fix/RED-9948' into 'master'
fix: regression of predictions

Closes RED-9948

See merge request redactmanager/image-classification-service!15
2024-08-30 16:06:50 +02:00
Julius Unverfehrt
4bc53cf88b chore: update pyinfra (for current features) 2024-08-30 15:54:52 +02:00
Julius Unverfehrt
e737f64ed2 fix: pin dependencies to working versions
BREAKING CHANGE

Recent pyinfra changes update tensorflow implicitly (see RED-9948).
This can be fixed by pinning tensorflow and protobuf.
However, this makes the service incompatible with the current pyinfra
versions.
2024-08-30 15:54:52 +02:00
Julius Unverfehrt
4b099f0106 chore: bump poetry version 2024-08-30 15:53:35 +02:00
Julius Unverfehrt
b3a58d6777 chore: add tests to ensure no regression happens ever again 2024-08-30 15:53:07 +02:00
Julius Unverfehrt
c888453cc6 fix: pin dependencies to working versions
BREAKING CHANGE

Recent pyinfra changes update tensorflow implicitly (see RED-9948).
This can be fixed by pinning tensorflow and protobuf.
However, this makes the service incompatible with the current pyinfra
versions.
2024-08-30 15:52:55 +02:00
Julius Unverfehrt
bf9ab4b1a2 chore: update run pipeline script to use all parameters that are used in production 2024-08-30 15:51:10 +02:00
Julius Unverfehrt
9ff88a1e5d chore: update test data 2024-08-30 15:51:10 +02:00
Julius Unverfehrt
c852434b75 chore: add script for local and container debug 2024-08-30 15:51:10 +02:00
Jonathan Kössler
8655e25ec0 Merge branch 'feature/RES-840-add-client-connector-error' into 'master'
fix: add exception handling for ClientConnectorError

Closes RES-840

See merge request redactmanager/image-classification-service!13
2024-08-28 15:46:55 +02:00
Jonathan Kössler
103c19d4cd chore: update pyinfra version 2024-08-28 14:50:39 +02:00
Jonathan Kössler
530001a0af Merge branch 'feature/RES-826-pyinfra-update' into 'master'
chore: bump pyinfra version

Closes RES-826

See merge request redactmanager/image-classification-service!12
2024-08-26 16:15:25 +02:00
Jonathan Kössler
a6c11a9db5 chore: bump pyinfra version 2024-08-26 15:14:34 +02:00
Julius Unverfehrt
1796c1bcbb fix: RED-3813: ensure image hashes are always 25 chars long
The hashing algorithm omits leading bits that carry no information.
Since this proves problematic for later processing, we restore that
information and ensure the hashes are always 25 characters long.
2024-08-22 11:15:41 +02:00
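The fixed-width guarantee from this commit amounts to restoring the omitted leading characters. A minimal sketch (the length of 25 comes from the commit message; the padding character `'0'` and the function name are assumptions):

```python
HASH_LENGTH = 25

def normalize_hash(raw: str) -> str:
    """Left-pad a hash string so it is always HASH_LENGTH characters long.

    The hashing algorithm drops leading characters that carry no
    information; restoring them gives downstream consumers a fixed-width
    hash. The '0' fill character is illustrative.
    """
    if len(raw) > HASH_LENGTH:
        raise ValueError(f"hash longer than {HASH_LENGTH} chars: {raw!r}")
    return raw.rjust(HASH_LENGTH, "0")
```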
Jonathan Kössler
f4b9ff54aa chore: bump pyinfra version 2024-08-22 09:34:40 +02:00
Jonathan Kössler
278b42e368 Merge branch 'bugfix/set-image-tags' into 'master'
fix: version reference

See merge request redactmanager/image-classification-service!11
2024-08-20 09:46:55 +02:00
Jonathan Kössler
9600e4ca23 chore: bump version 2024-08-20 09:34:54 +02:00
Jonathan Kössler
8485345dd1 fix: version reference 2024-08-19 16:32:44 +02:00
Jonathan Kössler
d1a523c7d6 Merge branch 'feature/RES-731-add-queues-per-tenant' into 'master'
RES-731: add queues per tenant

Closes RES-731

See merge request redactmanager/image-classification-service!9
2024-08-19 15:12:03 +02:00
Jonathan Kössler
278f54eaa7 RES-731: add queues per tenant 2024-08-19 15:12:03 +02:00
Julius Unverfehrt
443c2614f9 Merge branch 'RED-9746' into 'master'
fix: add small image filter logic

Closes RED-9746

See merge request redactmanager/image-classification-service!10
2024-08-07 13:50:28 +02:00
Julius Unverfehrt
4102a564a3 fix: add small image filter logic
Introduces a preprocessing step that scans each page for page-sized
images. If one is encountered, all images that are below a configured
ratio with respect to the page size are dropped.

This step has to occur before the image stitching logic, but MIGHT
introduce the problem of dropping image parts that would constitute an
image. This, however, is not solvable, since we want to drop the small
images before further processing: otherwise the faulty character images
are also stitched into a seemingly valid image that in reality isn't one.
2024-08-06 16:52:05 +02:00
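A minimal sketch of the filter described in this commit (the 95% page-size heuristic, the default ratio, and the `(width, height)` tuple representation are all assumptions for illustration):

```python
def filter_small_images(page_images, page_area, min_ratio=0.01):
    """Drop tiny images on pages that contain a page-sized image.

    If no page-sized image is present, the page is left untouched;
    otherwise every image whose area is below min_ratio of the page
    area is removed. Thresholds here are illustrative, not the
    service's configured values.
    """
    has_page_sized = any(w * h >= 0.95 * page_area for (w, h) in page_images)
    if not has_page_sized:
        return page_images
    return [(w, h) for (w, h) in page_images if w * h >= min_ratio * page_area]
```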
Julius Unverfehrt
7f49642ba0 fix: RED-8978: update pyinfra 2024-04-16 16:42:10 +02:00
Julius Unverfehrt
ba8d1dfdfe chore(logger): support spring log levels 2024-02-28 16:34:23 +01:00
Julius Unverfehrt
150d0d64e5 chore(prediction filters): adapt class specific filter logic 2024-02-09 11:36:51 +01:00
Julius Unverfehrt
a024ddfcf7 Merge branch 'RES-534-update-pyinfra' into 'master'
feat(opentel,dynaconf): adapt new pyinfra

Closes RES-534

See merge request redactmanager/image-classification-service!8
2024-02-09 09:59:11 +01:00
Julius Unverfehrt
13cbfa4ddf chore(tests): disable integration test 2024-02-09 09:50:59 +01:00
Julius Unverfehrt
75af55dbda chore(project structure): use src/ structure 2024-02-09 09:47:42 +01:00
Julius Unverfehrt
499c501acf feat(opentel,dynaconf): adapt new pyinfra
Also changes logging to knutils logging.
2024-02-09 09:47:31 +01:00
Julius Unverfehrt
6163e29d6b fix(pdf conversion): repair broken bad x-ref handling 2024-02-08 17:16:41 +01:00
Francisco Schulz
dadc0a4163 Merge branch 'RED-7958-logging-issues-of-python-services' into 'master'
RED-7958 "Logging issues of python services"

See merge request redactmanager/image-classification-service!6
2023-12-12 11:29:46 +01:00
Francisco Schulz
729ce17de0 use .pdf as integration test file 2023-12-11 11:32:14 +01:00
francisco.schulz
88fbe077e6 fix: poetry install --without=dev 2023-12-11 10:40:06 +01:00
francisco.schulz
f8ecef1054 update dependencies 2023-12-11 10:39:27 +01:00
Francisco Schulz
5f44cc6560 use integration test default branch 2023-12-07 13:23:53 +01:00
francisco.schulz
b60f4d0383 use python 3.10 2023-11-28 15:57:53 +01:00
francisco.schulz
87873cc3a3 update dependencies 2023-11-28 15:57:45 +01:00
francisco.schulz
523ca1db7d use latest CI template 2023-11-28 15:57:36 +01:00
Julius Unverfehrt
c25f6902e0 Merge branch 'feature/RED-6685-support-absolute-paths' into 'master'
Upgrade pyinfra (absolute FP support)

Closes RED-6685

See merge request redactmanager/image-classification-service!5
2023-08-23 15:04:59 +02:00
Julius Unverfehrt
9e336ecc01 Upgrade pyinfra (absolute FP support)
- Update pyinfra with absolute file path support (still supports
  dossierID fileID format)
- Update CI, use new template
2023-08-23 14:53:40 +02:00
Julius Unverfehrt
0efa2127d7 Merge branch 'fix/RED-7388-nack-message-if-processing-failure' into 'master'
Adjust error handling of processing sub-process

Closes RED-7388

See merge request redactmanager/image-classification-service!4
2023-08-17 13:40:11 +02:00
Julius Unverfehrt
501fd48d69 Adjust error handling of processing sub-process
Removes the exception catching when collecting the subprocess result,
which led to the service silently skipping over failed file processing.

Now the sub-process doesn't return any results if it failed. It is
ensured that an empty result is still returned if no images were present
in the file to process.
2023-08-17 13:26:27 +02:00
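The distinction between "processing failed" and "no images found" could be sketched like this (`process_file` is a hypothetical stand-in for the real per-file pipeline call; the function name is illustrative):

```python
def collect_result(process_file, path):
    """Run the per-file processing step, distinguishing failure from
    an empty result.

    Returns None on failure, so the caller can nack the message instead
    of silently acking it; returns a list (possibly empty) on success,
    e.g. when the file simply contains no images.
    """
    try:
        results = process_file(path)
        return results if results is not None else []
    except Exception:
        return None
```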
Julius Unverfehrt
4a825cb264 Merge branch 'RES-196-red-hotfix-persistent-service-address' into 'master'
Resolve RES-196 "Red hotfix persistent service address"

Closes RES-196

See merge request redactmanager/image-classification-service!2
2023-06-26 12:56:13 +02:00
francisco.schulz
694a6ccb33 copy test dir into container 2023-06-22 12:03:09 +02:00
francisco.schulz
1d043f97fc add ipykernel 2023-06-22 12:02:50 +02:00
francisco.schulz
7cac73f07b update dependencies, pyinfra@1.5.9 2023-06-21 15:43:24 +02:00
francisco.schulz
133fde67ba add startup probe script 2023-06-21 15:42:33 +02:00
francisco.schulz
946cfff630 add docker scripts 2023-06-21 15:42:21 +02:00
francisco.schulz
f73264874e copy scripts 2023-06-21 15:42:14 +02:00
francisco.schulz
d3868efb4e update CI 2023-06-19 11:49:35 +02:00
francisco.schulz
f0c2282197 increment version 2023-06-19 11:35:16 +02:00
francisco.schulz
57e1ec1a14 increment version 2023-06-19 11:26:22 +02:00
francisco.schulz
8b9771373b copy image_prediction folder, not just files 2023-06-19 11:26:15 +02:00
francisco.schulz
cd3ce653e1 formatting & add pymonad 2023-06-19 11:11:29 +02:00
Julius Unverfehrt
d8075aad38 Update pyinfra to support new tenant endpoint 2023-06-15 16:59:47 +02:00
Francisco Schulz
2b3043bc1e Merge branch 'RES-141-migrate-red-image-service' into 'master'
Resolve RES-141 "Migrate red image service"

Closes RES-141

See merge request redactmanager/image-classification-service!1
2023-06-14 16:01:41 +02:00
Matthias Bisping
3ad0345f4e Remove unused dependency 'pdf2img' 2023-06-14 12:55:47 +02:00
francisco.schulz
134156f59d update project name to image-classification-service 2023-06-12 13:02:38 +02:00
francisco.schulz
1205f2e0ed update 2023-06-07 17:44:54 +02:00
francisco.schulz
8ee966c721 add CI 2023-06-07 17:44:41 +02:00
francisco.schulz
892742ef17 update 2023-06-07 17:44:35 +02:00
francisco.schulz
06b1af9f1a update 2023-06-07 17:44:30 +02:00
francisco.schulz
0194ce3f7e add setup convenience script 2023-06-07 17:44:23 +02:00
francisco.schulz
41d08f7b5b update dependencies 2023-06-07 17:42:25 +02:00
francisco.schulz
b91d5a0ab2 move to script folder 2023-06-07 17:42:11 +02:00
francisco.schulz
7b37f3c913 update Dockerfiles 2023-06-07 17:41:54 +02:00
francisco.schulz
c32005b841 remove old CI files 2023-06-07 17:12:14 +02:00
Julius Unverfehrt
6406ce6b25 Pull request #48: RED-6273 multi tenant storage
Merge in RR/image-prediction from RED-6273-multi-tenant-storage to master

* commit '4ecafb29770b7392462c71d79550c5f788cb36e6':
  update pyinfra version with removed falsy dependencies from pyinfra
  update pyinfra for bugfix
  Update pyinfra for multi-tenancy support
2023-03-28 18:11:09 +02:00
Julius Unverfehrt
4ecafb2977 update pyinfra version with removed falsy dependencies from pyinfra 2023-03-28 18:03:38 +02:00
Julius Unverfehrt
967c2fad1b update pyinfra for bugfix 2023-03-28 17:27:57 +02:00
Julius Unverfehrt
b74e79f113 Update pyinfra for multi-tenancy support 2023-03-28 15:36:05 +02:00
Julius Unverfehrt
50c791f6ca Pull request #47: update pyinfra with fixed prometheus port
Merge in RR/image-prediction from bugfix/RED-6205-prometheus-port to master

* commit 'adb363842dff3d43b3a0dc499daa16588d34233c':
  update pyinfra with fixed prometheus port
2023-03-21 16:08:27 +01:00
Julius Unverfehrt
adb363842d update pyinfra with fixed prometheus port 2023-03-21 16:01:39 +01:00
Julius Unverfehrt
81520b1a53 Pull request #46: RED-6205 add prometheus monitoring
Merge in RR/image-prediction from RED-6205-add-prometheus-monitoring to master

Squashed commit of the following:

commit 6932b5ee579a31d0317dc3f76acb8dd2845fdb4b
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Mar 16 17:30:57 2023 +0100

    update pyinfra

commit d6e55534623eae2edcddaa6dd333f93171d421dc
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Mar 16 16:30:14 2023 +0100

    set pyinfra subproject to current master commit

commit 030dc660e6060ae326c32fba8c2944a10866fbb6
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Mar 16 16:25:19 2023 +0100

    adapt serve script to advanced pyinfra API including monitoring of the processing time of images.

commit 0fa0c44c376c52653e517d257a35793797f7be31
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Mar 16 15:19:57 2023 +0100

    Update dockerfile to work with the new pyinfra package setup utilizing pyproject.toml instead of setup.py and requirements.txt

commit aad53c4d313f908de93a13e69e2cb150db3be6cb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Mar 16 14:16:04 2023 +0100

    remove no longer needed dependencies
2023-03-17 16:12:59 +01:00
Shamel Hussain
ed25af33ad Pull request #45: RED-5718: Revert user changes to allow using a random id
Merge in RR/image-prediction from RED-5718-revertUser to master

* commit '1967945ff7550d706295a1a46f50393959852773':
  RED-5718: Revert user changes to allow using a random id
2023-02-28 14:56:23 +01:00
shamel-hussain
1967945ff7 RED-5718: Revert user changes to allow using a random id 2023-02-28 12:08:32 +01:00
Shamel Hussain
faf4d7ed0f Pull request #44: RED-5718: Add a specific user to the image-prediction service
Merge in RR/image-prediction from RED-5718-imageUser to master

* commit '7c7b038491b39d2162e901a5a0ef62b2f1ebd4a9':
  RED-5718: Add a specific user to the image-prediction service
2023-02-27 09:26:12 +01:00
shamel-hussain
7c7b038491 RED-5718: Add a specific user to the image-prediction service 2023-02-27 09:19:37 +01:00
Julius Unverfehrt
cd3e215776 Pull request #43: upgrade references
Merge in RR/image-prediction from RED-6118-multi-tenancy-patch to master

* commit 'bc1bd96e6c8fe904f0fc61a5701cd03dd369806c':
  upgrade references
2023-02-16 16:52:23 +01:00
Julius Unverfehrt
bc1bd96e6c upgrade references 2023-02-16 16:50:59 +01:00
Francisco Schulz
2001e9d7f3 Pull request #42: Bugfix/RED-5277 heartbeat
Merge in RR/image-prediction from bugfix/RED-5277-heartbeat to master

* commit '846f127d3ba75c1be124ddc780a4f9c849dc84af':
  update reference
  fix type
  remove commented out code
  import logger from `__init__.py`
  add log config to `__init__.py`
  remove extra stream handler
  update reference
  update reference
  update reference
  update reference
  update reference
  build dev image and push to nexus
  add logging & only return one object from `process_request()`
  cache loaded pipeline & disable tqdm output by default
  format + set verbose to False by default
  update
2023-02-16 09:54:07 +01:00
Francisco Schulz
846f127d3b update reference 2023-02-16 09:50:17 +01:00
Francisco Schulz
d4657f1ab1 fix type 2023-02-15 16:46:47 +01:00
Francisco Schulz
ee99d76aab remove commented out code 2023-02-15 15:51:33 +01:00
Francisco Schulz
00b40c0632 import logger from __init__.py 2023-02-15 15:45:20 +01:00
Francisco Schulz
c1ae8e6a4b add log config to __init__.py 2023-02-15 15:44:56 +01:00
Francisco Schulz
0bdf5a726a remove extra stream handler 2023-02-15 15:25:13 +01:00
Francisco Schulz
d505ac4e50 update reference 2023-02-15 15:01:29 +01:00
Francisco Schulz
7dca05a53d update reference 2023-02-15 11:11:23 +01:00
Francisco Schulz
c1449134ec update reference 2023-02-15 10:23:27 +01:00
Francisco Schulz
29c76e7ebf update reference 2023-02-14 18:02:09 +01:00
Francisco Schulz
ecc9f69d9c update reference 2023-02-14 16:52:56 +01:00
Francisco Schulz
4bcadcd266 build dev image and push to nexus 2023-02-14 16:30:18 +01:00
Francisco Schulz
9065ec1d12 add logging & only return one object from process_request() 2023-02-14 16:29:04 +01:00
Francisco Schulz
d239368d70 cache loaded pipeline & disable tqdm output by default 2023-02-14 16:27:21 +01:00
Francisco Schulz
b5dc5aa777 format + set verbose to False by default 2023-02-14 16:26:24 +01:00
Francisco Schulz
54b7ba24e8 update 2023-02-14 16:25:49 +01:00
Julius Unverfehrt
463f4da92b Pull request #41: RED-6189 bugfix
Merge in RR/image-prediction from RED-6189-bugfix to master

* commit '79455f0dd6da835ef2261393c5a57ba8ef2550ab': (25 commits)
  revert refactoring changes
  replace image extraction logic final
  introduce normalizing function for image extraction
  refactoring
  adjust behavior of filtering of invalid images
  add log in callback to display which file is processed
  add ad hoc logic for bad xref handling
  beautify
  beautify
  implement ad hoc channel count detection for new image extraction
  improve performance
  refactor scanned page filtering
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor scanned page filtering WIP
  refactor
  ...
2023-02-13 17:35:04 +01:00
Julius Unverfehrt
79455f0dd6 Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into RED-6189-bugfix 2023-02-13 17:23:07 +01:00
Julius Unverfehrt
2bc9c24f6a revert refactoring changes
- revert functional refactoring changes to be able
to determine where the error described in the ticket comes from
- change array normalization to dimensionally
sparse arrays to reduce memory consumption
2023-02-13 13:53:35 +01:00
Julius Unverfehrt
ea301b4df2 Pull request #40: replace trace log level by debug
Merge in RR/image-prediction from adjust-falsy-loglevel to master

Squashed commit of the following:

commit 66794acb1a64be6341f98c7c0ce0bc202634a9f4
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Fri Feb 10 10:15:41 2023 +0100

    replace trace log level by debug

    - trace method is not supported by the built-in logging module
2023-02-10 10:18:38 +01:00
Matthias Bisping
5cdf93b923 Pull request #39: RED-6084 Improve image extraction speed
Merge in RR/image-prediction from RED-6084-adhoc-scanned-pages-filtering-refactoring to master

Squashed commit of the following:

commit bd6d83e7363b1c1993babcceb434110a6312c645
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Thu Feb 9 16:08:25 2023 +0100

    Tweak logging

commit 55bdd48d2a3462a8b4a6b7194c4a46b21d74c455
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Thu Feb 9 15:47:31 2023 +0100

    Update dependencies

commit 970275b25708c05e4fbe78b52aa70d791d5ff17a
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Thu Feb 9 15:35:37 2023 +0100

    Refactoring

    Make alpha channel check monadic to streamline error handling

commit e99e97e23fd8ce16f9a421d3e5442fccacf71ead
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Tue Feb 7 14:32:29 2023 +0100

    Refactoring

    - Rename
    - Refactor image extraction functions

commit 76b1b0ca2401495ec03ba2b6483091b52732eb81
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Tue Feb 7 11:55:30 2023 +0100

    Refactoring

commit cb1c461049d7c43ec340302f466447da9f95a499
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Tue Feb 7 11:44:01 2023 +0100

    Refactoring

commit 092069221a85ac7ac19bf838dcbc7ab1fde1e12b
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Tue Feb 7 10:18:53 2023 +0100

    Add to-do

commit 3cea4dad2d9703b8c79ddeb740b66a3b8255bb2a
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Tue Feb 7 10:11:35 2023 +0100

    Refactoring

    - Rename
    - Add typehints everywhere

commit 865e0819a14c420bc2edff454d41092c11c019a4
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 19:38:57 2023 +0100

    Add type explanation

commit 01d3d5d33f1ccb05aea1cec1d1577572b1a4deaa
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 19:37:49 2023 +0100

    Formatting

commit dffe1c18fc3a322a6b08890d4438844e8122faaf
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 19:34:13 2023 +0100

    [WIP] Either refactoring

    Add alternative formulation for monadic chain

commit 066cf17add404a313520cd794c06e3264cf971c9
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 18:40:30 2023 +0100

    [WIP] Either refactoring

commit f53f0fea298cdab88deb090af328b34d37e0198e
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 18:18:34 2023 +0100

    [WIP] Either refactoring

    Propagate error and metadata

commit 274a5f56d4fcb9c67fac5cf43e9412ec1ab5179e
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 17:51:35 2023 +0100

    [WIP] Either refactoring

    Fix test assertion

commit 3235a857f6e418e50484cbfff152b0f63efb2f53
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 16:57:31 2023 +0100

    [WIP] Either-refactoring

    Replace Maybe with Either to allow passing on error information or
    metadata which otherwise get sucked up by Nothing.

commit 89989543d87490f8b20a0a76055605d34345e8f4
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 16:12:40 2023 +0100

    [WIP] Monadic refactoring

    Integrate image validation step into monadic chain.

    At the moment we lost the error information through this. Refactoring to
    Either monad can bring it back.

commit 022bd4856a51aa085df5fe983fd77b99b53d594c
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 15:16:41 2023 +0100

    [WIP] Monadic refactoring

commit ca3898cb539607c8c3dd01c57e60211a5fea8a7d
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 15:10:34 2023 +0100

    [WIP] Monadic refactoring

commit d8f37bed5cbd6bdd2a0b52bae46fcdbb50f9dff2
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 15:09:51 2023 +0100

    [WIP] Monadic refactoring

commit 906fee0e5df051f38076aa1d2725e52a182ade13
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Feb 6 15:03:35 2023 +0100

    [WIP] Monadic refactoring

... and 35 more commits
2023-02-10 08:33:13 +01:00
Julius Unverfehrt
4d43e385c5 replace image extraction logic final 2023-02-06 09:43:28 +01:00
Julius Unverfehrt
bd0279ddd1 introduce normalizing function for image extraction 2023-02-03 12:25:27 +01:00
Julius Unverfehrt
2995d5ee48 refactoring 2023-02-03 11:14:14 +01:00
Julius Unverfehrt
eff1bb4124 adjust behavior of filtering of invalid images 2023-02-03 09:04:02 +01:00
Julius Unverfehrt
c478333111 add log in callback to display which file is processed 2023-02-03 08:25:36 +01:00
Julius Unverfehrt
978f48e8f9 add ad hoc logic for bad xref handling 2023-02-02 15:39:44 +01:00
Julius Unverfehrt
94652aafe4 beautify 2023-02-02 15:26:33 +01:00
Julius Unverfehrt
c4416636c0 beautify 2023-02-02 14:10:32 +01:00
Julius Unverfehrt
c0b41e77b8 implement ad hoc channel count detection for new image extraction 2023-02-02 13:57:56 +01:00
Julius Unverfehrt
73f7491c8f improve performance
- disable the scanned page filter, since dropping these pages disables
the computation of the image hash and the frontend OCR hint, which are
both wanted
- optimize image extraction by using arrays instead of byte streams for
the conversion to PIL images
2023-02-02 13:37:03 +01:00
Julius Unverfehrt
2385584dcb refactor scanned page filtering 2023-02-01 15:49:36 +01:00
Julius Unverfehrt
b880e892ec refactor scanned page filtering WIP 2023-02-01 15:47:40 +01:00
Julius Unverfehrt
8c7349c2d1 refactor scanned page filtering WIP 2023-02-01 15:36:16 +01:00
Julius Unverfehrt
c55777e339 refactor scanned page filtering WIP 2023-02-01 15:16:12 +01:00
Julius Unverfehrt
0f440bdb09 refactor scanned page filtering WIP 2023-02-01 15:14:27 +01:00
Julius Unverfehrt
436a32ad2b refactor scanned page filtering WIP 2023-02-01 15:07:35 +01:00
Julius Unverfehrt
9ec6cc19ba refactor scanned page filtering WIP 2023-02-01 14:53:26 +01:00
Julius Unverfehrt
2d385b0a73 refactor scanned page filtering WIP 2023-02-01 14:38:55 +01:00
Julius Unverfehrt
5bd5e0cf2b refactor
- reduce code duplication by adapting functions of the module
- use the module's enums for image metadata
- improve readability of the scanned page detection heuristic
2023-02-01 12:43:59 +01:00
Julius Unverfehrt
876260f403 improve the readability of variable names and docstrings 2023-02-01 10:08:36 +01:00
Julius Unverfehrt
368c54a8be clean-up filter logic
- Logic adapted so that it can potentially be
easily removed again from the extraction logic
2023-02-01 08:49:30 +01:00
Julius Unverfehrt
1490d27308 introduce adhoc filter for scanned pages 2023-01-31 17:18:28 +01:00
Julius Unverfehrt
4eb7f3c40a rename publishing flag 2023-01-31 10:37:27 +01:00
Julius Unverfehrt
98dc001123 revert adhoc figure detection changes
- revert the pipeline and serve logic to the state before the figure
detection data for image extraction changes: figure detection data as
input is not supported for now
2023-01-30 12:41:22 +01:00
Francisco Schulz
25fc7d84b9 Pull request #38: update dependencies
Merge in RR/image-prediction from fschulz/update-to-new-pyinfra-version to master

* commit 'd63f8c4eaf39ef7346188b585fb9d968de72db87':
  update dependencies
2022-10-13 15:33:53 +02:00
Francisco Schulz
d63f8c4eaf update dependencies 2022-10-13 15:23:27 +02:00
Viktor Seifert
549b2aac5c Pull request #37: RED-5324: Update pyinfra to include storage-region fix
Merge in RR/image-prediction from RED-5324 to master

* commit 'c72ef26a6caac8d87cdc08dd19dbe235247129d4':
  RED-5324: Update pyinfra to include storage-region fix
2022-09-30 15:27:03 +02:00
Viktor Seifert
c72ef26a6c RED-5324: Update pyinfra to include storage-region fix 2022-09-30 15:24:18 +02:00
Julius Unverfehrt
561a7f527c Pull request #36: RED-4206 wrap queue callback in process to manage memory allocation with the operating system and force deallocation after processing.
Merge in RR/image-prediction from RED-4206-fix-unwanted-restart-bug to master

Squashed commit of the following:

commit 3dfe7b861816ef9019103e16a23efd97a08fb617
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Sep 22 13:53:32 2022 +0200

    RED-4206 wrap queue callback in process to manage memory allocation with the operating system and force deallocation after processing.
2022-09-22 13:56:44 +02:00
Julius Unverfehrt
48dd52131d Pull request #35: update test dockerfile
Merge in RR/image-prediction from make-sec-build-work to master

Squashed commit of the following:

commit 08149d3a99681f4900a7d4b6a5f656b1c25ebdb3
Merge: 76b5a45 0538377
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Sep 21 13:43:24 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into make-sec-build-work

commit 76b5a4504adc709107af9e5958970ec24ae3f5ef
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Sep 21 13:41:46 2022 +0200

    update test dockerfile
2022-09-21 13:47:40 +02:00
Christoph Schabert
053837722b Pull request #34: hotfix: fix key prepare
Merge in RR/image-prediction from hotfix/keyPrep to master

* commit '98e639d83f72f0cde34cb9c009d84ed4e3b0d138':
  hotfix: fix key prepare
2022-09-20 11:36:11 +02:00
cschabert
98e639d83f hotfix: fix key prepare 2022-09-20 11:34:55 +02:00
Julius Unverfehrt
13d4427c78 Pull request #33: RED-5202 port hotfixes
Merge in RR/image-prediction from RED-5202-port-hotfixes to master

Squashed commit of the following:

commit 9674901235264de6b74d679fd39a52775ac4aee1
Merge: ec2ab89 9763d2c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:55:58 2022 +0200

    Merge remote-tracking branch 'origin' into RED-5202-port-hotfixes

commit ec2ab890b8307942d147d6b8b236f6a3c1d0aebc
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:49:17 2022 +0200

    swap case when the log is printed for env var parsing

commit aaa02ea35e9c1b3b307116d7e3e32c93fd79ef5d
Merge: 5d87066 521222e
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:28:39 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into RED-5202-port-hotfixes

commit 5d87066b40b28f919b1346f5e5396b46445b4e00
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:25:01 2022 +0200

    remove warning log for non existent non default env var

commit 23c61ef49ef918b29952150d4a6e61b99d60ac64
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:14:19 2022 +0200

    make env var parser discrete

commit c1b92270354c764861da0f7782348e9cd0725d76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:28:44 2022 +0200

    fixed statefulness issue with os.environ in tests

commit ad9c5657fe93079d5646ba2b70fa091e8d2daf76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:04:55 2022 +0200

    - Adapted response formatting logic for threshold maps passed via env vars.
    - Added test for reading threshold maps and values from env vars.

commit c60e8cd6781b8e0c3ec69ccd0a25375803de26f0
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 11:38:01 2022 +0200

    add parser for environment variables WIP

commit 101b71726c697f30ec9298ba62d2203bd7da2efb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:52:33 2022 +0200

    Add typehints; make the custom page quotient breach function private, since the intention of factoring it out of build_image_info is to make it testable separately

commit 04aee4e62781e78cd54c6d20e961dcd7bf1fc081
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:25:59 2022 +0200

    DotIndexable default get method exception made more specific

commit 4584e7ba66400033dc5f1a38473b644eeb11e67c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:55:05 2022 +0200

    RED-5202 port the temporary broken-image handling so the hotfix won't be lost by upgrading the service. A proper solution is still desirable (see RED-5148)

commit 5f99622646b3f6d3a842aebef91ff8e082072cd6
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:47:02 2022 +0200

    RED-5202 add per class customizable max image to page quotient setting for signatures, default is 0.4. Can be overwritten by , set to null to use default value or set to value that should be used.
2022-09-12 15:59:50 +02:00
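The last commit above adds a per-class "max image to page quotient" setting with a default of 0.4. Stripped of the configuration plumbing, the check boils down to something like the following sketch (function and parameter names are assumptions for illustration, not the service's actual code):

```python
def breaches_page_quotient(image_area: float, page_area: float,
                           max_quotient: float = 0.4) -> bool:
    """Return True when the image covers more of the page than allowed.

    0.4 is the default quotient mentioned in the commit message;
    a per-class override would be passed in as max_quotient.
    """
    return (image_area / page_area) > max_quotient
```

With the default, an image covering half the page breaches the quotient, while one covering a third does not.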
Julius Unverfehrt
9763d2ca65 Pull request #32: RED-5202 port hotfixes
Merge in RR/image-prediction from RED-5202-port-hotfixes to master

Squashed commit of the following:

commit aaa02ea35e9c1b3b307116d7e3e32c93fd79ef5d
Merge: 5d87066 521222e
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:28:39 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into RED-5202-port-hotfixes

commit 5d87066b40b28f919b1346f5e5396b46445b4e00
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:25:01 2022 +0200

    remove warning log for non existent non default env var

commit 23c61ef49ef918b29952150d4a6e61b99d60ac64
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 15:14:19 2022 +0200

    make env var parser discrete

commit c1b92270354c764861da0f7782348e9cd0725d76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:28:44 2022 +0200

    fixed statefulness issue with os.environ in tests

commit ad9c5657fe93079d5646ba2b70fa091e8d2daf76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:04:55 2022 +0200

    - Adapted response formatting logic for threshold maps passed via env vars.
    - Added test for reading threshold maps and values from env vars.

commit c60e8cd6781b8e0c3ec69ccd0a25375803de26f0
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 11:38:01 2022 +0200

    add parser for environment variables WIP

commit 101b71726c697f30ec9298ba62d2203bd7da2efb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:52:33 2022 +0200

    Add typehints, make custom page quotient breach function private since the intention of outsourcing it from build_image_info is to make it testable separately

commit 04aee4e62781e78cd54c6d20e961dcd7bf1fc081
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:25:59 2022 +0200

    DotIndexable default get method exception made more specific

commit 4584e7ba66400033dc5f1a38473b644eeb11e67c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:55:05 2022 +0200

    RED-5202 port temporary broken image handling so the hotfix won't be lost by upgrading the service. A proper solution is still desirable (see RED-5148)

commit 5f99622646b3f6d3a842aebef91ff8e082072cd6
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:47:02 2022 +0200

    RED-5202 add per class customizable max image to page quotient setting for signatures, default is 0.4. Can be overwritten by , set to null to use default value or set to value that should be used.
2022-09-12 15:29:47 +02:00
Julius Unverfehrt
521222eb96 Pull request #31: RED-5202 port hotfixes
Merge in RR/image-prediction from RED-5202-port-hotfixes to master

Squashed commit of the following:

commit c1b92270354c764861da0f7782348e9cd0725d76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:28:44 2022 +0200

    fixed statefulness issue with os.environ in tests

commit ad9c5657fe93079d5646ba2b70fa091e8d2daf76
Author: Matthias Bisping <matthias.bisping@axbit.com>
Date:   Mon Sep 12 13:04:55 2022 +0200

    - Adapted response formatting logic for threshold maps passed via env vars.
    - Added test for reading threshold maps and values from env vars.

commit c60e8cd6781b8e0c3ec69ccd0a25375803de26f0
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 11:38:01 2022 +0200

    add parser for environment variables WIP

commit 101b71726c697f30ec9298ba62d2203bd7da2efb
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:52:33 2022 +0200

    Add typehints, make custom page quotient breach function private since the intention of outsourcing it from build_image_info is to make it testable separately

commit 04aee4e62781e78cd54c6d20e961dcd7bf1fc081
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 09:25:59 2022 +0200

    DotIndexable default get method exception made more specific

commit 4584e7ba66400033dc5f1a38473b644eeb11e67c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:55:05 2022 +0200

    RED-5202 port temporary broken image handling so the hotfix won't be lost by upgrading the service. A proper solution is still desirable (see RED-5148)

commit 5f99622646b3f6d3a842aebef91ff8e082072cd6
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Sep 12 08:47:02 2022 +0200

    RED-5202 add per class customizable max image to page quotient setting for signatures, default is 0.4. Can be overwritten by , set to null to use default value or set to value that should be used.
2022-09-12 14:49:56 +02:00
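Several commits in the squashes above add a parser for threshold maps passed via environment variables, where `null` falls back to the built-in default. A minimal sketch of that idea (the variable name, JSON format, and defaults are assumptions, not the service's actual implementation):

```python
import json
import os

# Illustrative defaults; the real service's classes and values may differ.
DEFAULTS = {"signature": 0.5, "logo": 0.5}


def thresholds_from_env(var: str = "IPS_THRESHOLDS",
                        defaults: dict = DEFAULTS) -> dict:
    """Merge a JSON threshold map from the environment over defaults.

    A JSON null for a class means "use the default"; an absent
    variable leaves all defaults untouched.
    """
    raw = os.environ.get(var)
    if raw is None:
        return dict(defaults)
    merged = {**defaults, **json.loads(raw)}
    return {cls: defaults[cls] if value is None else value
            for cls, value in merged.items()}
```

For example, `IPS_THRESHOLDS='{"signature": 0.8, "logo": null}'` would override the signature threshold while keeping the default for logos.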
Julius Unverfehrt
ebfdc14265 Pull request #25: RED-5009 update pyinfra to support message rejection on unobtainable files
Merge in RR/image-prediction from RED-5009-update-pyinfra to master

* commit 'e54819e687b4515c0031df431e26bee033359099':
  RED-5009 update pyinfra to support message rejection on unobtainable files
2022-08-24 15:23:59 +02:00
Julius Unverfehrt
e54819e687 RED-5009 update pyinfra to support message rejection on unobtainable files 2022-08-24 15:21:53 +02:00
Julius Unverfehrt
d1190f7efe Pull request #24: queue callback: add storage lookup for input file, add should_publish flag to signal processing success to queue manager
Merge in RR/image-prediction from RED-5009-extend-callback to master

Squashed commit of the following:

commit 5ed02af09812783c46c2fb47832fe3a02344aa03
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 23 10:56:37 2022 +0200

    queue callback: add storage lookup for input file, add should_publish flag to signal processing success to queue manager
2022-08-23 12:47:49 +02:00
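Pull request #24 above describes the callback contract: look up the input file on storage and return a `should_publish` flag so the queue manager knows whether processing succeeded. A minimal sketch of that contract (all names are assumptions for illustration, not pyinfra's actual API):

```python
def queue_callback(message: dict, storage) -> dict:
    """Process one queue message; signal success via should_publish."""
    # Storage lookup for the input file referenced by the message.
    file_bytes = storage.get(message["fileId"])
    if file_bytes is None:
        # Unobtainable file: tell the queue manager not to publish
        # a result (it may then reject or dead-letter the message).
        return {"should_publish": False, "result": None}
    result = {"size": len(file_bytes)}  # stand-in for real classification
    return {"should_publish": True, "result": result}
```

The flag cleanly separates "processed, publish the result" from "could not process, let the queue layer decide what to do with the message".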
Julius Unverfehrt
d13b8436e2 Pull request #23: add pdf2image & pyinfra installation
Merge in RR/image-prediction from update-build-scripts to master

Squashed commit of the following:

commit 4a5b21d6e6e0d76091443ba3faaad15953855bad
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Aug 18 15:08:13 2022 +0200

    add pdf2image & pyinfra installation
2022-08-18 15:09:51 +02:00
Julius Unverfehrt
520eee26e3 Pull request #22: Integrate image extraction new pyinfra
Merge in RR/image-prediction from integrate-image-extraction-new-pyinfra to master

Squashed commit of the following:

commit 8470c065c71ea2a985aadfc399fb32c693e3a90f
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Aug 18 09:19:52 2022 +0200

    add key script

commit 8f6eb1e79083fb32fb7bedac640c10b6fd411899
Merge: 27fd7de c1b9629
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Thu Aug 18 09:17:50 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into integrate-image-extraction-new-pyinfra

commit 27fd7de39a59d0d88fbddb471dd7797b61223ece
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 17 13:15:09 2022 +0200

    update pyinfra

commit ca58f85642598dc15e286074982e7cedae9a1355
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Aug 16 16:16:10 2022 +0200

    update pdf2image-service

commit f43795cee0e211e14ac5f9296b01d440ae759c55
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Aug 15 10:32:02 2022 +0200

    update pipeline script to also work with figure detection metadata

commit 2b2da1b60ce56fb006cf2f6b65aeda9774391b2a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Fri Aug 12 13:37:48 2022 +0200

    add new pyinfra, add optional image classification under key dataCV if figure metadata is present on storage

commit bae25bedbd3a262a9d00e18a1b19f4ee6f1eb924
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 10 13:27:41 2022 +0200

    tidy-up

commit 287b0ebc8a952e506185d13508eaa386d0420704
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 10 12:57:35 2022 +0200

    update server logic for new pyinfra, add extraction from scanned PDF with figure detection logic

commit 3225cefaa25e4559b105397bc06c867a22806ba8
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 10 10:37:31 2022 +0200

    integrate new pyinfra logic

commit 46926078342b0680a7416560bb69bec037cf8038
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 3 13:15:27 2022 +0200

    add image extraction for scanned PDFs WIP

commit 1b3b11b6f9044d44cb9a822a78197a2ebc6f306a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Wed Aug 3 09:41:06 2022 +0200

    add pyinfra and pdf2image as git submodule
2022-08-18 09:20:48 +02:00
Christoph Schabert
c1b96290df Pull request #20: RED-4758: adjust to new buildjob
Merge in RR/image-prediction from RED-4758 to master

* commit '3405a34893e0b45e4eabc7d78380b529f5ef2aa4':
  RED-4758: adjust to new buildjob
2022-08-03 15:43:54 +02:00
cschabert
3405a34893 RED-4758: adjust to new buildjob 2022-08-03 14:59:07 +02:00
Matthias Bisping
f787b957f8 Pull request #18: Docstrfix
Merge in RR/image-prediction from docstrfix to master

Squashed commit of the following:

commit 8ccb07037074cc88ba5b72e4bedd5bc346eb0256
Merge: 77cd0a8 5d611d5
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Jul 4 11:50:52 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into docstrfix

commit 77cd0a860a69bfb8f4390dabdca23455b340bd9e
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Jul 4 11:46:25 2022 +0200

    fixed docstring

commit eb53464ca9f1ccf881d90ece592ad50226decd7a
Merge: 4efb9c7 fd0e4dc
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Jun 21 15:22:03 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction

commit 4efb9c79b10f23fa556ce43c8e7f05944dae1af6
Merge: 84a8b0a 9f18ef9
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 11:51:30 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction

commit 84a8b0a290081616240c3876f8db8a1ae8592096
Merge: 1624ee4 6030f40
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 10:18:56 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction

commit 1624ee40376b84a4519025343f913120c464407a
Author: Matthias Bisping <Matthias.Bisping@iqser.com>
Date:   Mon Apr 25 16:51:13 2022 +0200

    Pull request #11: fixed assignment

    Merge in RR/image-prediction from image_prediction_service_overhaul_xref_and_empty_result_fix_fix to master

    Squashed commit of the following:

    commit 7312e57d1127b081bfdc6e96311e8348d3f8110d
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 16:45:12 2022 +0200

        logging setup changed

    commit 955e353d74f414ee2d57b234bdf84d32817d14bf
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 16:37:52 2022 +0200

        fixed assignment
2022-07-04 11:57:01 +02:00
Julius Unverfehrt
5d611d5fae Pull request #17: RED-4329 add prometheus
Merge in RR/image-prediction from RED-4329-add-prometheus to master

Squashed commit of the following:

commit 7fcf256c5277a3cfafcaf76c3116e3643ad01fa4
Merge: 8381621 c14d00c
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:41:14 2022 +0200

    Merge branch 'master' into RED-4329-add-prometheus

commit 8381621ae08b1a91563c9c655020ec55bb58ecc5
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:24:50 2022 +0200

    add prometheus endpoint

commit 26f07088b0a711b6f9db0974f5dfc8aa8ad4e1dc
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:14:34 2022 +0200

    refactor

commit c563aa505018f8a14931a16a9061d361b5d4c383
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:10:19 2022 +0200

    test bamboo build

commit 2b8446e703617c6897b6149846f2548ec292a9a1
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 14:40:44 2022 +0200

    RED-4329 add prometheus endpoint with summary metric
2022-06-21 15:49:01 +02:00
Julius Unverfehrt
c14d00cac8 Pull request #16: restore master
Merge in RR/image-prediction from restore-master to master

Squashed commit of the following:

commit 937968241f08281859be4304bdb0d8eff49f3678
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:30:41 2022 +0200

    refactor

commit b8b84548ef187bfcb88ef97fda74508e37dfb967
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 15:21:37 2022 +0200

    restore master
2022-06-21 15:39:37 +02:00
Julius Unverfehrt
fd0e4dc3cf Pull request #14: RED-4329 add prometheus endpoint with summary metric
Merge in RR/image-prediction from RED-4329-add-prometheus to master

Squashed commit of the following:

commit 2b8446e703617c6897b6149846f2548ec292a9a1
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Tue Jun 21 14:40:44 2022 +0200

    RED-4329 add prometheus endpoint with summary metric
2022-06-21 14:57:39 +02:00
Matthias Bisping
9f18ef9cd1 Pull request #13: Image representation info
Merge in RR/image-prediction from image_representation_metadata to master

Squashed commit of the following:

commit bfe92b24a2959a72c0e913ef051476c01c285ad0
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 11:24:12 2022 +0200

    updated comment

commit f5721560f3fda05a8ad45d0b5e406434204c1177
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 11:16:02 2022 +0200

    unskip server predict test

commit 41d94199ede7d58427b9e9541605a94f962c3dc4
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 11:15:48 2022 +0200

    added hash image encoder that produces representations by hashing

commit 84a8b0a290081616240c3876f8db8a1ae8592096
Merge: 1624ee4 6030f40
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 10:18:56 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction

commit 1624ee40376b84a4519025343f913120c464407a
Author: Matthias Bisping <Matthias.Bisping@iqser.com>
Date:   Mon Apr 25 16:51:13 2022 +0200

    Pull request #11: fixed assignment

    Merge in RR/image-prediction from image_prediction_service_overhaul_xref_and_empty_result_fix_fix to master

    Squashed commit of the following:

    commit 7312e57d1127b081bfdc6e96311e8348d3f8110d
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 16:45:12 2022 +0200

        logging setup changed

    commit 955e353d74f414ee2d57b234bdf84d32817d14bf
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 16:37:52 2022 +0200

        fixed assignment
2022-05-12 11:49:19 +02:00
Matthias Bisping
6030f4055a Pull request #12: Image prediction service overhaul xref and empty result fix fix
Merge in RR/image-prediction from image_prediction_service_overhaul_xref_and_empty_result_fix_fix to master

Squashed commit of the following:

commit 1dfa95b3e2875d58d19639a2110ba50a46e949aa
Merge: c9cad0e eb050a5
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Thu May 12 10:13:40 2022 +0200

    Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction into image_prediction_service_overhaul_xref_and_empty_result_fix_fix

commit c9cad0eda55c32e4cb0b601679e39d4962b4b485
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 17:06:59 2022 +0200

    logging setup changed

commit 89e33618fe6b8e30a376d619395db6a6c664e218
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 17:01:44 2022 +0200

    logging setup changed

commit 7312e57d1127b081bfdc6e96311e8348d3f8110d
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:45:12 2022 +0200

    logging setup changed

commit 955e353d74f414ee2d57b234bdf84d32817d14bf
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:37:52 2022 +0200

    fixed assignment
2022-05-12 10:18:13 +02:00
Matthias Bisping
eb050a588b Pull request #11: fixed assignment
Merge in RR/image-prediction from image_prediction_service_overhaul_xref_and_empty_result_fix_fix to master

Squashed commit of the following:

commit 7312e57d1127b081bfdc6e96311e8348d3f8110d
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:45:12 2022 +0200

    logging setup changed

commit 955e353d74f414ee2d57b234bdf84d32817d14bf
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:37:52 2022 +0200

    fixed assignment
2022-04-25 16:51:13 +02:00
Matthias Bisping
d55f77e1fa Merge branch 'master' of ssh://git.iqser.com:2222/rr/image-prediction 2022-04-25 16:30:28 +02:00
Matthias Bisping
1e65d672d7 Pull request #10: fixed check for analysis result validity
Merge in RR/image-prediction from result_check_fix to master

Squashed commit of the following:

commit 8352e657ef1f399ca0fe6a89e7a4c7fc4bd0701d
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:23:24 2022 +0200

    logging setup changed

commit 956706378b7d7f6daa574b86eb29636797c05bba
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 16:17:19 2022 +0200

    error handling for bad xrefs

commit e7d229c0d70574cae316a841ab1377fae625ab15
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 14:07:19 2022 +0200

    fixed check for analysis result validity
2022-04-25 16:28:26 +02:00
Matthias Bisping
e7d229c0d7 fixed check for analysis result validity 2022-04-25 14:07:19 +02:00
Matthias Bisping
ddd8d4685e Pull request #9: Tdd refactoring
Merge in RR/image-prediction from tdd_refactoring to master

Squashed commit of the following:

commit f6c64430007590f5d2b234a7f784e26025d06484
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 12:18:47 2022 +0200

    renaming

commit 8f40b51282191edf3e2a5edcd6d6acb388ada453
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 12:07:18 2022 +0200

    adjusted expected output for alpha channel in response

commit 7e666302d5eadb1e84b70cae27e8ec6108d7a135
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:52:51 2022 +0200

    added alpha channel check result to response

commit a6b9f64b51cd888fc0c427a38bd43ae2ae2cb051
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:27:57 2022 +0200

    readme updated

commit 0d06ad657e3c21dcef361c53df37b05aba64528b
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:19:35 2022 +0200

    readme updated and config

commit 75748a1d82f0ebdf3ad7d348c6d820c8858aa3cb
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:19:26 2022 +0200

    refactoring

commit 60101337828d11f5ee5fed0d8c4ec80cde536d8a
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:18:23 2022 +0200

    multiple routes for prediction

commit c8476cb5f55e470b831ae4557a031a2c1294eb86
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Mon Apr 25 11:17:49 2022 +0200

    add banner.txt to container

commit 26ef5fce8a9bc015f1c35f32d40e8bea50a96454
Author: Matthias Bisping <Matthias.Bisping@iqser.com>
Date:   Mon Apr 25 10:08:49 2022 +0200

    Pull request #8: Pipeline refactoring

    Merge in RR/image-prediction from pipeline_refactoring to tdd_refactoring

    Squashed commit of the following:

    commit 6989fcb3313007b7eecf4bba39077fcde6924a9a
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 09:49:49 2022 +0200

        removed obsolete module

    commit 7428aeee37b11c31cffa597c85b018ba71e79a1d
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 09:45:45 2022 +0200

        refactoring

    commit 0dcd3894154fdf34bd3ba4ef816362434474f472
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Mon Apr 25 08:57:21 2022 +0200

        refactoring; removed obsolete extractor-classifier

    commit 1078aa81144f4219149b3fcacdae8b09c4b905c0
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:18:10 2022 +0200

        removed obsolete imports

    commit 71f61fc5fc915da3941cf5ed5d9cc90fccc49031
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:16:25 2022 +0200

        comment changed

    commit b582726cd1de233edb55c5a76c91e99f9dd3bd13
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Fri Apr 22 17:12:11 2022 +0200

        refactoring

    commit 8abc9010048078868b235d6793ac6c8b20abb985
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:25:47 2022 +0200

        formatting

    commit 2c87c419fe3185a25c27139e7fcf79f60971ad24
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:24:05 2022 +0200

        formatting

    commit 50b161192db43a84464125c6d79650225e1010d6
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:20:18 2022 +0200

        refactoring

    commit 9a1446cccfa070852a5d9c0bdbc36037b82541fc
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 21:04:57 2022 +0200

        refactoring

    commit 6c10b55ff8e61412cb2fe5a5625e660ecaf1d7d1
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 19:48:05 2022 +0200

        refactoring

    commit 72e785e3e31c132ab352119e9921725f91fac9e2
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Thu Apr 21 19:43:39 2022 +0200

        refactoring

    commit f036ee55e6747daf31e3929bdc2d93dc5f2a56ca
    Author: Matthias Bisping <matthias.bisping@iqser.com>
    Date:   Wed Apr 20 18:30:41 2022 +0200

        refactoring pipeline WIP

commit 120721f5f1a7e910c0c2ebc79dc87c2908794c80
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:39:58 2022 +0200

    rm debug ls

commit 81226d4f8599af0db0e9718fbb1789cfad91a855
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:28:27 2022 +0200

    no compose down

commit 943f7799d49b6a6b0fed985a76ed4fe725dfaeef
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 15:22:17 2022 +0200

    coverage combine

commit d4cd96607157ea414db417cfd7133f56cb56afe1
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 14:43:09 2022 +0200

    model builder path in mlruns adjusted

commit 5b90bb47c3421feb6123c179eb68d1125d58ff1e
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:56:58 2022 +0200

    dvc pull in test running script

commit a935cacf2305a4a78a15ff571f368962f4538369
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:50:36 2022 +0200

    no clean working dir

commit ba09df7884485b8ab8efbf42a8058de9af60c75c
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 10:43:22 2022 +0200

    debug ls

commit 71263a9983dbfe2060ef5b74de7cc2cbbad43416
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Apr 20 09:11:03 2022 +0200

    debug ls

commit 41fbadc331e65e4ffe6d053e2d925e5e0543d8b7
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 20:08:08 2022 +0200

    debug echo

commit bb19698d640b3a99ea404e5b4b06d719a9bfe9e9
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 20:01:59 2022 +0200

    skip server predict test

commit 5094015a87fc0976c9d3ff5d1f4c6fdbd96b7eae
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Tue Apr 19 19:05:50 2022 +0200

    sonar stage after build stage

... and 253 more commits
2022-04-25 12:25:41 +02:00
Julius Unverfehrt
eb18ae8719 Pull request #5: Tests&Fixes
Merge in RR/image-prediction from tests to master

Squashed commit of the following:

commit 1776e3083c97025e699d579f936dd0cc6e1fe152
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 13:54:27 2022 +0100

    blacckkkyykykykyk

commit 4c9e6c38bdcea7d81008bf9dfcfcdd19d199da6a
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 13:53:40 2022 +0100

    add predicting as subprocess, add workaround for keras not working if the model was loaded in different process

commit 530de2ff8979c70aa22f06edf297864787e0cc79
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 13:36:23 2022 +0100

    refactor

commit 130d0e8b23e0375a6fd240ac8aa00492c341a716
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 13:34:54 2022 +0100

    add minimal not working example for keras bug in multiprocess process

commit 2589598b052f680fd702df4f60d56a55778474a9
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 11:13:45 2022 +0100

    test

commit eb6f211f02bc184e7f92d6b4d53c91da34ab9f2f
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 11:07:32 2022 +0100

    hardcoded test

commit 3e9bfac5cf9b2e09340e2c2c5b24a800925bcd60
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 11:01:21 2022 +0100

    test

commit 3d9c4d8856522cc2a22b2a7b9ea64d34629eb2c1
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 10:57:03 2022 +0100

    change test

commit 58ca784d6c56fd63734062d0c40b6b39550cf7d7
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 10:21:38 2022 +0100

    fix test

commit 6faad5ad5b6ef59bb5ef701b57d4c4addd17de0e
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Mon Mar 21 10:00:28 2022 +0100

    add predictor test

commit 3fbca0ac23821568a8afa904a8fb33ab0679f129
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Fri Mar 18 13:04:13 2022 +0100

    refactor folder structure

commit 90e3058c7124394a9f229d50278e57194f3d875d
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Fri Mar 18 12:58:02 2022 +0100

    add response test

commit 2a2deffd0b461ec5161009b3923623152f4c8f44
Author: Julius Unverfehrt <julius.unverfehrt@iqser.com>
Date:   Fri Mar 18 12:56:32 2022 +0100

    add test infrastructure
2022-03-23 11:49:05 +01:00
Matthias Bisping
a9d60654f5 Pull request #3: Refactoring
Merge in RR/image-prediction from refactoring to master

Squashed commit of the following:

commit fc4e2efac113f2e307fdbc091e0a4f4e3e5729d3
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 14:21:05 2022 +0100

    applied black

commit 3baabf5bc0b04347af85dafbb056f134258d9715
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 14:20:30 2022 +0100

    added banner

commit 30e871cfdc79d0ff2e0c26d1b858e55ab1b0453f
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 14:02:26 2022 +0100

    rename logger

commit d76fefd3ff0c4425defca4db218ce4a84c6053f3
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 14:00:39 2022 +0100

    logger refactoring

commit 0e004cbd21ab00b8804901952405fa870bf48e9c
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 14:00:08 2022 +0100

    logger refactoring

commit 49e113f8d85d7973b73f664779906a1347d1522d
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 13:25:08 2022 +0100

    refactoring

commit 7ec3d52e155cb83bed8804d2fee4f5bdf54fb59b
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 13:21:52 2022 +0100

    applied black

commit 06ea0be8aa9344e11b9d92fd526f2b73061bc736
Author: Matthias Bisping <matthias.bisping@iqser.com>
Date:   Wed Mar 16 13:21:20 2022 +0100

    refactoring
2022-03-16 15:07:30 +01:00
174 changed files with 45575 additions and 788 deletions

.coveragerc Normal file

@@ -0,0 +1,63 @@
# .coveragerc to control coverage.py
[run]
branch = True
parallel = True
command_line = -m pytest
concurrency = multiprocessing
omit =
    */site-packages/*
    */distutils/*
    */test/*
    */__init__.py
    */setup.py
    */venv/*
    */env/*
    */build_venv/*
    */build_env/*
    */utils/banner.py
    */utils/logger.py
    */src/*
source =
    image_prediction
relative_files = True
data_file = .coverage

[report]
# Regexes for lines to exclude from consideration
exclude_lines =
    # Have to re-enable the standard pragma
    pragma: no cover
    # Don't complain about missing debug-only code:
    def __repr__
    if self\.debug
    # Don't complain if tests don't hit defensive assertion code:
    raise AssertionError
    raise NotImplementedError
    # Don't complain if non-runnable code isn't run:
    if 0:
    if __name__ == .__main__.:
omit =
    */site-packages/*
    */distutils/*
    */test/*
    */__init__.py
    */setup.py
    */venv/*
    */env/*
    */build_venv/*
    */build_env/*
    */utils/banner.py
    */utils/logger.py
    */src/*
    */pdf_annotation.py
ignore_errors = True

[html]
directory = reports

[xml]
output = reports/coverage.xml


@@ -1,5 +1,8 @@
[core]
remote = vector
remote = azure_remote
autostage = true
['remote "vector"']
url = ssh://vector.iqser.com/research/image_service/
url = ssh://vector.iqser.com/research/image-prediction/
port = 22
['remote "azure_remote"']
url = azure://image-classification-dvc/

.gitignore vendored

@@ -1,7 +1,8 @@
.vscode/
*.h5
/venv/
*venv
.idea/
src/data
!.gitignore
*.project
@@ -32,6 +33,9 @@
**/classpath-data.json
**/dependencies-and-licenses-overview.txt
.coverage
.coverage\.*\.*
*__pycache__
*.egg-info*
@@ -44,7 +48,6 @@
*misc
/coverage_html_report/
.coverage
# Created by https://www.toptal.com/developers/gitignore/api/linux,pycharm
# Edit at https://www.toptal.com/developers/gitignore?templates=linux,pycharm
@@ -170,6 +173,4 @@ fabric.properties
# https://plugins.jetbrains.com/plugin/12206-codestream
.idea/codestream.xml
# End of https://www.toptal.com/developers/gitignore/api/linux,pycharm
/image_prediction/data/mlruns/
/data/mlruns/
# End of https://www.toptal.com/developers/gitignore/api/linux,pycharm

.gitlab-ci.yml Normal file

@@ -0,0 +1,51 @@
include:
  - project: "Gitlab/gitlab"
    ref: main
    file: "/ci-templates/research/dvc.gitlab-ci.yml"
  - project: "Gitlab/gitlab"
    ref: main
    file: "/ci-templates/research/versioning-build-test-release.gitlab-ci.yml"

variables:
  NEXUS_PROJECT_DIR: red
  IMAGENAME: "${CI_PROJECT_NAME}"
  INTEGRATION_TEST_FILE: "${CI_PROJECT_ID}.pdf"
  FF_USE_FASTZIP: "true" # enable fastzip - a faster zip implementation that also supports level configuration.
  ARTIFACT_COMPRESSION_LEVEL: default # can also be set to fastest, fast, slow and slowest. If just enabling fastzip is not enough try setting this to fastest or fast.
  CACHE_COMPRESSION_LEVEL: default # same as above, but for caches
  # TRANSFER_METER_FREQUENCY: 5s # will display transfer progress every 5 seconds for artifacts and remote caches. For debugging purposes.

stages:
  - data
  - setup
  - tests
  - sonarqube
  - versioning
  - build
  - integration-tests
  - release

docker-build:
  extends: .docker-build
  needs:
    - job: dvc-pull
      artifacts: true
    - !reference [.needs-versioning, needs] # leave this line as is

###################
# INTEGRATION TESTS
trigger-integration-tests:
  extends: .integration-tests
  # ADD THE MODEL BUILD WHICH SHOULD TRIGGER THE INTEGRATION TESTS
  # needs:
  #   - job: docker-build::model_name
  #     artifacts: true
  rules:
    - when: never

#########
# RELEASE
release:
  extends: .release
  needs:
    - !reference [.needs-versioning, needs] # leave this line as is

.gitmodules vendored

@@ -1,3 +0,0 @@
[submodule "incl/redai_image"]
path = incl/redai_image
url = ssh://git@git.iqser.com:2222/rr/redai_image.git

.python-version Normal file

@@ -0,0 +1 @@
3.10


@@ -1,25 +1,73 @@
ARG BASE_ROOT="nexus.iqser.com:5001/red/"
ARG VERSION_TAG="latest"
FROM python:3.10-slim AS builder
FROM ${BASE_ROOT}image-prediction-base:${VERSION_TAG}
ARG GITLAB_USER
ARG GITLAB_ACCESS_TOKEN
WORKDIR /app/service
ARG PYPI_REGISTRY_RESEARCH=https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi
ARG POETRY_SOURCE_REF_RESEARCH=gitlab-research
COPY src src
COPY data data
COPY image_prediction image_prediction
COPY incl/redai_image/redai incl/redai_image/redai
COPY setup.py setup.py
COPY requirements.txt requirements.txt
COPY config.yaml config.yaml
ARG PYPI_REGISTRY_RED=https://gitlab.knecon.com/api/v4/groups/12/-/packages/pypi
ARG POETRY_SOURCE_REF_RED=gitlab-red
# Install dependencies differing from base image.
RUN python3 -m pip install -r requirements.txt
ARG PYPI_REGISTRY_FFORESIGHT=https://gitlab.knecon.com/api/v4/groups/269/-/packages/pypi
ARG POETRY_SOURCE_REF_FFORESIGHT=gitlab-fforesight
RUN python3 -m pip install -e .
RUN python3 -m pip install -e incl/redai_image/redai
ARG VERSION=dev
LABEL maintainer="Research <research@knecon.com>"
LABEL version="${VERSION}"
WORKDIR /app
###########
# ENV SETUP
ENV PYTHONDONTWRITEBYTECODE=true
ENV PYTHONUNBUFFERED=true
ENV POETRY_HOME=/opt/poetry
ENV PATH="$POETRY_HOME/bin:$PATH"
RUN apt-get update && \
apt-get install -y curl git bash build-essential libffi-dev libssl-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN curl -sSL https://install.python-poetry.org | python3 -
RUN poetry --version
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.create true && \
poetry config virtualenvs.in-project true && \
poetry config installer.max-workers 10 && \
poetry config repositories.${POETRY_SOURCE_REF_RESEARCH} ${PYPI_REGISTRY_RESEARCH} && \
poetry config http-basic.${POETRY_SOURCE_REF_RESEARCH} ${GITLAB_USER} ${GITLAB_ACCESS_TOKEN} && \
poetry config repositories.${POETRY_SOURCE_REF_RED} ${PYPI_REGISTRY_RED} && \
poetry config http-basic.${POETRY_SOURCE_REF_RED} ${GITLAB_USER} ${GITLAB_ACCESS_TOKEN} && \
poetry config repositories.${POETRY_SOURCE_REF_FFORESIGHT} ${PYPI_REGISTRY_FFORESIGHT} && \
poetry config http-basic.${POETRY_SOURCE_REF_FFORESIGHT} ${GITLAB_USER} ${GITLAB_ACCESS_TOKEN} && \
poetry install --without=dev -vv --no-interaction --no-root
###############
# WORKING IMAGE
FROM python:3.10-slim
WORKDIR /app
# COPY SOURCE CODE FROM BUILDER IMAGE
COPY --from=builder /app /app
# COPY BILL OF MATERIALS (BOM)
COPY bom.json /bom.json
ENV PATH="/app/.venv/bin:$PATH"
###################
# COPY SOURCE CODE
COPY ./src ./src
COPY ./config ./config
COPY ./data ./data
COPY banner.txt ./
EXPOSE 5000
EXPOSE 8080
CMD ["python3", "src/serve.py"]
CMD [ "python", "src/serve.py"]

@@ -1,25 +0,0 @@
FROM python:3.8 as builder1
# Use a virtual environment.
RUN python -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
# Upgrade pip.
RUN python -m pip install --upgrade pip
# Make a directory for the service files and copy the service repo into the container.
WORKDIR /app/service
COPY ./requirements.txt ./requirements.txt
# Install dependencies.
RUN python3 -m pip install -r requirements.txt
# Make a new container and copy all relevant files over to filter out temporary files
# produced during setup to reduce the final container's size.
FROM python:3.8
WORKDIR /app/
COPY --from=builder1 /app .
ENV PATH="/app/venv/bin:$PATH"
WORKDIR /app/service

Dockerfile_tests Normal file

@@ -0,0 +1,43 @@
FROM python:3.10
ARG USERNAME
ARG TOKEN
ARG PYPI_REGISTRY_RESEARCH=https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi
ARG POETRY_SOURCE_REF_RESEARCH=gitlab-research
ARG PYPI_REGISTRY_RED=https://gitlab.knecon.com/api/v4/groups/12/-/packages/pypi
ARG POETRY_SOURCE_REF_RED=gitlab-red
ARG VERSION=dev
LABEL maintainer="Research <research@knecon.com>"
LABEL version="${VERSION}"
WORKDIR /app
ENV PYTHONUNBUFFERED=true
ENV POETRY_HOME=/opt/poetry
ENV PATH="$POETRY_HOME/bin:$PATH"
RUN curl -sSL https://install.python-poetry.org | python3 -
COPY ./data ./data
COPY ./test ./test
COPY ./config ./config
COPY ./src ./src
COPY pyproject.toml poetry.lock banner.txt config.yaml ./
RUN poetry config virtualenvs.create false && \
poetry config installer.max-workers 10 && \
poetry config repositories.${POETRY_SOURCE_REF_RESEARCH} ${PYPI_REGISTRY_RESEARCH} && \
poetry config http-basic.${POETRY_SOURCE_REF_RESEARCH} ${USERNAME} ${TOKEN} && \
poetry config repositories.${POETRY_SOURCE_REF_RED} ${PYPI_REGISTRY_RED} && \
poetry config http-basic.${POETRY_SOURCE_REF_RED} ${USERNAME} ${TOKEN} && \
poetry install --without=dev -vv --no-interaction --no-root
EXPOSE 5000
EXPOSE 8080
RUN apt-get update --yes && \
apt-get install --yes vim poppler-utils && \
rm -rf /var/lib/apt/lists/*
CMD coverage run -m pytest test/ --tb=native -q -s -vvv -x && coverage combine && coverage report -m && coverage xml

README.md

@@ -1,25 +1,143 @@
### Building
### Setup
Build base image
```bash
setup/docker.sh
```
Build head image
```bash
docker build -f Dockerfile -t image-prediction . --build-arg BASE_ROOT=""
docker build -t image-classification-image --progress=plain --no-cache \
-f Dockerfile \
--build-arg USERNAME=$GITLAB_USER \
--build-arg TOKEN=$GITLAB_ACCESS_TOKEN \
.
```
### Usage
#### Without Docker
```bash
py scripts/run_pipeline.py /path/to/a/pdf
```
#### With Docker
Shell 1
```bash
docker run --rm --net=host --rm image-prediction
docker run --rm --net=host image-prediction
```
Shell 2
```bash
python scripts/pyinfra_mock.py --pdf_path /path/to/a/pdf
python scripts/pyinfra_mock.py /path/to/a/pdf
```
### Tests
For example, run this command to execute all tests and produce a coverage report:
```bash
coverage run -m pytest test --tb=native -q -s -vvv -x && coverage combine && coverage report -m
```
After building the service container as described above, you can also run the tests in a container:
```bash
./run_tests.sh
```
### Message Body Formats
#### Request Format
Request messages must provide the fields `"dossierId"` and `"fileId"`. A request should look like this:
```json
{
"dossierId": "<string identifier>",
"fileId": "<string identifier>"
}
```
Any additional keys are ignored.
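The required-fields rule can be captured in a few lines of client-side Python. This is an illustrative sketch, not part of the service; `build_request` and `is_valid_request` are hypothetical helper names:

```python
import json

REQUIRED_FIELDS = ("dossierId", "fileId")

def build_request(dossier_id: str, file_id: str, **extra) -> str:
    # Extra keys are permitted; the service simply ignores them.
    return json.dumps({"dossierId": dossier_id, "fileId": file_id, **extra})

def is_valid_request(body: str) -> bool:
    # A body is valid when it is JSON and both required fields are strings.
    try:
        message = json.loads(body)
    except json.JSONDecodeError:
        return False
    return all(isinstance(message.get(field), str) for field in REQUIRED_FIELDS)
```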
#### Response Format
Response bodies contain the identified class of each image, the confidence of the classification, and the position and size of the image, as well as the results of additional convenience filters, which can be configured through environment variables. A response body looks like this:
```json
{
"dossierId": "debug",
"fileId": "13ffa9851740c8d20c4c7d1706d72f2a",
"data": [...]
}
```
An image metadata record (entry in `"data"` field of a response body) looks like this:
```json
{
"classification": {
"label": "logo",
"probabilities": {
"logo": 1.0,
"signature": 1.1599173226749333e-17,
"other": 2.994595513398207e-23,
"formula": 4.352109377281029e-31
}
},
"position": {
"x1": 475.95,
"x2": 533.4,
"y1": 796.47,
"y2": 827.62,
"pageNumber": 6
},
"geometry": {
"width": 57.44999999999999,
"height": 31.149999999999977
},
"alpha": false,
"filters": {
"geometry": {
"imageSize": {
"quotient": 0.05975350599135938,
"tooLarge": false,
"tooSmall": false
},
"imageFormat": {
"quotient": 1.8443017656500813,
"tooTall": false,
"tooWide": false
}
},
"probability": {
"unconfident": false
},
"allPassed": true
}
}
```
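In the example record above, `allPassed` is consistent with "no violation flag in `filters` is set". Under that assumption (a reading of the example, not a documented contract), a client could re-derive it like this:

```python
def all_filters_passed(filters: dict) -> bool:
    # Walk the nested filter results; any boolean flag set to True
    # (tooLarge, tooSmall, tooTall, tooWide, unconfident, ...) is a violation.
    def violation_flags(node: dict):
        for key, value in node.items():
            if key == "allPassed":  # skip the aggregate itself
                continue
            if isinstance(value, dict):
                yield from violation_flags(value)
            elif isinstance(value, bool):
                yield value
    return not any(violation_flags(filters))
```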
## Configuration
A configuration file is located at `config.yaml`. All relevant variables can be configured by
exporting environment variables.
| __Environment Variable__ | Default | Description |
|------------------------------------|------------------------------------|----------------------------------------------------------------------------------------|
| __LOGGING_LEVEL_ROOT__ | "INFO" | Logging level for log file messages |
| __VERBOSE__ | *true* | Service prints document processing progress to stdout |
| __BATCH_SIZE__ | 16 | Number of images in memory simultaneously per service instance |
| __RUN_ID__ | "fabfb1f192c745369b88cab34471aba7" | The ID of the mlflow run to load the image classifier from |
| __MIN_REL_IMAGE_SIZE__             | 0.05                               | Minimum permissible image size to page size ratio                                       |
| __MAX_REL_IMAGE_SIZE__             | 0.75                               | Maximum permissible image size to page size ratio                                       |
| __MIN_IMAGE_FORMAT__               | 0.1                                | Minimum permissible image width to height ratio                                         |
| __MAX_IMAGE_FORMAT__               | 10                                 | Maximum permissible image width to height ratio                                         |
See also: https://git.iqser.com/projects/RED/repos/helm/browse/redaction/templates/image-service-v2
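The defaults in the table are overridden by exported environment variables; the service resolves them via envyaml. The helper below is only a simplified stand-in to illustrate the `$NAME|default` fallback behavior used in `config.yaml`, not the actual parsing code:

```python
import os
import re

def resolve(value: str) -> str:
    # Resolve a "$NAME|default" placeholder: use the environment variable
    # NAME if it is set, otherwise fall back to the literal default.
    match = re.fullmatch(r"\$(\w+)\|(.*)", value)
    if not match:
        return value
    name, default = match.groups()
    return os.environ.get(name, default)

os.environ["MIN_REL_IMAGE_SIZE"] = "0.1"
os.environ.pop("MAX_REL_IMAGE_SIZE", None)
resolve("$MIN_REL_IMAGE_SIZE|0.05")  # -> "0.1" (env wins)
resolve("$MAX_REL_IMAGE_SIZE|0.75")  # -> "0.75" (unset, default wins)
```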

@@ -1,40 +0,0 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>com.atlassian.bamboo</groupId>
<artifactId>bamboo-specs-parent</artifactId>
<version>7.1.2</version>
<relativePath/>
</parent>
<artifactId>bamboo-specs</artifactId>
<version>1.0.0-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<sonar.skip>true</sonar.skip>
</properties>
<dependencies>
<dependency>
<groupId>com.atlassian.bamboo</groupId>
<artifactId>bamboo-specs-api</artifactId>
</dependency>
<dependency>
<groupId>com.atlassian.bamboo</groupId>
<artifactId>bamboo-specs</artifactId>
</dependency>
<!-- Test dependencies -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<!-- run 'mvn test' to perform offline validation of the plan -->
<!-- run 'mvn -Ppublish-specs' to upload the plan to your Bamboo server -->
</project>

@@ -1,182 +0,0 @@
package buildjob;
import com.atlassian.bamboo.specs.api.BambooSpec;
import com.atlassian.bamboo.specs.api.builders.BambooKey;
import com.atlassian.bamboo.specs.api.builders.docker.DockerConfiguration;
import com.atlassian.bamboo.specs.api.builders.permission.PermissionType;
import com.atlassian.bamboo.specs.api.builders.permission.Permissions;
import com.atlassian.bamboo.specs.api.builders.permission.PlanPermissions;
import com.atlassian.bamboo.specs.api.builders.plan.Job;
import com.atlassian.bamboo.specs.api.builders.plan.Plan;
import com.atlassian.bamboo.specs.api.builders.plan.PlanIdentifier;
import com.atlassian.bamboo.specs.api.builders.plan.Stage;
import com.atlassian.bamboo.specs.api.builders.plan.branches.BranchCleanup;
import com.atlassian.bamboo.specs.api.builders.plan.branches.PlanBranchManagement;
import com.atlassian.bamboo.specs.api.builders.project.Project;
import com.atlassian.bamboo.specs.builders.task.CheckoutItem;
import com.atlassian.bamboo.specs.builders.task.InjectVariablesTask;
import com.atlassian.bamboo.specs.builders.task.ScriptTask;
import com.atlassian.bamboo.specs.builders.task.VcsCheckoutTask;
import com.atlassian.bamboo.specs.builders.task.CleanWorkingDirectoryTask;
import com.atlassian.bamboo.specs.builders.task.VcsTagTask;
import com.atlassian.bamboo.specs.builders.trigger.BitbucketServerTrigger;
import com.atlassian.bamboo.specs.model.task.InjectVariablesScope;
import com.atlassian.bamboo.specs.api.builders.Variable;
import com.atlassian.bamboo.specs.util.BambooServer;
import com.atlassian.bamboo.specs.builders.task.ScriptTask;
import com.atlassian.bamboo.specs.model.task.ScriptTaskProperties.Location;
/**
* Plan configuration for Bamboo.
* Learn more on: <a href="https://confluence.atlassian.com/display/BAMBOO/Bamboo+Specs">https://confluence.atlassian.com/display/BAMBOO/Bamboo+Specs</a>
*/
@BambooSpec
public class PlanSpec {
private static final String SERVICE_NAME = "image-prediction";
private static final String SERVICE_NAME_BASE = "image-prediction-base";
private static final String SERVICE_KEY = SERVICE_NAME.toUpperCase().replaceAll("-","").replaceAll("_","");
/**
* Run main to publish plan on Bamboo
*/
public static void main(final String[] args) throws Exception {
//By default credentials are read from the '.credentials' file.
BambooServer bambooServer = new BambooServer("http://localhost:8085");
Plan plan = new PlanSpec().createDockerBuildPlan();
bambooServer.publish(plan);
PlanPermissions planPermission = new PlanSpec().createPlanPermission(plan.getIdentifier());
bambooServer.publish(planPermission);
}
private PlanPermissions createPlanPermission(PlanIdentifier planIdentifier) {
Permissions permission = new Permissions()
.userPermissions("atlbamboo", PermissionType.EDIT, PermissionType.VIEW, PermissionType.ADMIN, PermissionType.CLONE, PermissionType.BUILD)
.groupPermissions("research", PermissionType.EDIT, PermissionType.VIEW, PermissionType.CLONE, PermissionType.BUILD)
.groupPermissions("Development", PermissionType.EDIT, PermissionType.VIEW, PermissionType.CLONE, PermissionType.BUILD)
.groupPermissions("QA", PermissionType.EDIT, PermissionType.VIEW, PermissionType.CLONE, PermissionType.BUILD)
.loggedInUserPermissions(PermissionType.VIEW)
.anonymousUserPermissionView();
return new PlanPermissions(planIdentifier.getProjectKey(), planIdentifier.getPlanKey()).permissions(permission);
}
private Project project() {
return new Project()
.name("RED")
.key(new BambooKey("RED"));
}
public Plan createDockerBuildPlan() {
return new Plan(
project(),
SERVICE_NAME, new BambooKey(SERVICE_KEY))
.description("Docker build for image-prediction.")
// .variables()
.stages(new Stage("Build Stage")
.jobs(
new Job("Build Job", new BambooKey("BUILD"))
.tasks(
new CleanWorkingDirectoryTask()
.description("Clean working directory.")
.enabled(true),
new VcsCheckoutTask()
.description("Checkout default repository.")
.checkoutItems(new CheckoutItem().defaultRepository()),
new VcsCheckoutTask()
.description("Checkout redai_image research repository.")
.checkoutItems(new CheckoutItem().repository("RR / redai_image").path("redai_image")),
new ScriptTask()
.description("Set config and keys.")
.inlineBody("mkdir -p ~/.ssh\n" +
"echo \"${bamboo.bamboo_agent_ssh}\" | base64 -d >> ~/.ssh/id_rsa\n" +
"echo \"host vector.iqser.com\" > ~/.ssh/config\n" +
"echo \" user bamboo-agent\" >> ~/.ssh/config\n" +
"chmod 600 ~/.ssh/config ~/.ssh/id_rsa"),
new ScriptTask()
.description("Build Docker container.")
.location(Location.FILE)
.fileFromPath("bamboo-specs/src/main/resources/scripts/docker-build.sh")
.argument(SERVICE_NAME + " " + SERVICE_NAME_BASE))
.dockerConfiguration(
new DockerConfiguration()
.image("nexus.iqser.com:5001/infra/release_build:4.2.0")
.volume("/var/run/docker.sock", "/var/run/docker.sock")),
new Job("Sonar Job", new BambooKey("SONAR"))
.tasks(
new CleanWorkingDirectoryTask()
.description("Clean working directory.")
.enabled(true),
new VcsCheckoutTask()
.description("Checkout default repository.")
.checkoutItems(new CheckoutItem().defaultRepository()),
new VcsCheckoutTask()
.description("Checkout redai_image repository.")
.checkoutItems(new CheckoutItem().repository("RR / redai_image").path("redai_image")),
new ScriptTask()
.description("Set config and keys.")
.inlineBody("mkdir -p ~/.ssh\n" +
"echo \"${bamboo.bamboo_agent_ssh}\" | base64 -d >> ~/.ssh/id_rsa\n" +
"echo \"host vector.iqser.com\" > ~/.ssh/config\n" +
"echo \" user bamboo-agent\" >> ~/.ssh/config\n" +
"chmod 600 ~/.ssh/config ~/.ssh/id_rsa"),
new ScriptTask()
.description("Run Sonarqube scan.")
.location(Location.FILE)
.fileFromPath("bamboo-specs/src/main/resources/scripts/sonar-scan.sh")
.argument(SERVICE_NAME))
.dockerConfiguration(
new DockerConfiguration()
.image("nexus.iqser.com:5001/infra/release_build:4.2.0")
.volume("/var/run/docker.sock", "/var/run/docker.sock"))),
new Stage("Licence Stage")
.jobs(
new Job("Git Tag Job", new BambooKey("GITTAG"))
.tasks(
new VcsCheckoutTask()
.description("Checkout default repository.")
.checkoutItems(new CheckoutItem().defaultRepository()),
new ScriptTask()
.description("Build git tag.")
.location(Location.FILE)
.fileFromPath("bamboo-specs/src/main/resources/scripts/git-tag.sh"),
new InjectVariablesTask()
.description("Inject git tag.")
.path("git.tag")
.namespace("g")
.scope(InjectVariablesScope.LOCAL),
new VcsTagTask()
.description("${bamboo.g.gitTag}")
.tagName("${bamboo.g.gitTag}")
.defaultRepository())
.dockerConfiguration(
new DockerConfiguration()
.image("nexus.iqser.com:5001/infra/release_build:4.4.1")),
new Job("Licence Job", new BambooKey("LICENCE"))
.enabled(false)
.tasks(
new VcsCheckoutTask()
.description("Checkout default repository.")
.checkoutItems(new CheckoutItem().defaultRepository()),
new ScriptTask()
.description("Build licence.")
.location(Location.FILE)
.fileFromPath("bamboo-specs/src/main/resources/scripts/create-licence.sh"))
.dockerConfiguration(
new DockerConfiguration()
.image("nexus.iqser.com:5001/infra/maven:3.6.2-jdk-13-3.0.0")
.volume("/etc/maven/settings.xml", "/usr/share/maven/ref/settings.xml")
.volume("/var/run/docker.sock", "/var/run/docker.sock"))))
.linkedRepositories("RR / " + SERVICE_NAME)
.linkedRepositories("RR / redai_image")
.triggers(new BitbucketServerTrigger())
.planBranchManagement(new PlanBranchManagement()
.createForVcsBranch()
.delete(new BranchCleanup()
.whenInactiveInRepositoryAfterDays(14))
.notificationForCommitters());
}
}

@@ -1,19 +0,0 @@
#!/bin/bash
set -e
if [[ "${bamboo_version_tag}" != "dev" ]]
then
${bamboo_capability_system_builder_mvn3_Maven_3}/bin/mvn \
-f ${bamboo_build_working_directory}/pom.xml \
versions:set \
-DnewVersion=${bamboo_version_tag}
${bamboo_capability_system_builder_mvn3_Maven_3}/bin/mvn \
-f ${bamboo_build_working_directory}/pom.xml \
-B clean deploy \
-e -DdeployAtEnd=true \
-Dmaven.wagon.http.ssl.insecure=true \
-Dmaven.wagon.http.ssl.allowall=true \
-Dmaven.wagon.http.ssl.ignore.validity.dates=true \
-DaltDeploymentRepository=iqser_release::default::https://nexus.iqser.com/repository/gin4-platform-releases
fi

@@ -1,19 +0,0 @@
#!/bin/bash
set -e
SERVICE_NAME=$1
SERVICE_NAME_BASE=$2
python3 -m venv build_venv
source build_venv/bin/activate
python3 -m pip install --upgrade pip
pip install dvc
pip install 'dvc[ssh]'
dvc pull
echo "index-url = https://${bamboo_nexus_user}:${bamboo_nexus_password}@nexus.iqser.com/repository/python-combind/simple" >> pip.conf
docker build -f Dockerfile_base -t nexus.iqser.com:5001/red/$SERVICE_NAME_BASE:${bamboo_version_tag} .
docker build -f Dockerfile -t nexus.iqser.com:5001/red/$SERVICE_NAME:${bamboo_version_tag} --build-arg VERSION_TAG=${bamboo_version_tag} .
echo "${bamboo_nexus_password}" | docker login --username "${bamboo_nexus_user}" --password-stdin nexus.iqser.com:5001
docker push nexus.iqser.com:5001/red/$SERVICE_NAME:${bamboo_version_tag}

@@ -1,9 +0,0 @@
#!/bin/bash
set -e
if [[ "${bamboo_version_tag}" = "dev" ]]
then
echo "gitTag=${bamboo_planRepository_1_branch}_${bamboo_buildNumber}" > git.tag
else
echo "gitTag=${bamboo_version_tag}" > git.tag
fi

@@ -1,51 +0,0 @@
#!/bin/bash
set -e
export JAVA_HOME=/usr/bin/sonar-scanner/jre
python3 -m venv build_venv
source build_venv/bin/activate
python3 -m pip install --upgrade pip
echo "dev setup for unit test and coverage 💖"
pip install -e .
pip install -r requirements.txt
SERVICE_NAME=$1
echo "dependency-check:aggregate"
mkdir -p reports
dependency-check --enableExperimental -f JSON -f HTML -f XML \
--disableAssembly -s . -o reports --project $SERVICE_NAME --exclude ".git/**" --exclude "venv/**" \
--exclude "build_venv/**" --exclude "**/__pycache__/**" --exclude "bamboo-specs/**"
if [[ -z "${bamboo_repository_pr_key}" ]]
then
echo "Sonar Scan for branch: ${bamboo_planRepository_1_branch}"
/usr/bin/sonar-scanner/bin/sonar-scanner \
-Dsonar.projectKey=RED_$SERVICE_NAME \
-Dsonar.sources=image_prediction \
-Dsonar.host.url=https://sonarqube.iqser.com \
-Dsonar.login=${bamboo_sonarqube_api_token_secret} \
-Dsonar.branch.name=${bamboo_planRepository_1_branch} \
-Dsonar.dependencyCheck.jsonReportPath=reports/dependency-check-report.json \
-Dsonar.dependencyCheck.xmlReportPath=reports/dependency-check-report.xml \
-Dsonar.dependencyCheck.htmlReportPath=reports/dependency-check-report.html \
-Dsonar.python.coverage.reportPaths=reports/coverage.xml
else
echo "Sonar Scan for PR with key1: ${bamboo_repository_pr_key}"
/usr/bin/sonar-scanner/bin/sonar-scanner \
-Dsonar.projectKey=RED_$SERVICE_NAME \
-Dsonar.sources=image_prediction \
-Dsonar.host.url=https://sonarqube.iqser.com \
-Dsonar.login=${bamboo_sonarqube_api_token_secret} \
-Dsonar.pullrequest.key=${bamboo_repository_pr_key} \
-Dsonar.pullrequest.branch=${bamboo_repository_pr_sourceBranch} \
-Dsonar.pullrequest.base=${bamboo_repository_pr_targetBranch} \
-Dsonar.dependencyCheck.jsonReportPath=reports/dependency-check-report.json \
-Dsonar.dependencyCheck.xmlReportPath=reports/dependency-check-report.xml \
-Dsonar.dependencyCheck.htmlReportPath=reports/dependency-check-report.html \
-Dsonar.python.coverage.reportPaths=reports/coverage.xml
fi

@@ -1,16 +0,0 @@
package buildjob;
import com.atlassian.bamboo.specs.api.builders.plan.Plan;
import com.atlassian.bamboo.specs.api.exceptions.PropertiesValidationException;
import com.atlassian.bamboo.specs.api.util.EntityPropertiesBuilders;
import org.junit.Test;
public class PlanSpecTest {
@Test
public void checkYourPlanOffline() throws PropertiesValidationException {
Plan plan = new PlanSpec().createDockerBuildPlan();
EntityPropertiesBuilders.build(plan);
}
}

banner.txt Normal file

@@ -0,0 +1,11 @@
+----------------------------------------------------+
| ___ |
| __/_ `. .-"""-. |
|_._ _,-'""`-._ \_,` | \-' / )`-')|
|(,-.`._,'( |\`-/| "") `"` \ ((`"` |
| `-.-' \ )-`( , o o) ___Y , .'7 /| |
| `- \`_`"'- (_,___/...-` (_/_/ |
| |
+----------------------------------------------------+
| Image Classification Service |
+----------------------------------------------------+

bom.json Normal file (33697 lines)

File diff suppressed because it is too large.

@@ -1,28 +0,0 @@
webserver:
host: $SERVER_HOST|"127.0.0.1" # webserver address
port: $SERVER_PORT|5000 # webserver port
mode: $SERVER_MODE|production # webserver mode: {development, production}
service:
logging_level: $LOGGING_LEVEL_ROOT|DEBUG # Logging level for service logger
batch_size: $BATCH_SIZE|32 # Number of images in memory simultaneously
verbose: $VERBOSE|True # Service prints document processing progress to stdout
run_id: $RUN_ID|fabfb1f192c745369b88cab34471aba7 # The ID of the mlflow run to load the model from
# These variables control filters that are applied to either images, image metadata or model predictions. The filter
# result values are reported in the service responses. For convenience the response to a request contains a
# "filters.allPassed" field, which is set to false if any of the values returned by the filters did not meet its
# specified requirement.
filters:
image_to_page_quotient: # Image size to page size ratio (ratio of geometric means of areas)
min: $MIN_REL_IMAGE_SIZE|0.05 # Minimum permissible
max: $MAX_REL_IMAGE_SIZE|0.75 # Maximum permissible
image_width_to_height_quotient: # Image width to height ratio
min: $MIN_IMAGE_FORMAT|0.1 # Minimum permissible
max: $MAX_IMAGE_FORMAT|10 # Maximum permissible
min_confidence: $MIN_CONFIDENCE|0.5 # Minimum permissible prediction confidence

config/pyinfra.toml Normal file

@@ -0,0 +1,68 @@
[asyncio]
max_concurrent_tasks = 10
[dynamic_tenant_queues]
enabled = true
[metrics.prometheus]
enabled = true
prefix = "redactmanager_image_service"
[tracing]
enabled = true
# possible values "opentelemetry" | "azure_monitor" (expects the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable)
type = "azure_monitor"
[tracing.opentelemetry]
endpoint = "http://otel-collector-opentelemetry-collector.otel-collector:4318/v1/traces"
service_name = "redactmanager_image_service"
exporter = "otlp"
[webserver]
host = "0.0.0.0"
port = 8080
[rabbitmq]
host = "localhost"
port = 5672
username = ""
password = ""
heartbeat = 60
# Must be a divisor of heartbeat, and shouldn't be too large, since queue interactions (such as receiving new messages) only happen at these intervals
# This is also the minimum time the service needs to process a message
connection_sleep = 5
input_queue = "request_queue"
output_queue = "response_queue"
dead_letter_queue = "dead_letter_queue"
tenant_event_queue_suffix = "_tenant_event_queue"
tenant_event_dlq_suffix = "_tenant_events_dlq"
tenant_exchange_name = "tenants-exchange"
queue_expiration_time = 300000 # 5 minutes in milliseconds
service_request_queue_prefix = "image_request_queue"
service_request_exchange_name = "image_request_exchange"
service_response_exchange_name = "image_response_exchange"
service_dlq_name = "image_dlq"
[storage]
backend = "s3"
[storage.s3]
bucket = "redaction"
endpoint = "http://127.0.0.1:9000"
key = ""
secret = ""
region = "eu-central-1"
[storage.azure]
container = "redaction"
connection_string = ""
[storage.tenant_server]
public_key = ""
endpoint = "http://tenant-user-management:8081/internal-api/tenants"
[kubernetes]
pod_name = "test_pod"

config/settings.toml Normal file

@@ -0,0 +1,42 @@
[logging]
level = "INFO"
[service]
# Print document processing progress to stdout
verbose = false
batch_size = 6
image_stiching_tolerance = 1 # in pixels
mlflow_run_id = "fabfb1f192c745369b88cab34471aba7"
# These variables control filters that are applied to either images, image metadata or service_estimator predictions.
# The filter result values are reported in the service responses. For convenience the response to a request contains a
# "filters.allPassed" field, which is set to false if any of the values returned by the filters did not meet its
# specified required value.
[filters.confidence]
# Minimum permissible prediction confidence
min = 0.5
# Image size to page size ratio (ratio of geometric means of areas)
[filters.image_to_page_quotient]
min = 0.05
max = 0.75
[filters.is_scanned_page]
# Minimum permissible image-to-page ratio tolerance for a page to be considered scanned.
# This is only used for filtering small images on scanned pages and is applied before the image is processed,
# thereby superseding the image_to_page_quotient filter, which only tags the image after processing.
tolerance = 0
# Image width to height ratio
[filters.image_width_to_height_quotient]
min = 0.1
max = 10
# put class specific filters here ['signature', 'formula', 'logo']
[filters.overrides.signature.image_to_page_quotient]
max = 0.4
[filters.overrides.logo.image_to_page_quotient]
min = 0.06
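The `[filters.overrides.<label>.<name>]` tables shadow the base `[filters.<name>]` values for that class. A minimal sketch of that merge, assuming plain-dict semantics (`effective_filter` is an illustrative helper, not service code):

```python
def effective_filter(filters: dict, label: str, name: str) -> dict:
    # Start from the base filter table, then let any class-specific
    # override for `label` shadow individual keys.
    merged = dict(filters.get(name, {}))
    merged.update(filters.get("overrides", {}).get(label, {}).get(name, {}))
    return merged

filters = {
    "image_to_page_quotient": {"min": 0.05, "max": 0.75},
    "overrides": {
        "signature": {"image_to_page_quotient": {"max": 0.4}},
        "logo": {"image_to_page_quotient": {"min": 0.06}},
    },
}
effective_filter(filters, "signature", "image_to_page_quotient")  # -> {"min": 0.05, "max": 0.4}
```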

data/.gitignore vendored Normal file

@@ -0,0 +1 @@
/mlruns

@@ -1,4 +0,0 @@
outs:
- md5: 6d0186c1f25e889d531788f168fa6cf0
size: 16727296
path: base_weights.h5

@@ -1,5 +1,5 @@
outs:
- md5: d1c708270bab6fcd344d4a8b05d1103d.dir
size: 150225383
nfiles: 178
- md5: ad061d607f615afc149643f62dbf37cc.dir
size: 166952700
nfiles: 179
path: mlruns

doc/tests.drawio Normal file

@@ -0,0 +1 @@
<mxfile host="app.diagrams.net" modified="2022-03-17T15:35:10.371Z" agent="5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36" etag="b-CbBXg6FXQ9T3Px-oLc" version="17.1.1" type="device"><diagram id="tS3WR_Pr6QhNVK3FqSUP" name="Page-1">1ZZRT6QwEMc/DY8mQHdRX93z9JLbmNzGmNxbQ0daLQzpDrL46a/IsCzinneJcd0XaP+dtsN/fkADscg3V06WeokKbBCHahOIb0Ecnydzf22FphPmyXknZM6oTooGYWWegcWQ1cooWI8CCdGSKcdiikUBKY006RzW47B7tONdS5nBRFil0k7VO6NId+rZPBz0azCZ7neOQh7JZR/MwlpLhfWOJC4DsXCI1LXyzQJs613vSzfv+57RbWIOCvqXCZqW9PBref27aZ7xsQ5vTn/cnvAqT9JW/MCwJuNzR8dZU9Nb4bAqFLSrhYG4qLUhWJUybUdrX3uvacqt70W+yeuCI9jsTTja2uDxAcyBXONDeILonWN04hn366EQUR+jd4qQsCa59tl26cEe32CH/sOt+TueoCONGRbS/kQs2YkHIGoYbFkRvuUTqAmFr1zyu2LlUvhLdjG/HtJlQO/VfOq6AyvJPI3z+HAL4wlwpbp/2V0qODxzUTJmLjo4c8nEkxaWFXcLLPzt4ithKI4BQzHBMOc/l8UvAeLrj9/hQTw9NhBnxwDibB+IB+ZvdvZ5/PnucAx6Gds5S4rLPw==</diagram></mxfile>

@@ -1,40 +0,0 @@
"""Implements a config object with dot-indexing syntax."""
from envyaml import EnvYAML
from image_prediction.locations import CONFIG_FILE
def _get_item_and_maybe_make_dotindexable(container, item):
ret = container[item]
return DotIndexable(ret) if isinstance(ret, dict) else ret
class DotIndexable:
def __init__(self, x):
self.x = x
def __getattr__(self, item):
return _get_item_and_maybe_make_dotindexable(self.x, item)
def __setitem__(self, key, value):
self.x[key] = value
def __repr__(self):
return self.x.__repr__()
class Config:
def __init__(self, config_path):
self.__config = EnvYAML(config_path)
def __getattr__(self, item):
if item in self.__config:
return _get_item_and_maybe_make_dotindexable(self.__config, item)
def __getitem__(self, item):
return self.__getattr__(item)
CONFIG = Config(CONFIG_FILE)
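The dot-indexing behavior implemented above can be demonstrated in isolation. The condensed copy below drops `__setitem__` and the EnvYAML loading so the example stays self-contained:

```python
class DotIndexable:
    # Condensed copy of the wrapper above: nested dicts become
    # attribute-accessible, leaf values are returned as-is.
    def __init__(self, x):
        self.x = x

    def __getattr__(self, item):
        ret = self.x[item]
        return DotIndexable(ret) if isinstance(ret, dict) else ret

cfg = DotIndexable({"filters": {"image_to_page_quotient": {"min": 0.05, "max": 0.75}}})
cfg.filters.image_to_page_quotient.min  # -> 0.05
```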

@@ -1,14 +0,0 @@
from os import path
MODULE_DIR = path.dirname(path.abspath(__file__))
PACKAGE_ROOT_DIR = path.dirname(MODULE_DIR)
REPO_ROOT_DIR = path.dirname(path.dirname(PACKAGE_ROOT_DIR))
DOCKER_COMPOSE_FILE = path.join(REPO_ROOT_DIR, "docker-compose.yaml")
CONFIG_FILE = path.join(PACKAGE_ROOT_DIR, "config.yaml")
LOG_FILE = "/tmp/log.log"
DATA_DIR = path.join(PACKAGE_ROOT_DIR, "data")
MLRUNS_DIR = path.join(DATA_DIR, "mlruns")
BASE_WEIGHTS = path.join(DATA_DIR, "base_weights.h5")

@@ -1,116 +0,0 @@
import logging
from itertools import chain
from operator import itemgetter
from typing import List, Dict, Iterable
import numpy as np
from image_prediction.config import CONFIG
from image_prediction.locations import MLRUNS_DIR, BASE_WEIGHTS
from incl.redai_image.redai.redai.backend.model.model_handle import ModelHandle
from incl.redai_image.redai.redai.backend.pdf.image_extraction import extract_and_stitch
from incl.redai_image.redai.redai.utils.mlflow_reader import MlflowModelReader
from incl.redai_image.redai.redai.utils.shared import chunk_iterable
class Predictor:
"""`ModelHandle` wrapper. Forwards to wrapped model handle for prediction and produces structured output that is
interpretable independently of the wrapped model (e.g. with regard to a .classes_ attribute).
"""
def __init__(self, model_handle: ModelHandle = None):
"""Initializes a ServiceEstimator.
Args:
model_handle: ModelHandle object to forward to for prediction. By default, a model handle is loaded from the
mlflow database via CONFIG.service.run_id.
"""
try:
if model_handle is None:
reader = MlflowModelReader(run_id=CONFIG.service.run_id, mlruns_dir=MLRUNS_DIR)
self.model_handle = reader.get_model_handle(BASE_WEIGHTS)
else:
self.model_handle = model_handle
self.classes = self.model_handle.model.classes_
self.classes_readable = np.array(self.model_handle.classes)
self.classes_readable_aligned = self.classes_readable[self.classes[list(range(len(self.classes)))]]
except Exception as e:
logging.info(f"Service estimator initialization failed: {e}")
def __make_predictions_human_readable(self, probs: np.ndarray) -> List[Dict[str, float]]:
"""Translates an n x m matrix of probabilities over classes into an n-element list of mappings from classes to
probabilities.
Args:
probs: probability matrix (items x classes)
Returns:
list of mappings from classes to probabilities.
"""
classes = np.argmax(probs, axis=1)
classes = self.classes[classes]
classes_readable = [self.model_handle.classes[c] for c in classes]
return classes_readable
def predict(self, images: List, probabilities: bool = False, **kwargs):
"""Gathers predictions for list of images. Assigns each image a class and optionally a probability distribution
over all classes.
Args:
images (List[PIL.Image]) : Images to gather predictions for.
probabilities: Whether to return dictionaries of the following form instead of strings:
{
"class": predicted class,
"probabilities": {
"class 1" : class 1 probability,
"class 2" : class 2 probability,
...
}
}
Returns:
By default the return value is a list of classes (meaningful class name strings). Alternatively a list of
dictionaries with an additional probability field for estimated class probabilities per image can be
returned.
"""
X = self.model_handle.prep_images(list(images))
probs_per_item = self.model_handle.model.predict_proba(X, **kwargs).astype(float)
classes = self.__make_predictions_human_readable(probs_per_item)
class2prob_per_item = [dict(zip(self.classes_readable_aligned, probs)) for probs in probs_per_item]
class2prob_per_item = [
dict(sorted(c2p.items(), key=itemgetter(1), reverse=True)) for c2p in class2prob_per_item
]
predictions = [{"class": c, "probabilities": c2p} for c, c2p in zip(classes, class2prob_per_item)]
return predictions if probabilities else classes
def extract_image_metadata_pairs(pdf_path: str, **kwargs):
def image_is_large_enough(metadata: dict):
x1, x2, y1, y2 = itemgetter("x1", "x2", "y1", "y2")(metadata)
return abs(x1 - x2) > 2 and abs(y1 - y2) > 2
yield from extract_and_stitch(pdf_path, convert_to_rgb=True, filter_fn=image_is_large_enough, **kwargs)
def classify_images(predictor, image_metadata_pairs: Iterable, batch_size: int = CONFIG.service.batch_size):
def process_chunk(chunk):
images, metadata = zip(*chunk)
predictions = predictor.predict(images, probabilities=True)
return predictions, metadata
def predict(image_metadata_pair_generator):
chunks = chunk_iterable(image_metadata_pair_generator, n=batch_size)
return map(chain.from_iterable, zip(*map(process_chunk, chunks)))
try:
predictions, metadata = predict(image_metadata_pairs)
return predictions, metadata
except ValueError:
return [], []
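The chunked prediction flow in `classify_images` (chunk the stream, predict per chunk, then flatten the per-chunk outputs back into two parallel streams) can be sketched without the model; `chunk_iterable` and the parity "predictor" below are hypothetical stand-ins:

```python
from itertools import chain, islice

def chunk_iterable(iterable, n):
    # Hypothetical stand-in for the project's chunk_iterable helper:
    # yield successive n-sized tuples from an iterable.
    it = iter(iterable)
    while chunk := tuple(islice(it, n)):
        yield chunk

def process_chunk(chunk):
    # Toy predictor: unzip the (image, metadata) pairs and "classify" by parity.
    images, metadata = zip(*chunk)
    predictions = ["even" if i % 2 == 0 else "odd" for i in images]
    return predictions, metadata

pairs = [(i, {"id": i}) for i in range(5)]
chunks = chunk_iterable(pairs, n=2)
# Same shape as predict() above: flatten per-chunk outputs into two parallel streams.
predictions, metadata = map(list, map(chain.from_iterable, zip(*map(process_chunk, chunks))))
print(predictions)  # ['even', 'odd', 'even', 'odd', 'even']
```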


@ -1,71 +0,0 @@
"""Defines functions for constructing service responses."""
from itertools import starmap
from operator import itemgetter
import numpy as np
from image_prediction.config import CONFIG
def build_response(predictions: list, metadata: list) -> list:
return list(starmap(build_image_info, zip(predictions, metadata)))
def build_image_info(prediction: dict, metadata: dict) -> dict:
def compute_geometric_quotient():
page_area_sqrt = np.sqrt(abs(page_width * page_height))
image_area_sqrt = np.sqrt(abs(x2 - x1) * abs(y2 - y1))
return image_area_sqrt / page_area_sqrt
page_width, page_height, x1, x2, y1, y2, width, height = itemgetter(
"page_width", "page_height", "x1", "x2", "y1", "y2", "width", "height"
)(metadata)
quotient = compute_geometric_quotient()
min_image_to_page_quotient_breached = bool(quotient < CONFIG.filters.image_to_page_quotient.min)
max_image_to_page_quotient_breached = bool(quotient > CONFIG.filters.image_to_page_quotient.max)
min_image_width_to_height_quotient_breached = bool(
width / height < CONFIG.filters.image_width_to_height_quotient.min
)
max_image_width_to_height_quotient_breached = bool(
width / height > CONFIG.filters.image_width_to_height_quotient.max
)
min_confidence_breached = bool(max(prediction["probabilities"].values()) < CONFIG.filters.min_confidence)
prediction["label"] = prediction.pop("class") # "class" as field name causes problem for Java objectmapper
prediction["probabilities"] = {klass: np.round(prob, 6) for klass, prob in prediction["probabilities"].items()}
image_info = {
"classification": prediction,
"position": {"x1": x1, "x2": x2, "y1": y1, "y2": y2, "pageNumber": metadata["page_idx"] + 1},
"geometry": {"width": width, "height": height},
"filters": {
"geometry": {
"imageSize": {
"quotient": quotient,
"tooLarge": max_image_to_page_quotient_breached,
"tooSmall": min_image_to_page_quotient_breached,
},
"imageFormat": {
"quotient": width / height,
"tooTall": min_image_width_to_height_quotient_breached,
"tooWide": max_image_width_to_height_quotient_breached,
},
},
"probability": {"unconfident": min_confidence_breached},
"allPassed": not any(
[
max_image_to_page_quotient_breached,
min_image_to_page_quotient_breached,
min_image_width_to_height_quotient_breached,
max_image_width_to_height_quotient_breached,
min_confidence_breached,
]
),
},
}
return image_info
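A minimal sketch of the geometric quotient used by the size filter, assuming A4 page dimensions in points:

```python
import math

def compute_geometric_quotient(page_width, page_height, x1, x2, y1, y2):
    # sqrt(image area) / sqrt(page area), as in build_image_info above.
    page_area_sqrt = math.sqrt(abs(page_width * page_height))
    image_area_sqrt = math.sqrt(abs(x2 - x1) * abs(y2 - y1))
    return image_area_sqrt / page_area_sqrt

# A 100x100 pt image on a 595x842 pt (A4) page:
q = compute_geometric_quotient(595, 842, 0, 100, 0, 100)
print(round(q, 3))  # 0.141
```

The quotient is then compared against the configured min/max bounds to flag images as too small or too large.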

@ -1 +0,0 @@
Subproject commit 4c3b26d7673457aaa99e0663dad6950cd36da967

poetry.lock generated Normal file

File diff suppressed because it is too large

pyproject.toml Normal file

@ -0,0 +1,73 @@
[tool.poetry]
name = "image-classification-service"
version = "2.17.0"
description = ""
authors = ["Team Research <research@knecon.com>"]
readme = "README.md"
packages = [{ include = "image_prediction", from = "src" }]
[tool.poetry.dependencies]
python = ">=3.10,<3.11"
# FIXME: This should be recent pyinfra, but the recent protobuf packages are not compatible with tensorflow 2.9.0, also
# see RED-9948.
pyinfra = { version = "3.4.2", source = "gitlab-research" }
kn-utils = { version = ">=0.4.0", source = "gitlab-research" }
dvc = "^2.34.0"
dvc-ssh = "^2.20.0"
dvc-azure = "^2.21.2"
Flask = "^2.1.1"
requests = "^2.27.1"
iteration-utilities = "^0.11.0"
waitress = "^2.1.1"
envyaml = "^1.10.211231"
dependency-check = "^0.6.0"
mlflow = "^1.24.0"
numpy = "^1.22.3"
tqdm = "^4.64.0"
pandas = "^1.4.2"
# FIXME: Our current model's prediction behaviour changes significantly with newer tensorflow (/ protobuf)
# versions, which are introduced by pyinfra updates pulling in newer protobuf, see RED-9948.
tensorflow = "2.9.0"
protobuf = "^3.20"
pytest = "^7.1.0"
funcy = "^2"
PyMuPDF = "^1.19.6"
fpdf = "^1.7.2"
coverage = "^6.3.2"
Pillow = "^9.1.0"
pdf2image = "^1.16.0"
frozendict = "^2.3.0"
fsspec = "^2022.11.0"
PyMonad = "^2.4.0"
pdfnetpython3 = "9.4.2"
loguru = "^0.7.0"
cyclonedx-bom = "^4.5.0"
[tool.poetry.group.dev.dependencies]
pytest = "^7.0.1"
pymonad = "^2.4.0"
pylint = "^2.17.4"
ipykernel = "^6.23.2"
[tool.pytest.ini_options]
testpaths = ["test"]
addopts = "--ignore=data"
filterwarnings = ["ignore:.*:DeprecationWarning"]
[[tool.poetry.source]]
name = "PyPI"
priority = "primary"
[[tool.poetry.source]]
name = "gitlab-research"
url = "https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi/simple"
priority = "explicit"
[[tool.poetry.source]]
name = "gitlab-red"
url = "https://gitlab.knecon.com/api/v4/groups/12/-/packages/pypi/simple"
priority = "explicit"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

View File

@ -1,21 +0,0 @@
Flask==2.0.2
requests==2.27.1
iteration-utilities==0.11.0
dvc==2.9.3
dvc[ssh]
frozendict==2.3.0
waitress==2.0.0
envyaml~=1.8.210417
dependency-check==0.6.*
envyaml~=1.8.210417
mlflow~=1.20.2
numpy~=1.19.3
PDFNetPython3~=9.1.0
tqdm~=4.62.2
pandas~=1.3.1
mlflow~=1.20.2
tensorflow~=2.5.0
PDFNetPython3~=9.1.0
Pillow~=8.3.2
PyYAML~=5.4.1
scikit_learn~=0.24.2

scripts/debug/debug.py Normal file

@ -0,0 +1,46 @@
"""Script to debug RED-9948. The predictions unexpectedly changed for some images, and we need to understand why."""
import json
import random
from pathlib import Path
import numpy as np
import tensorflow as tf
from kn_utils.logging import logger
from image_prediction.config import CONFIG
from image_prediction.pipeline import load_pipeline
def process_pdf(pipeline, pdf_path, page_range=None):
with open(pdf_path, "rb") as f:
logger.info(f"Processing {pdf_path}")
predictions = list(pipeline(f.read(), page_range=page_range))
return predictions
def ensure_seeds():
seed = 42
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)
def debug_info():
devices = tf.config.list_physical_devices()
print("Available devices:", devices)
if __name__ == "__main__":
# For in container debugging, copy the file and adjust the path.
debug_file_path = Path(__file__).parents[2] / "test" / "data" / "RED-9948" / "SYNGENTA_EFSA_sanitisation_GFL_v2"
ensure_seeds()
debug_info()
pipeline = load_pipeline(verbose=True, batch_size=CONFIG.service.batch_size)
predictions = process_pdf(pipeline, debug_file_path)
# This is the image with the wrong prediction mentioned in RED-9948. The prediction should be inconclusive,
# and the allPassed flag should be false.
predictions = [x for x in predictions if x["representation"] == "FA30F080F0C031CE17E8CF237"]
print(json.dumps(predictions, indent=2))

scripts/devenvsetup.sh Normal file

@ -0,0 +1,30 @@
#!/bin/bash
python_version=$1
gitlab_user=$2
gitlab_personal_access_token=$3
# cookiecutter https://gitlab.knecon.com/knecon/research/template-python-project.git --checkout master
# latest_dir=$(ls -td -- */ | head -n 1) # should be the dir cookiecutter just created
# cd $latest_dir
pyenv install $python_version
pyenv local $python_version
pyenv shell $python_version
pip install --upgrade pip
pip install poetry
poetry config installer.max-workers 10
# research package registry
poetry config repositories.gitlab-research https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi
poetry config http-basic.gitlab-research ${gitlab_user} ${gitlab_personal_access_token}
# redactmanager package registry
poetry config repositories.gitlab-red https://gitlab.knecon.com/api/v4/groups/12/-/packages/pypi
poetry config http-basic.gitlab-red ${gitlab_user} ${gitlab_personal_access_token}
poetry env use $(pyenv which python)
poetry install --with=dev
poetry update
source .venv/bin/activate


@ -0,0 +1,6 @@
docker build --platform linux/amd64 -t image-clsasification-service:$(poetry version -s)-dev \
-f Dockerfile \
--build-arg GITLAB_USER=$GITLAB_USER \
--build-arg GITLAB_ACCESS_TOKEN=$GITLAB_ACCESS_TOKEN \
. && \
docker run -it --rm image-clsasification-service:$(poetry version -s)-dev


@ -0,0 +1,3 @@
docker tag image-clsasification-service:$(poetry version -s)-dev $NEXUS_REGISTRY/red/image-clsasification-service:$(poetry version -s)-dev
docker push $NEXUS_REGISTRY/red/image-clsasification-service:$(poetry version -s)-dev


@ -0,0 +1,6 @@
from pyinfra.k8s_probes import startup
from loguru import logger
if __name__ == "__main__":
logger.debug("running health check")
startup.run_checks()

scripts/keras_MnWE.py Normal file

@ -0,0 +1,58 @@
import multiprocessing
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
def process(predict_fn_wrapper):
# We observed memory doesn't get properly deallocated unless we do this:
manager = multiprocessing.Manager()
return_dict = manager.dict()
p = multiprocessing.Process(
target=predict_fn_wrapper,
args=(return_dict,),
)
p.start()
p.join()
try:
return dict(return_dict)["result"]
except KeyError:
pass
def make_model():
inputs = keras.Input(shape=(784,))
dense = layers.Dense(64, activation="relu")
x = dense(inputs)
outputs = layers.Dense(10)(x)
model = keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")
model.compile(
loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
optimizer=keras.optimizers.RMSprop(),
metrics=["accuracy"],
)
return model
def make_predict_fn():
# Keras bug: doesn't work in outer scope
model = make_model()
def predict(*args):
# service_estimator = make_model()
return model.predict(np.random.random(size=(1, 784)))
return predict
def make_predict_fn_wrapper(predict_fn):
def predict_fn_wrapper(return_dict):
return_dict["result"] = predict_fn()
return predict_fn_wrapper
if __name__ == "__main__":
predict_fn = make_predict_fn()
print(process(make_predict_fn_wrapper(predict_fn)))
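The subprocess-isolation pattern above (run the prediction in a child process and pass the result back through a Manager dict so the memory is reclaimed on exit) can be shown without TensorFlow; `heavy_predict` below is a toy stand-in for a memory-hungry model call:

```python
import multiprocessing

def heavy_predict():
    # Toy stand-in for a memory-hungry model prediction.
    return sum(range(1000))

def predict_fn_wrapper(return_dict):
    return_dict["result"] = heavy_predict()

def run_in_subprocess(target):
    # Run target in a child process so its memory is fully released on exit;
    # the Manager dict carries the result back to the parent.
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    p = multiprocessing.Process(target=target, args=(return_dict,))
    p.start()
    p.join()
    return dict(return_dict).get("result")

if __name__ == "__main__":
    print(run_in_subprocess(predict_fn_wrapper))  # 499500
```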


@ -6,7 +6,7 @@ import requests
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("--pdf_path", required=True)
parser.add_argument("pdf_path")
args = parser.parse_args()
return args

scripts/run_pipeline.py Normal file

@ -0,0 +1,58 @@
import argparse
import json
import os
from glob import glob
from image_prediction.config import CONFIG
from image_prediction.pipeline import load_pipeline
from image_prediction.utils import get_logger
from image_prediction.utils.pdf_annotation import annotate_pdf
logger = get_logger()
logger.setLevel("DEBUG")
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("input", help="pdf file or directory")
parser.add_argument("--print", "-p", help="print output to terminal", action="store_true", default=False)
parser.add_argument("--page_interval", "-i", help="page interval [i, j), min index = 0", nargs=2, type=int)
args = parser.parse_args()
return args
def process_pdf(pipeline, pdf_path, page_range=None):
with open(pdf_path, "rb") as f:
logger.info(f"Processing {pdf_path}")
predictions = list(pipeline(f.read(), page_range=page_range))
annotate_pdf(
pdf_path, predictions, os.path.join("/tmp", os.path.basename(pdf_path.replace(".pdf", "_annotated.pdf")))
)
return predictions
def main(args):
pipeline = load_pipeline(verbose=CONFIG.service.verbose, batch_size=CONFIG.service.batch_size, tolerance=CONFIG.service.image_stiching_tolerance)
if os.path.isfile(args.input):
pdf_paths = [args.input]
else:
pdf_paths = glob(os.path.join(args.input, "*.pdf"))
page_range = range(*args.page_interval) if args.page_interval else None
for pdf_path in pdf_paths:
predictions = process_pdf(pipeline, pdf_path, page_range=page_range)
if args.print:
print(pdf_path)
print(json.dumps(predictions, indent=2))
if __name__ == "__main__":
args = parse_args()
main(args)

scripts/run_tests.sh Executable file

@ -0,0 +1,15 @@
echo "${bamboo_nexus_password}" | docker login --username "${bamboo_nexus_user}" --password-stdin nexus.iqser.com:5001
pip install dvc
pip install 'dvc[ssh]'
echo "Pulling dvc data"
dvc pull
docker build -f Dockerfile_tests -t image-prediction-tests .
rnd=$(date +"%s")
name=image-prediction-tests-${rnd}
echo "running tests container"
docker run --rm --name $name -v $PWD:$PWD -w $PWD -v /var/run/docker.sock:/var/run/docker.sock image-prediction-tests


@ -1,13 +0,0 @@
#!/usr/bin/env python
from distutils.core import setup
setup(
name="image_prediction",
version="0.1.0",
description="",
author="",
author_email="",
url="",
packages=["image_prediction"],
)


@ -1,15 +0,0 @@
#!/bin/bash
set -e
python3 -m venv build_venv
source build_venv/bin/activate
python3 -m pip install --upgrade pip
pip install dvc
pip install 'dvc[ssh]'
dvc pull
git submodule update --init --recursive
docker build -f Dockerfile_base -t image-prediction-base .
docker build -f Dockerfile -t image-prediction .


@ -1,4 +0,0 @@
sonar.exclusions=bamboo-specs/**, **/test_data/**
sonar.c.file.suffixes=-
sonar.cpp.file.suffixes=-
sonar.objc.file.suffixes=-


@ -0,0 +1,13 @@
import logging
import sys
# log config
LOG_FORMAT = "%(asctime)s [%(levelname)s] - [%(filename)s -> %(funcName)s() -> %(lineno)s] : %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler_format = logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT)
stream_handler.setFormatter(stream_handler_format)
logger = logging.getLogger(__name__)
logger.propagate = False
logger.addHandler(stream_handler)


@ -0,0 +1,35 @@
from typing import List, Union, Tuple
import numpy as np
from PIL.Image import Image
from funcy import rcompose
from image_prediction.estimator.adapter.adapter import EstimatorAdapter
from image_prediction.label_mapper.mapper import LabelMapper
from image_prediction.utils import get_logger
logger = get_logger()
class Classifier:
def __init__(self, estimator_adapter: EstimatorAdapter, label_mapper: LabelMapper):
"""Abstraction layer over different estimator backends (e.g. keras or scikit-learn). For each backend to be used
an EstimatorAdapter must be implemented.
Args:
estimator_adapter: adapter for a given estimator backend
"""
self.__estimator_adapter = estimator_adapter
self.__label_mapper = label_mapper
self.__pipe = rcompose(self.__estimator_adapter, self.__label_mapper)
def predict(self, batch: Union[np.array, Tuple[Image]]) -> List[str]:
if isinstance(batch, np.ndarray) and batch.shape[0] == 0:
return []
return self.__pipe(batch)
def __call__(self, batch: np.array) -> List[str]:
logger.debug("Classifier.predict")
return self.predict(batch)


@ -0,0 +1,32 @@
from itertools import chain
from typing import Iterable
from PIL.Image import Image
from funcy import rcompose, chunks
from image_prediction.classifier.classifier import Classifier
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
from image_prediction.estimator.preprocessor.preprocessors.identity import IdentityPreprocessor
from image_prediction.utils import get_logger
logger = get_logger()
class ImageClassifier:
"""Combines a classifier with a preprocessing pipeline: Receives images, chunks into batches, converts to tensors,
applies transformations and finally sends to internal classifier.
"""
def __init__(self, classifier: Classifier, preprocessor: Preprocessor = None):
self.estimator = classifier
self.preprocessor = preprocessor if preprocessor else IdentityPreprocessor()
self.pipe = rcompose(self.preprocessor, self.estimator)
def predict(self, images: Iterable[Image], batch_size=16):
batches = chunks(batch_size, images)
predictions = chain.from_iterable(map(self.pipe, batches))
return predictions
def __call__(self, images: Iterable[Image], batch_size=16):
logger.debug("ImageClassifier.predict")
yield from self.predict(images, batch_size=batch_size)


@ -0,0 +1,16 @@
from funcy import rcompose
from image_prediction.transformer.transformer import Transformer
from image_prediction.utils import get_logger
logger = get_logger()
class TransformerCompositor(Transformer):
def __init__(self, formatter: Transformer, *formatters: Transformer):
formatters = (formatter, *formatters)
self.pipe = rcompose(*formatters)
def transform(self, obj):
logger.debug("TransformerCompositor.transform")
return self.pipe(obj)


@ -0,0 +1,7 @@
from pathlib import Path
from pyinfra.config.loader import load_settings
from image_prediction.locations import PROJECT_ROOT_DIR
CONFIG = load_settings(root_path=PROJECT_ROOT_DIR, settings_path="config")


@ -0,0 +1,43 @@
from funcy import juxt
from image_prediction.classifier.classifier import Classifier
from image_prediction.classifier.image_classifier import ImageClassifier
from image_prediction.compositor.compositor import TransformerCompositor
from image_prediction.encoder.encoders.hash_encoder import HashEncoder
from image_prediction.estimator.adapter.adapter import EstimatorAdapter
from image_prediction.formatter.formatters.camel_case import Snake2CamelCaseKeyFormatter
from image_prediction.formatter.formatters.enum import EnumFormatter
from image_prediction.image_extractor.extractors.parsable import ParsablePDFImageExtractor
from image_prediction.label_mapper.mappers.probability import ProbabilityMapper
from image_prediction.model_loader.loader import ModelLoader
from image_prediction.model_loader.loaders.mlflow import MlflowConnector
from image_prediction.redai_adapter.mlflow import MlflowModelReader
from image_prediction.transformer.transformers.coordinate.pdfnet import PDFNetCoordinateTransformer
from image_prediction.transformer.transformers.response import ResponseTransformer
def get_mlflow_model_loader(mlruns_dir):
model_loader = ModelLoader(MlflowConnector(MlflowModelReader(mlruns_dir)))
return model_loader
def get_image_classifier(model_loader, model_identifier):
model, classes = juxt(model_loader.load_model, model_loader.load_classes)(model_identifier)
return ImageClassifier(Classifier(EstimatorAdapter(model), ProbabilityMapper(classes)))
def get_extractor(**kwargs):
image_extractor = ParsablePDFImageExtractor(**kwargs)
return image_extractor
def get_formatter():
formatter = TransformerCompositor(
PDFNetCoordinateTransformer(), EnumFormatter(), ResponseTransformer(), Snake2CamelCaseKeyFormatter()
)
return formatter
def get_encoder():
return HashEncoder()


@ -0,0 +1,13 @@
import abc
from typing import Iterable
from PIL.Image import Image
class Encoder(abc.ABC):
@abc.abstractmethod
def encode(self, images: Iterable[Image]):
raise NotImplementedError
def __call__(self, images: Iterable[Image], batch_size=16):
yield from self.encode(images)


@ -0,0 +1,26 @@
from typing import Iterable
from PIL import Image
from image_prediction.encoder.encoder import Encoder
class HashEncoder(Encoder):
def encode(self, images: Iterable[Image.Image]):
yield from map(hash_image, images)
def __call__(self, images: Iterable[Image.Image], batch_size=16):
yield from self.encode(images)
def hash_image(image: Image.Image) -> str:
"""See: https://stackoverflow.com/a/49692185/3578468"""
image = image.resize((10, 10), Image.ANTIALIAS)
image = image.convert("L")
pixel_data = list(image.getdata())
avg_pixel = sum(pixel_data) / len(pixel_data)
bits = "".join(["1" if (px >= avg_pixel) else "0" for px in pixel_data])
hex_representation = str(hex(int(bits, 2)))[2:][::-1].upper()
# Note: For each 4 leading zeros, the hex representation will be shorter by one character.
# To ensure that all hashes have the same length, we pad the hex representation with zeros (also see RED-3813).
return hex_representation.zfill(25)
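The bit-packing and padding step of the average hash can be exercised in isolation (Pillow's resize and grayscale-conversion steps are elided here; the pixel list is synthetic):

```python
def pixels_to_hash(pixels):
    # Threshold each pixel at the mean, pack the bits into an int, and render
    # it as a reversed, zero-padded hex string, as hash_image does for a
    # 10x10 grayscale thumbnail.
    avg = sum(pixels) / len(pixels)
    bits = "".join("1" if px >= avg else "0" for px in pixels)
    return str(hex(int(bits, 2)))[2:][::-1].upper().zfill(25)

# Synthetic 10x10 "image": top half bright, bottom half dark.
h = pixels_to_hash([255] * 50 + [0] * 50)
print(len(h))  # 25
```

The zfill(25) pad keeps hashes with leading zero nibbles at a fixed length, matching the RED-3813 note above.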


@ -0,0 +1,15 @@
from image_prediction.utils import get_logger
logger = get_logger()
class EstimatorAdapter:
def __init__(self, estimator):
self.estimator = estimator
def predict(self, batch):
return self.estimator(batch)
def __call__(self, batch):
logger.debug("EstimatorAdapter.predict")
return self.predict(batch)


@ -0,0 +1,10 @@
import abc
class Preprocessor(abc.ABC):
@abc.abstractmethod
def preprocess(self, batch):
raise NotImplementedError
def __call__(self, batch):
return self.preprocess(batch)


@ -0,0 +1,10 @@
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
from image_prediction.estimator.preprocessor.utils import images_to_batch_tensor
class BasicPreprocessor(Preprocessor):
"""Converts images to tensors"""
@staticmethod
def preprocess(images):
return images_to_batch_tensor(images)


@ -0,0 +1,10 @@
from image_prediction.estimator.preprocessor.preprocessor import Preprocessor
class IdentityPreprocessor(Preprocessor):
@staticmethod
def preprocess(images):
return images
def __call__(self, images):
return self.preprocess(images)


@ -0,0 +1,10 @@
import numpy as np
from PIL.Image import Image
def image_to_normalized_tensor(image: Image) -> np.ndarray:
return np.array(image) / 255
def images_to_batch_tensor(images) -> np.ndarray:
return np.array(list(map(image_to_normalized_tensor, images)))
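A sketch of the normalization helpers, using raw uint8 arrays in place of PIL images (np.array() accepts both):

```python
import numpy as np

def image_to_normalized_tensor(image):
    # Scale 0..255 pixel values into the 0..1 range.
    return np.array(image) / 255

def images_to_batch_tensor(images):
    return np.array(list(map(image_to_normalized_tensor, images)))

batch = images_to_batch_tensor([
    np.full((2, 2, 3), 255, dtype=np.uint8),  # all-white image
    np.zeros((2, 2, 3), dtype=np.uint8),      # all-black image
])
print(batch.shape, batch.max(), batch.min())  # (2, 2, 2, 3) 1.0 0.0
```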


@ -0,0 +1,42 @@
class UnknownEstimatorAdapter(ValueError):
pass
class UnknownImageExtractor(ValueError):
pass
class UnknownModelLoader(ValueError):
pass
class UnknownDatabaseType(ValueError):
pass
class UnknownLabelFormat(ValueError):
pass
class UnexpectedLabelFormat(ValueError):
pass
class IncorrectInstantiation(RuntimeError):
pass
class IntentionalTestException(RuntimeError):
pass
class InvalidBox(Exception):
pass
class ParsingError(Exception):
pass
class BadXref(ValueError):
pass


@ -0,0 +1,13 @@
from image_prediction.image_extractor.extractors.parsable import ParsablePDFImageExtractor
def extract_images_from_pdf(pdf, extractor=None):
if not extractor:
extractor = ParsablePDFImageExtractor()
try:
images_extracted, metadata_extracted = zip(*extractor(pdf))
return images_extracted, metadata_extracted
except ValueError:
return [], []


@ -0,0 +1,60 @@
from typing import Callable
from flask import Flask, request, jsonify
from prometheus_client import generate_latest, CollectorRegistry, Summary
from image_prediction.utils import get_logger
from image_prediction.utils.process_wrapping import wrap_in_process
logger = get_logger()
def make_prediction_server(predict_fn: Callable):
app = Flask(__name__)
registry = CollectorRegistry(auto_describe=True)
metric = Summary(
"redactmanager_imageClassification_seconds", "Time spent on image-service classification.", registry=registry
)
@app.route("/ready", methods=["GET"])
def ready():
resp = jsonify("OK")
resp.status_code = 200
return resp
@app.route("/health", methods=["GET"])
def healthy():
resp = jsonify("OK")
resp.status_code = 200
return resp
def __failure():
response = jsonify("Analysis failed")
response.status_code = 500
return response
@app.route("/predict", methods=["POST"])
@app.route("/", methods=["POST"])
@metric.time()
def predict():
# Tensorflow does not free RAM. Workaround: Run prediction function (which instantiates a model) in sub-process.
# See: https://stackoverflow.com/questions/39758094/clearing-tensorflow-gpu-memory-after-model-execution
predict_fn_wrapped = wrap_in_process(predict_fn)
logger.info("Analysing...")
predictions = predict_fn_wrapped(request.data)
if predictions is not None:
response = jsonify(predictions)
logger.info("Analysis completed.")
return response
else:
logger.error("Analysis failed.")
return __failure()
@app.route("/prometheus", methods=["GET"])
def prometheus():
return generate_latest(registry=registry)
return app


@ -0,0 +1,15 @@
import abc
from image_prediction.transformer.transformer import Transformer
class Formatter(Transformer):
@abc.abstractmethod
def format(self, obj):
raise NotImplementedError
def transform(self, obj):
raise NotImplementedError()
def __call__(self, obj):
return self.format(obj)


@ -0,0 +1,11 @@
from image_prediction.formatter.formatters.key_formatter import KeyFormatter
class Snake2CamelCaseKeyFormatter(KeyFormatter):
def format_key(self, key):
if isinstance(key, str):
head, *tail = key.split("_")
return head + "".join(map(str.title, tail))
else:
return key
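The key-conversion rule can be exercised standalone; `snake_to_camel` below mirrors `format_key`:

```python
def snake_to_camel(key):
    # Mirrors Snake2CamelCaseKeyFormatter.format_key: keep the head segment,
    # title-case the rest; non-string keys pass through unchanged.
    if not isinstance(key, str):
        return key
    head, *tail = key.split("_")
    return head + "".join(map(str.title, tail))

print(snake_to_camel("page_number"))  # pageNumber
print(snake_to_camel("all_passed"))   # allPassed
```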


@ -0,0 +1,23 @@
from enum import Enum
from image_prediction.formatter.formatters.key_formatter import KeyFormatter
class EnumFormatter(KeyFormatter):
def format_key(self, key):
return key.value if isinstance(key, Enum) else key
def transform(self, obj):
raise NotImplementedError
class ReverseEnumFormatter(KeyFormatter):
def __init__(self, enum):
self.enum = enum
self.reverse_enum = {e.value: e for e in enum}
def format_key(self, key):
return self.reverse_enum.get(key, key)
def transform(self, obj):
raise NotImplementedError


@ -0,0 +1,6 @@
from image_prediction.formatter.formatter import Formatter
class IdentityFormatter(Formatter):
def format(self, obj):
return obj


@ -0,0 +1,28 @@
import abc
from typing import Iterable
from image_prediction.formatter.formatter import Formatter
class KeyFormatter(Formatter):
@abc.abstractmethod
def format_key(self, key):
raise NotImplementedError
def __format(self, data):
# If we wanted to do this properly, we would need handlers for all expected types and dispatch based
# on a type comparison. This is too much engineering for the limited use-case of this class though.
if isinstance(data, Iterable) and not isinstance(data, dict) and not isinstance(data, str):
f = map(self.__format, data)
return type(data)(f) if not isinstance(data, map) else f
if not isinstance(data, dict):
return data
keys_formatted = list(map(self.format_key, data))
return dict(zip(keys_formatted, map(self.__format, data.values())))
def format(self, data):
return self.__format(data)


@ -0,0 +1,19 @@
import abc
from collections import namedtuple
from typing import Iterable
from image_prediction.utils import get_logger
ImageMetadataPair = namedtuple("ImageMetadataPair", ["image", "metadata"])
logger = get_logger()
class ImageExtractor(abc.ABC):
@abc.abstractmethod
def extract(self, obj) -> Iterable[ImageMetadataPair]:
raise NotImplementedError
def __call__(self, obj, **kwargs):
logger.debug("ImageExtractor.extract")
return self.extract(obj, **kwargs)


@ -0,0 +1,7 @@
from image_prediction.image_extractor.extractor import ImageExtractor, ImageMetadataPair
class ImageExtractorMock(ImageExtractor):
def extract(self, image_container):
for i, image in enumerate(image_container):
yield ImageMetadataPair(image, {"image_id": i})


@ -0,0 +1,300 @@
import atexit
import json
import traceback
from functools import partial, lru_cache
from itertools import chain, starmap, filterfalse, tee
from operator import itemgetter, truth
from typing import Iterable, Iterator, List, Union
import fitz
import numpy as np
from PIL import Image
from funcy import merge, pluck, compose, rcompose, remove, keep
from scipy.stats import gmean
from image_prediction.config import CONFIG
from image_prediction.exceptions import InvalidBox
from image_prediction.formatter.formatters.enum import EnumFormatter
from image_prediction.image_extractor.extractor import ImageExtractor, ImageMetadataPair
from image_prediction.info import Info
from image_prediction.stitching.stitching import stitch_pairs
from image_prediction.stitching.utils import validate_box
from image_prediction.transformer.transformers.response import compute_geometric_quotient
from image_prediction.utils import get_logger
logger = get_logger()
class ParsablePDFImageExtractor(ImageExtractor):
def __init__(self, verbose=False, tolerance=0):
"""
Args:
verbose: Whether to show progressbar
tolerance: The tolerance in pixels for the distance between images, beyond which they will not be stitched
together
"""
self.doc: fitz.Document = None
self.verbose = verbose
self.tolerance = tolerance
def extract(self, pdf: bytes, page_range: range = None):
self.doc = fitz.Document(stream=pdf)
pages = extract_pages(self.doc, page_range) if page_range else self.doc
image_metadata_pairs = chain.from_iterable(map(self.__process_images_on_page, pages))
yield from image_metadata_pairs
def __process_images_on_page(self, page: fitz.Page):
metadata = extract_valid_metadata(self.doc, page)
images = get_images_on_page(self.doc, metadata)
clear_caches()
image_metadata_pairs = starmap(ImageMetadataPair, filter(all, zip(images, metadata)))
# TODO: In the future, consider to introduce an image validator as a pipeline component rather than doing the
# validation here. Invalid images can then be split into a different stream and joined with the intact images
# again for the formatting step.
image_metadata_pairs = self.__filter_valid_images(image_metadata_pairs)
image_metadata_pairs = stitch_pairs(list(image_metadata_pairs), tolerance=self.tolerance)
yield from image_metadata_pairs
@staticmethod
def __filter_valid_images(image_metadata_pairs: Iterable[ImageMetadataPair]) -> Iterator[ImageMetadataPair]:
def validate_image_is_not_corrupt(image: Image.Image, metadata: dict):
"""See RED-5148: Some images are corrupt and cannot be processed by the image classifier. This function
filters out such images by trying to resize and convert them to RGB. If this fails, the image is considered
corrupt and is dropped.
TODO: find cleaner solution
"""
try:
image.resize((100, 100)).convert("RGB")
return ImageMetadataPair(image, metadata)
except Exception:
metadata = json.dumps(EnumFormatter()(metadata), indent=2)
logger.warning(f"Invalid image encountered. Image metadata:\n{metadata}\n\n{traceback.format_exc()}")
return None
def filter_small_images_on_scanned_pages(image_metadata_pairs) -> Iterable[ImageMetadataPair]:
"""See RED-9746: Small images on scanned pages should be dropped, so they are not classified. This is a
heuristic to filter out images that are too small in relation to the page size if they are on a scanned page.
The ratio is computed as the geometric mean of the width and height of the image divided by the geometric mean
of the width and height of the page. If the ratio is below the threshold, the image is dropped.
"""
def image_is_a_scanned_page(image_metadata_pair: ImageMetadataPair) -> bool:
tolerance = CONFIG.filters.is_scanned_page.tolerance
width_ratio = image_metadata_pair.metadata[Info.WIDTH] / image_metadata_pair.metadata[Info.PAGE_WIDTH]
height_ratio = (
image_metadata_pair.metadata[Info.HEIGHT] / image_metadata_pair.metadata[Info.PAGE_HEIGHT]
)
return width_ratio >= 1 - tolerance and height_ratio >= 1 - tolerance
def image_fits_geometric_mean_ratio(image_metadata_pair: ImageMetadataPair) -> bool:
min_ratio = CONFIG.filters.image_to_page_quotient.min
metadatum = image_metadata_pair.metadata
image_gmean = gmean([metadatum[Info.WIDTH], metadatum[Info.HEIGHT]])
page_gmean = gmean([metadatum[Info.PAGE_WIDTH], metadatum[Info.PAGE_HEIGHT]])
ratio = image_gmean / page_gmean
return ratio >= min_ratio
pairs, pairs_copy = tee(image_metadata_pairs)
if any(map(image_is_a_scanned_page, pairs_copy)):
logger.debug("Scanned page detected, filtering out small images ...")
return filter(image_fits_geometric_mean_ratio, pairs)
else:
return pairs
image_metadata_pairs = filter_small_images_on_scanned_pages(image_metadata_pairs)
return filter(truth, starmap(validate_image_is_not_corrupt, image_metadata_pairs))
def extract_pages(doc, page_range):
page_range = range(page_range.start + 1, page_range.stop + 1)
pages = map(doc.load_page, page_range)
yield from pages
def get_images_on_page(doc, metadata):
xrefs = pluck(Info.XREF, metadata)
images = map(partial(xref_to_image, doc), xrefs)
yield from images
def extract_valid_metadata(doc: fitz.Document, page: fitz.Page):
metadata = get_metadata_for_images_on_page(page)
metadata = filter_valid_metadata(metadata)
metadata = add_alpha_channel_info(doc, metadata)
return list(metadata)
def get_metadata_for_images_on_page(page: fitz.Page):
metadata = map(get_image_metadata, get_image_infos(page))
metadata = add_page_metadata(page, metadata)
yield from metadata
def filter_valid_metadata(metadata):
yield from compose(
# TODO: Disabled for now, since the backend currently needs the metadata and the hash of every image, even
# for scanned pages. In the future, this should be resolved differently, e.g. by filtering out all page-sized
# images and giving the user the ability to reclassify false positives with a separate call.
# filter_out_page_sized_images,
filter_out_tiny_images,
filter_out_invalid_metadata,
)(metadata)
def filter_out_invalid_metadata(metadata):
def __validate_box(box):
try:
return validate_box(box)
except InvalidBox as err:
logger.debug(f"Dropping invalid metadatum, reason: {err}")
yield from keep(__validate_box, metadata)
def filter_out_page_sized_images(metadata):
yield from remove(breaches_image_to_page_quotient, metadata)
def filter_out_tiny_images(metadata):
yield from filterfalse(tiny, metadata)
@lru_cache(maxsize=None)
def get_image_infos(page: fitz.Page) -> List[dict]:
return page.get_image_info(xrefs=True)
@lru_cache(maxsize=None)
def xref_to_image(doc, xref) -> Union[Image.Image, None]:
# NOTE: image extraction is done via pixmap to array, as this method is twice as fast as extraction via bytestream
try:
pixmap = fitz.Pixmap(doc, xref)
array = convert_pixmap_to_array(pixmap)
return Image.fromarray(array)
except ValueError:
logger.debug(f"Xref {xref} is invalid, skipping extraction ...")
return
def convert_pixmap_to_array(pixmap: fitz.Pixmap):
array = np.frombuffer(pixmap.samples, dtype=np.uint8).reshape(pixmap.h, pixmap.w, pixmap.n)
array = _normalize_channels(array)
return array
def _normalize_channels(array: np.ndarray):
if array.shape[-1] == 1:
array = array[:, :, 0]
elif array.shape[-1] == 4:
array = array[..., :3]
elif array.shape[-1] != 3:
logger.warning(f"Unexpected image format: {array.shape}.")
raise ValueError(f"Unexpected image format: {array.shape}.")
return array
def get_image_metadata(image_info):
xref, coords = itemgetter("xref", "bbox")(image_info)
x1, y1, x2, y2 = map(rounder, coords)
width = abs(x2 - x1)
height = abs(y2 - y1)
return {
Info.WIDTH: width,
Info.HEIGHT: height,
Info.X1: x1,
Info.X2: x2,
Info.Y1: y1,
Info.Y2: y2,
Info.XREF: xref,
}
def add_page_metadata(page, metadata):
yield from map(partial(merge, get_page_metadata(page)), metadata)
def add_alpha_channel_info(doc, metadata):
def add_alpha_value_to_metadatum(metadatum):
alpha = metadatum_to_alpha_value(metadatum)
return {**metadatum, Info.ALPHA: alpha}
xref_to_alpha = partial(has_alpha_channel, doc)
metadatum_to_alpha_value = compose(xref_to_alpha, itemgetter(Info.XREF))
yield from map(add_alpha_value_to_metadatum, metadata)
@lru_cache(maxsize=None)
def load_image_handle_from_xref(doc, xref):
try:
return doc.extract_image(xref)
except ValueError:
logger.debug(f"Xref {xref} is invalid, skipping extraction ...")
return
rounder = rcompose(round, int)
def get_page_metadata(page):
page_width, page_height = map(rounder, page.mediabox_size)
return {
Info.PAGE_WIDTH: page_width,
Info.PAGE_HEIGHT: page_height,
Info.PAGE_IDX: page.number,
}
def has_alpha_channel(doc, xref):
maybe_image = load_image_handle_from_xref(doc, xref)
maybe_smask = maybe_image["smask"] if maybe_image else None
if maybe_smask:
return any([doc.extract_image(maybe_smask) is not None, bool(fitz.Pixmap(doc, maybe_smask).alpha)])
else:
try:
return bool(fitz.Pixmap(doc, xref).alpha)
except ValueError:
logger.debug(f"Encountered invalid xref `{xref}` in {doc.metadata.get('title', '<no title>')}.")
return False
def tiny(metadata):
return metadata[Info.WIDTH] * metadata[Info.HEIGHT] <= 4
def clear_caches():
get_image_infos.cache_clear()
load_image_handle_from_xref.cache_clear()
xref_to_image.cache_clear()
atexit.register(clear_caches)
def breaches_image_to_page_quotient(metadatum):
page_width, page_height, x1, x2, y1, y2, width, height = itemgetter(
Info.PAGE_WIDTH, Info.PAGE_HEIGHT, Info.X1, Info.X2, Info.Y1, Info.Y2, Info.WIDTH, Info.HEIGHT
)(metadatum)
geometric_quotient = compute_geometric_quotient(page_width, page_height, x2, x1, y2, y1)
quotient_breached = bool(geometric_quotient > CONFIG.filters.image_to_page_quotient.max)
return quotient_breached


@@ -0,0 +1,15 @@
from enum import Enum
class Info(Enum):
PAGE_WIDTH = "page_width"
PAGE_HEIGHT = "page_height"
PAGE_IDX = "page_idx"
WIDTH = "width"
HEIGHT = "height"
X1 = "x1"
X2 = "x2"
Y1 = "y1"
Y2 = "y2"
ALPHA = "alpha"
XREF = "xref"


@@ -0,0 +1,10 @@
import abc
class LabelMapper(abc.ABC):
@abc.abstractmethod
def map_labels(self, items):
raise NotImplementedError
def __call__(self, items):
return self.map_labels(items)


@@ -0,0 +1,20 @@
from typing import Mapping, Iterable
from image_prediction.exceptions import UnexpectedLabelFormat
from image_prediction.label_mapper.mapper import LabelMapper
class IndexMapper(LabelMapper):
def __init__(self, labels: Mapping[int, str]):
self.__labels = labels
def __validate_index_label_format(self, index_label: int) -> None:
if not 0 <= index_label < len(self.__labels):
raise UnexpectedLabelFormat(f"Received index label '{index_label}' that has no associated string label.")
def __map_label(self, index_label: int) -> str:
self.__validate_index_label_format(index_label)
return self.__labels[index_label]
def map_labels(self, index_labels: Iterable[int]) -> Iterable[str]:
return map(self.__map_label, index_labels)
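A standalone sketch of the `IndexMapper` contract: an integer class index from the model is validated against the label table and translated to its string label. The label table below is made up for illustration (the real one comes from the loaded model's classes):

```python
# Hypothetical label table; in the service it is loaded alongside the model.
labels = {0: "signature", 1: "logo", 2: "photo"}

def map_index(index_label):
    # Reject indices that have no associated string label.
    if not 0 <= index_label < len(labels):
        raise ValueError(f"Received index label '{index_label}' that has no associated string label.")
    return labels[index_label]

print([map_index(i) for i in (2, 0)])  # ['photo', 'signature']
```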


@@ -0,0 +1,39 @@
from enum import Enum
from operator import itemgetter
from typing import Mapping, Iterable
import numpy as np
from funcy import rcompose, rpartial
from image_prediction.exceptions import UnexpectedLabelFormat
from image_prediction.label_mapper.mapper import LabelMapper
class ProbabilityMapperKeys(Enum):
LABEL = "label"
PROBABILITIES = "probabilities"
class ProbabilityMapper(LabelMapper):
def __init__(self, labels: Mapping[int, str]):
self.__labels = labels
# String conversion in the middle due to floating point precision issues.
# See: https://stackoverflow.com/questions/56820/round-doesnt-seem-to-be-rounding-properly
self.__rounder = rcompose(rpartial(round, 4), str, float)
def __validate_array_label_format(self, probabilities: np.ndarray) -> None:
if not len(probabilities) == len(self.__labels):
raise UnexpectedLabelFormat(
f"Received {len(probabilities)} probabilities, but {len(self.__labels)} labels were passed."
)
def __map_array(self, probabilities: np.ndarray) -> dict:
self.__validate_array_label_format(probabilities)
cls2prob = dict(
sorted(zip(self.__labels, list(map(self.__rounder, probabilities))), key=itemgetter(1), reverse=True)
)
most_likely = [*cls2prob][0]
return {ProbabilityMapperKeys.LABEL: most_likely, ProbabilityMapperKeys.PROBABILITIES: cls2prob}
def map_labels(self, probabilities: Iterable[np.ndarray]) -> Iterable[dict]:
return map(self.__map_array, probabilities)
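The rounding chain above (`round`, then a `str` round-trip to normalize the float representation) and the probability-sorted label dict can be sketched without `funcy`. The label names below are illustrative:

```python
from operator import itemgetter

# Hypothetical label names; the real ones come from the model's classes.
labels = ["signature", "logo", "photo"]

def rounder(x):
    # Round to 4 places, then round-trip through str to normalize the float repr.
    return float(str(round(x, 4)))

def map_array(probabilities):
    if len(probabilities) != len(labels):
        raise ValueError("probability/label count mismatch")
    # Sort label -> probability pairs by probability, descending.
    cls2prob = dict(sorted(zip(labels, map(rounder, probabilities)),
                           key=itemgetter(1), reverse=True))
    most_likely = next(iter(cls2prob))  # first key of the sorted dict
    return {"label": most_likely, "probabilities": cls2prob}

print(map_array([0.1, 0.70001, 0.19999]))
# {'label': 'logo', 'probabilities': {'logo': 0.7, 'photo': 0.2, 'signature': 0.1}}
```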


@@ -0,0 +1,18 @@
"""Defines constant paths relative to the module root path."""
from pathlib import Path
# FIXME: move these paths to config, only depending on 'ROOT_PATH' environment variable.
MODULE_DIR = Path(__file__).resolve().parents[0]
PACKAGE_ROOT_DIR = MODULE_DIR.parents[0]
PROJECT_ROOT_DIR = PACKAGE_ROOT_DIR.parents[0]
CONFIG_FILE = PROJECT_ROOT_DIR / "config" / "settings.toml"
BANNER_FILE = PROJECT_ROOT_DIR / "banner.txt"
DATA_DIR = PROJECT_ROOT_DIR / "data"
MLRUNS_DIR = str(DATA_DIR / "mlruns")
TEST_DIR = PROJECT_ROOT_DIR / "test"
TEST_DATA_DIR = TEST_DIR / "data"
TEST_DATA_DIR_DVC = TEST_DIR / "data.dvc"


@@ -0,0 +1,7 @@
import abc
class DatabaseConnector(abc.ABC):
@abc.abstractmethod
def get_object(self, identifier):
raise NotImplementedError


@@ -0,0 +1,9 @@
from image_prediction.model_loader.database.connector import DatabaseConnector
class DatabaseConnectorMock(DatabaseConnector):
def __init__(self, store: dict):
self.store = store
def get_object(self, identifier):
return self.store[identifier]


@@ -0,0 +1,18 @@
from functools import lru_cache
from image_prediction.model_loader.database.connector import DatabaseConnector
class ModelLoader:
def __init__(self, database_connector: DatabaseConnector):
self.database_connector = database_connector
@lru_cache(maxsize=None)
def __get_object(self, identifier):
return self.database_connector.get_object(identifier)
def load_model(self, identifier):
return self.__get_object(identifier)["model"]
def load_classes(self, identifier):
return self.__get_object(identifier)["classes"]
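`lru_cache` on a method keys the cache on `(self, identifier)`, so each `ModelLoader` instance hits its connector at most once per identifier. A minimal sketch with a counting stand-in connector (names are illustrative):

```python
from functools import lru_cache

class CountingConnector:
    """Stand-in connector that counts how often it is queried."""
    def __init__(self):
        self.calls = 0
    def get_object(self, identifier):
        self.calls += 1
        return {"model": f"model-{identifier}", "classes": ["a", "b"]}

class Loader:
    def __init__(self, connector):
        self.connector = connector
    @lru_cache(maxsize=None)  # keyed on (self, identifier)
    def _get_object(self, identifier):
        return self.connector.get_object(identifier)

conn = CountingConnector()
loader = Loader(conn)
loader._get_object("run-1")["model"]
loader._get_object("run-1")["classes"]
print(conn.calls)  # 1: the second lookup is served from the cache
```

Note that `lru_cache` on a method holds a reference to `self`, so cached loader instances are not garbage-collected until the cache is cleared.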


@@ -0,0 +1,10 @@
from image_prediction.model_loader.database.connector import DatabaseConnector
from image_prediction.redai_adapter.mlflow import MlflowModelReader
class MlflowConnector(DatabaseConnector):
def __init__(self, mlflow_reader: MlflowModelReader):
self.mlflow_reader = mlflow_reader
def get_object(self, run_id):
return self.mlflow_reader[run_id]


@@ -0,0 +1,105 @@
import os
from functools import lru_cache, partial
from itertools import chain, tee
from typing import Iterable, Any
from funcy import rcompose, first, compose, second, chunks, identity, rpartial
from kn_utils.logging import logger
from tqdm import tqdm
from image_prediction.config import CONFIG
from image_prediction.default_objects import (
get_formatter,
get_mlflow_model_loader,
get_image_classifier,
get_extractor,
get_encoder,
)
from image_prediction.locations import MLRUNS_DIR
from image_prediction.utils.generic import lift, starlift
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
@lru_cache(maxsize=None)
def load_pipeline(**kwargs):
logger.info(f"Loading pipeline with kwargs: {kwargs}")
model_loader = get_mlflow_model_loader(MLRUNS_DIR)
model_identifier = CONFIG.service.mlflow_run_id
pipeline = Pipeline(model_loader, model_identifier, **kwargs)
return pipeline
def parallel(*fs):
return lambda *args: (f(a) for f, a in zip(fs, args))
def star(f):
return lambda x: f(*x)
class Pipeline:
def __init__(self, model_loader, model_identifier, batch_size=16, verbose=False, **kwargs):
self.verbose = verbose
extract = get_extractor(**kwargs)
classifier = get_image_classifier(model_loader, model_identifier)
reformat = get_formatter()
represent = get_encoder()
split = compose(star(parallel(*map(lift, (first, first, second)))), rpartial(tee, 3))
classify = compose(chain.from_iterable, lift(classifier), partial(chunks, batch_size))
pairwise_apply = compose(star, parallel)
join = compose(starlift(lambda prd, rpr, mdt: {"classification": prd, **mdt, "representation": rpr}), star(zip))
# />--classify--\
# --extract-->--split--+->--encode---->+--join-->reformat-->filter_duplicates
# \>--identity--/
self.pipe = rcompose(
extract, # ... image-metadata-pairs as a stream
split, # ... into an image stream and a metadata stream
pairwise_apply(classify, represent, identity), # ... apply functions to the streams pairwise
join, # ... the streams by zipping
reformat, # ... the items
filter_duplicates, # ... filter out duplicate images
)
def __call__(self, pdf: bytes, page_range: range = None):
yield from tqdm(
self.pipe(pdf, page_range=page_range),
desc="Processing images from document",
unit=" images",
disable=not self.verbose,
)
def filter_duplicates(metadata: Iterable[dict[str, Any]]) -> Iterable[dict[str, Any]]:
"""Filter out duplicate images, keyed on their `position` (image coordinates) and page number, preferring the
one with `allPassed` set to True.
See RED-10765 (RM-241), "Removed redactions reappear", for why this is necessary.
"""
keep = dict()
for image_meta in metadata:
key: tuple[int, int, int, int, int] = (
image_meta["position"]["x1"],
image_meta["position"]["x2"],
image_meta["position"]["y1"],
image_meta["position"]["y2"],
image_meta["position"]["pageNumber"],
)
if key in keep:
logger.warning(
f"Duplicate image found: x1={key[0]}, x2={key[1]}, y1={key[2]}, y2={key[3]}, pageNumber={key[4]}"
)
if image_meta["filters"]["allPassed"]:
logger.warning("Keeping the current image since its allPassed flag is set to True")
keep[key] = image_meta
else:
logger.warning("Keeping the previous image since the current image has its allPassed flag set to False")
else:
keep[key] = image_meta
yield from keep.values()
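The dedup rule above can be exercised standalone (logging elided; the payload shape mirrors the `position`/`filters` keys used by `filter_duplicates`):

```python
def dedup(metadata):
    keep = {}
    for meta in metadata:
        pos = meta["position"]
        key = (pos["x1"], pos["x2"], pos["y1"], pos["y2"], pos["pageNumber"])
        if key in keep and not meta["filters"]["allPassed"]:
            continue  # previous entry wins unless the new one passed all filters
        keep[key] = meta
    return list(keep.values())

# Two copies at the same position and page; the second passed all filters
# and should win ("id" is an illustrative field, not part of the real payload).
a = {"position": {"x1": 0, "x2": 10, "y1": 0, "y2": 10, "pageNumber": 1},
     "filters": {"allPassed": False}, "id": "a"}
b = {**a, "filters": {"allPassed": True}, "id": "b"}

print([m["id"] for m in dedup([a, b])])  # ['b']
```

Using a dict keyed on the coordinate tuple preserves first-seen order while letting a later `allPassed` duplicate overwrite in place, which matches the behavior of `filter_duplicates`.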

Some files were not shown because too many files have changed in this diff.