# PyInfra
1. [ About ](#about)
2. [ Configuration ](#configuration)
3. [ Response Format ](#response-format)
4. [ Usage & API ](#usage--api)
5. [ Scripts ](#scripts)
6. [ Tests ](#tests)
## About
Common module providing the infrastructure to deploy research projects.
The infrastructure expects to be deployed in the same pod / local environment as the analysis container and handles all outbound communication.
## Configuration
The configuration is located in `/config.yaml`. All relevant variables can be overridden by exporting environment variables.
| Environment Variable | Default | Description |
|-------------------------------|----------------------------------|--------------------------------------------------------------------------|
| LOGGING_LEVEL_ROOT | "DEBUG" | Logging level for service logger |
| MONITORING_ENABLED | True | Enables Prometheus monitoring |
| PROMETHEUS_METRIC_PREFIX | "redactmanager_research_service" | Prometheus metric prefix, per convention '{product_name}_{service name}' |
| PROMETHEUS_HOST | "127.0.0.1" | Prometheus webserver address |
| PROMETHEUS_PORT | 8080 | Prometheus webserver port |
| RABBITMQ_HOST | "localhost" | RabbitMQ host address |
| RABBITMQ_PORT | "5672" | RabbitMQ host port |
| RABBITMQ_USERNAME | "user" | RabbitMQ username |
| RABBITMQ_PASSWORD | "bitnami" | RabbitMQ password |
| RABBITMQ_HEARTBEAT | 60 | Controls AMQP heartbeat timeout in seconds |
| RABBITMQ_CONNECTION_SLEEP | 5 | Controls AMQP connection sleep timer in seconds |
| REQUEST_QUEUE | "request_queue" | Requests to service |
| RESPONSE_QUEUE | "response_queue" | Responses by service |
| DEAD_LETTER_QUEUE | "dead_letter_queue" | Messages that failed to process |
| STORAGE_BACKEND | "s3" | The type of storage to use {s3, azure} |
| STORAGE_BUCKET | "redaction" | The bucket / container to pull files specified in queue requests from |
| STORAGE_ENDPOINT | "http://127.0.0.1:9000" | Endpoint for s3 storage |
| STORAGE_KEY | "root" | User for s3 storage |
| STORAGE_SECRET | "password" | Password for s3 storage |
| STORAGE_AZURECONNECTIONSTRING | "DefaultEndpointsProtocol=..." | Connection string for Azure storage |
| STORAGE_AZURECONTAINERNAME | "redaction" | AKS container |
| WRITE_CONSUMER_TOKEN          | "False"                          | Whether the service should write a consumer token to a file              |
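Overriding a default is a matter of exporting the variable before the configuration is loaded. A minimal sketch using only the standard library; the variable names are exactly those from the table, while the values are illustrative:

```python
import os

# Export overrides before the service reads its configuration
# (values here are examples; adjust to your environment).
os.environ["RABBITMQ_HOST"] = "rabbitmq.internal"
os.environ["RABBITMQ_PORT"] = "5672"

# Defaults from the table apply when a variable is not exported:
rabbitmq_host = os.environ.get("RABBITMQ_HOST", "localhost")
rabbitmq_port = int(os.environ.get("RABBITMQ_PORT", "5672"))
```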
## Response Format
### Expected AMQP input message:
Either use the legacy format, with `dossierId` and `fileId` as strings, or the new format, which uses absolute paths.
A tenant ID can optionally be provided in the message header (key: `"X-TENANT-ID"`).
```json
{
"targetFilePath": "",
"responseFilePath": ""
}
```
or
```json
{
"dossierId": "",
"fileId": "",
"targetFileExtension": "",
"responseFileExtension": ""
}
```
Optionally, the input message can contain a field with the key `"operations"`.
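As a sketch, either request body can be assembled like this. The `build_request` helper is hypothetical; only the field names mirror the message schemas above:

```python
import json


def build_request(*, target_file_path=None, response_file_path=None,
                  dossier_id=None, file_id=None,
                  target_file_extension=None, response_file_extension=None):
    """Serialize a request body in the new (path-based) or legacy format."""
    if target_file_path is not None:
        body = {
            "targetFilePath": target_file_path,
            "responseFilePath": response_file_path,
        }
    else:  # legacy format with dossier/file identifiers
        body = {
            "dossierId": dossier_id,
            "fileId": file_id,
            "targetFileExtension": target_file_extension,
            "responseFileExtension": response_file_extension,
        }
    return json.dumps(body)


payload = build_request(target_file_path="a/b.pdf",
                        response_file_path="a/b.json")
```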
### AMQP output message:
```json
{
"targetFilePath": "",
"responseFilePath": ""
}
```
or
```json
{
"dossierId": "",
"fileId": ""
}
```
## Usage & API
### Setup
Add the desired version of the pyinfra package to your `pyproject.toml` file and make sure to add our GitLab registry as a source.
For now, all internal packages used by pyinfra also have to be added to the `pyproject.toml` file.
Execute `poetry lock` and `poetry install` to install the packages.
You can look up the latest version of the package in the [GitLab registry](https://gitlab.knecon.com/knecon/research/pyinfra/-/packages).
For the versions of internal dependencies in use, please refer to the [pyproject.toml](pyproject.toml) file.
```toml
[tool.poetry.dependencies]
pyinfra = { version = "x.x.x", source = "gitlab-research" }
kn-utils = { version = "x.x.x", source = "gitlab-research" }

[[tool.poetry.source]]
name = "gitlab-research"
url = "https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi/simple"
priority = "explicit"
```
### API
```python
from pyinfra import config
from pyinfra.payload_processing.processor import make_payload_processor
from pyinfra.queue.queue_manager import QueueManager
pyinfra_config = config.get_config()
process_payload = make_payload_processor(process_data, config=pyinfra_config)
queue_manager = QueueManager(pyinfra_config)
queue_manager.start_consuming(process_payload)
```
`process_data` should expect a dict (JSON) or bytes (PDF) as input and return a list of results.
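An illustrative `process_data` callback might look like the following. The name and one-argument signature follow the usage above; the body is a placeholder, not the actual analysis logic:

```python
def process_data(payload):
    """Toy callback: dispatch on the payload type and return a result list."""
    if isinstance(payload, bytes):
        # e.g. a PDF document pulled from storage
        return [{"size": len(payload)}]
    # otherwise a dict parsed from a JSON message
    return [{"keys": sorted(payload)}]
```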
## Scripts
### Run pyinfra locally
**Shell 1**: Start minio and rabbitmq containers
```bash
$ cd tests && docker-compose up
```
**Shell 2**: Start pyinfra with callback mock
```bash
$ python scripts/start_pyinfra.py
```
**Shell 3**: Upload dummy content on storage and publish message
```bash
$ python scripts/send_request.py
```
## Tests
Running all tests takes longer than you are probably used to because, among other things, the required startup times
are quite high for docker-compose dependent tests. The tests are therefore split into two parts: the first contains all
tests that do not require docker-compose, and the second contains all tests that do.
By default, only the first part is executed, but when releasing a new version, all tests should be executed.