# PyInfra
1. [ About ](#about)
2. [ Configuration ](#configuration)
3. [ Response Format ](#response-format)
4. [ Usage & API ](#usage--api)
5. [ Scripts ](#scripts)
6. [ Tests ](#tests)
## About
Common module providing the infrastructure to deploy research projects.
The infrastructure expects to be deployed in the same pod / local environment as the analysis container and handles all outbound communication.
## Configuration
The configuration is located in `/config.yaml`. All relevant variables can be overridden by exporting environment variables.
| Environment Variable | Default | Description |
|-------------------------------|----------------------------------|--------------------------------------------------------------------------|
| LOGGING_LEVEL_ROOT | "DEBUG" | Logging level for service logger |
| MONITORING_ENABLED | True | Enables Prometheus monitoring |
| PROMETHEUS_METRIC_PREFIX | "redactmanager_research_service" | Prometheus metric prefix, per convention '{product_name}_{service name}' |
| PROMETHEUS_HOST | "127.0.0.1" | Prometheus webserver address |
| PROMETHEUS_PORT | 8080 | Prometheus webserver port |
| RABBITMQ_HOST | "localhost" | RabbitMQ host address |
| RABBITMQ_PORT | "5672" | RabbitMQ host port |
| RABBITMQ_USERNAME | "user" | RabbitMQ username |
| RABBITMQ_PASSWORD | "bitnami" | RabbitMQ password |
| RABBITMQ_HEARTBEAT | 60 | Controls AMQP heartbeat timeout in seconds |
| RABBITMQ_CONNECTION_SLEEP | 5 | Controls AMQP connection sleep timer in seconds |
| REQUEST_QUEUE | "request_queue" | Requests to service |
| RESPONSE_QUEUE | "response_queue" | Responses by service |
| DEAD_LETTER_QUEUE | "dead_letter_queue" | Messages that failed to process |
| STORAGE_BACKEND | "s3" | The type of storage to use {s3, azure} |
| STORAGE_BUCKET | "redaction" | The bucket / container to pull files specified in queue requests from |
| STORAGE_ENDPOINT | "http://127.0.0.1:9000" | Endpoint for s3 storage |
| STORAGE_KEY | "root" | User for s3 storage |
| STORAGE_SECRET | "password" | Password for s3 storage |
| STORAGE_AZURECONNECTIONSTRING | "DefaultEndpointsProtocol=..." | Connection string for Azure storage |
| STORAGE_AZURECONTAINERNAME | "redaction" | AKS container |
| WRITE_CONSUMER_TOKEN          | "False"                          | Whether the service should write a consumer token to a file              |
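Overriding a default is a matter of exporting the variable before the configuration is loaded. A minimal sketch using only the standard library; the variable names are exactly those from the table, while the values are illustrative:

```python
import os

# Export overrides before the service reads its configuration
# (values here are examples; adjust to your environment).
os.environ["RABBITMQ_HOST"] = "rabbitmq.internal"
os.environ["RABBITMQ_PORT"] = "5672"

# Defaults from the table apply when a variable is not exported:
rabbitmq_host = os.environ.get("RABBITMQ_HOST", "localhost")
rabbitmq_port = int(os.environ.get("RABBITMQ_PORT", "5672"))
```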
## Response Format
### Expected AMQP input message:
Either use the legacy format, with `dossierId` and `fileId` as strings, or the new format, which uses absolute paths.
A tenant ID can optionally be provided in the message header (key: `"X-TENANT-ID"`).
```json
{
"targetFilePath": "",
"responseFilePath": ""
}
```
or
```json
{
"dossierId": "",
"fileId": "",
"targetFileExtension": "",
"responseFileExtension": ""
}
```
Optionally, the input message can contain a field with the key `"operations"`.
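As a sketch, either request body can be assembled like this. The `build_request` helper is hypothetical; only the field names mirror the message schemas above:

```python
import json


def build_request(*, target_file_path=None, response_file_path=None,
                  dossier_id=None, file_id=None,
                  target_file_extension=None, response_file_extension=None):
    """Serialize a request body in the new (path-based) or legacy format."""
    if target_file_path is not None:
        body = {
            "targetFilePath": target_file_path,
            "responseFilePath": response_file_path,
        }
    else:  # legacy format with dossier/file identifiers
        body = {
            "dossierId": dossier_id,
            "fileId": file_id,
            "targetFileExtension": target_file_extension,
            "responseFileExtension": response_file_extension,
        }
    return json.dumps(body)


payload = build_request(target_file_path="a/b.pdf",
                        response_file_path="a/b.json")
```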
### AMQP output message:
```json
{
"targetFilePath": "",
"responseFilePath": ""
}
```
or
```json
{
"dossierId": "",
"fileId": ""
}
```
## Usage & API
### Setup
Add the desired version of the pyinfra package to your `pyproject.toml` file and make sure to add our GitLab registry as a source.
For now, all internal packages used by pyinfra also have to be added to the `pyproject.toml` file.
Execute `poetry lock` and `poetry install` to install the packages.
You can look up the latest version of the package in the [GitLab registry](https://gitlab.knecon.com/knecon/research/pyinfra/-/packages).
For the versions of internal dependencies in use, please refer to the [pyproject.toml](pyproject.toml) file.
```toml
[tool.poetry.dependencies]
pyinfra = { version = "x.x.x", source = "gitlab-research" }
kn-utils = { version = "x.x.x", source = "gitlab-research" }

[[tool.poetry.source]]
name = "gitlab-research"
url = "https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi/simple"
priority = "explicit"
```
### API
```python
from pyinfra import config
from pyinfra.payload_processing.processor import make_payload_processor
from pyinfra.queue.queue_manager import QueueManager
pyinfra_config = config.get_config()
process_payload = make_payload_processor(process_data, config=pyinfra_config)
queue_manager = QueueManager(pyinfra_config)
queue_manager.start_consuming(process_payload)
```
`process_data` should expect a dict (JSON) or bytes (PDF) as input and return a list of results.
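An illustrative `process_data` callback might look like the following. The name and one-argument signature follow the usage above; the body is a placeholder, not the actual analysis logic:

```python
def process_data(payload):
    """Toy callback: dispatch on the payload type and return a result list."""
    if isinstance(payload, bytes):
        # e.g. a PDF document pulled from storage
        return [{"size": len(payload)}]
    # otherwise a dict parsed from a JSON message
    return [{"keys": sorted(payload)}]
```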
## Scripts
### Run pyinfra locally
**Shell 1**: Start minio and rabbitmq containers
```bash
$ cd tests && docker-compose up
```
**Shell 2**: Start pyinfra with callback mock
```bash
$ python scripts/start_pyinfra.py
```
**Shell 3**: Upload dummy content on storage and publish message
```bash
$ python scripts/send_request.py
```
## Tests
Running all tests takes longer than you are probably used to because, among other things, the required startup times
are quite high for docker-compose dependent tests. The tests are therefore split into two parts: the first contains all
tests that do not require docker-compose, and the second contains all tests that do.
By default, only the first part is executed, but when releasing a new version, all tests should be executed.