# PyInfra

**WARNING**: Compatibility issues currently require manual actions when implementing pyinfra.
See [Protobuf](#protobuf) for more information.

1. [ About ](#about)
2. [ Configuration ](#configuration)
3. [ Queue Manager ](#queue-manager)
4. [ Module Installation ](#module-installation)
5. [ Scripts ](#scripts)
6. [ Tests ](#tests)
7. [ Protobuf ](#protobuf)

## About

Shared library for the research team, containing code related to infrastructure and communication with other services.
It offers a simple interface for processing data and sending responses via AMQP, monitoring via Prometheus, and storage
access via S3 or Azure. It also exports traces via OpenTelemetry for queue messages and webserver requests.

To start, see the [complete example](pyinfra/examples.py), which shows how to use all features of the service and can be
imported and used directly for default research service pipelines (data ID in message, download data from storage,
upload result), while offering Prometheus monitoring, /health and /ready endpoints and multi-tenancy support.

## Configuration

Configuration is done via `Dynaconf`. This means that you can use environment variables, a `.env` file or `.toml`
file(s) to configure the service. You can also combine these methods. The precedence is
`environment variables > .env > .toml`. It is recommended to load settings with the provided
[`load_settings`](pyinfra/config/loader.py) function, which you can combine with the provided
[`parse_args`](pyinfra/config/loader.py) function. This allows you to load settings from a `.toml` file or a folder with
`.toml` files and override them with environment variables.

The following table shows all necessary settings. You can find a preconfigured settings file for this service in
Bitbucket. These are the complete settings; you only need all of them if you use all features of the service as
described in the [complete example](pyinfra/examples.py).

| Environment Variable                  | Internal / .toml Name              | Description                                                                                                                                                                                                                               |
|---------------------------------------|------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LOGGING__LEVEL                        | logging.level                      | Log level                                                                                                                                                                                                                                 |
| METRICS__PROMETHEUS__ENABLED          | metrics.prometheus.enabled         | Enable Prometheus metrics collection                                                                                                                                                                                                      |
| METRICS__PROMETHEUS__PREFIX           | metrics.prometheus.prefix          | Prefix for Prometheus metrics (e.g. {product}-{service})                                                                                                                                                                                  |
| WEBSERVER__HOST                       | webserver.host                     | Host of the webserver (offering e.g. /prometheus, /ready and /health endpoints)                                                                                                                                                           |
| WEBSERVER__PORT                       | webserver.port                     | Port of the webserver                                                                                                                                                                                                                     |
| RABBITMQ__HOST                        | rabbitmq.host                      | Host of the RabbitMQ server                                                                                                                                                                                                               |
| RABBITMQ__PORT                        | rabbitmq.port                      | Port of the RabbitMQ server                                                                                                                                                                                                               |
| RABBITMQ__USERNAME                    | rabbitmq.username                  | Username for the RabbitMQ server                                                                                                                                                                                                          |
| RABBITMQ__PASSWORD                    | rabbitmq.password                  | Password for the RabbitMQ server                                                                                                                                                                                                          |
| RABBITMQ__HEARTBEAT                   | rabbitmq.heartbeat                 | Heartbeat for the RabbitMQ server                                                                                                                                                                                                         |
| RABBITMQ__CONNECTION_SLEEP            | rabbitmq.connection_sleep          | Sleep interval during message processing. Has to be a divisor of the heartbeat and should not be too large, since queue interactions (like receiving new messages) only happen in these intervals. This is also the minimum time the service needs to process a message. |
| RABBITMQ__INPUT_QUEUE                 | rabbitmq.input_queue               | Name of the input queue                                                                                                                                                                                                                   |
| RABBITMQ__OUTPUT_QUEUE                | rabbitmq.output_queue              | Name of the output queue                                                                                                                                                                                                                  |
| RABBITMQ__DEAD_LETTER_QUEUE           | rabbitmq.dead_letter_queue         | Name of the dead letter queue                                                                                                                                                                                                             |
| STORAGE__BACKEND                      | storage.backend                    | Storage backend to use (currently only "s3" and "azure" are supported)                                                                                                                                                                    |
| STORAGE__S3__BUCKET                   | storage.s3.bucket                  | Name of the S3 bucket                                                                                                                                                                                                                     |
| STORAGE__S3__ENDPOINT                 | storage.s3.endpoint                | Endpoint of the S3 server                                                                                                                                                                                                                 |
| STORAGE__S3__KEY                      | storage.s3.key                     | Access key for the S3 server                                                                                                                                                                                                              |
| STORAGE__S3__SECRET                   | storage.s3.secret                  | Secret key for the S3 server                                                                                                                                                                                                              |
| STORAGE__S3__REGION                   | storage.s3.region                  | Region of the S3 server                                                                                                                                                                                                                   |
| STORAGE__AZURE__CONTAINER             | storage.azure.container_name       | Name of the Azure container                                                                                                                                                                                                               |
| STORAGE__AZURE__CONNECTION_STRING     | storage.azure.connection_string    | Connection string for the Azure server                                                                                                                                                                                                    |
| STORAGE__TENANT_SERVER__PUBLIC_KEY    | storage.tenant_server.public_key   | Public key of the tenant server                                                                                                                                                                                                           |
| STORAGE__TENANT_SERVER__ENDPOINT      | storage.tenant_server.endpoint     | Endpoint of the tenant server                                                                                                                                                                                                             |
| TRACING__OPENTELEMETRY__ENDPOINT      | tracing.opentelemetry.endpoint     | Endpoint to which OpenTelemetry traces are exported                                                                                                                                                                                       |
| TRACING__OPENTELEMETRY__SERVICE_NAME  | tracing.opentelemetry.service_name | Name of the service as displayed in the collected traces                                                                                                                                                                                  |
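
The `Internal / .toml Name` column corresponds to nested tables in a `.toml` settings file, and every setting can be
overridden with the matching double-underscore environment variable. A minimal illustrative snippet (all values below
are placeholders, not recommended defaults):

```toml
[logging]
level = "INFO"

[metrics.prometheus]
enabled = true
prefix = "product-service"

[rabbitmq]
host = "localhost"
port = 5672
username = "guest"
password = "guest"
input_queue = "service-input"
output_queue = "service-output"
dead_letter_queue = "service-dead-letter"

[storage]
backend = "s3"

[storage.s3]
bucket = "my-bucket"
endpoint = "http://localhost:9000"
```

Any of these values can then be overridden at runtime, e.g. with `RABBITMQ__HOST=rabbitmq-prod`.
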
### OpenTelemetry

OpenTelemetry (via its Python SDK) is set up to be as unobtrusive as possible; for typical use cases it can be
configured from environment variables, without additional work in the microservice app, although additional
configuration is possible.

`TRACING__OPENTELEMETRY__ENDPOINT` should typically be set
to `http://otel-collector-opentelemetry-collector.otel-collector:4318/v1/traces`.
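
For example, the tracing settings can be supplied entirely through the environment (the service name below is a
placeholder):

```bash
export TRACING__OPENTELEMETRY__ENDPOINT="http://otel-collector-opentelemetry-collector.otel-collector:4318/v1/traces"
export TRACING__OPENTELEMETRY__SERVICE_NAME="my-research-service"
```
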
## Queue Manager

The queue manager is responsible for consuming messages from the input queue, processing them and sending the response
to the output queue. The default callback also downloads data from the storage and uploads the result to the storage.
The response message does not contain the data itself, but the identifiers from the input message (including headers
beginning with "X-").
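
As a rough illustration (the exact field names mirror whichever input format is used, see
[AMQP Input Message](#amqp-input-message) below), a response for the path-based format would echo the identifiers
rather than the processed data:

```json
{
  "targetFilePath": "dossier-1/file-1.pdf",
  "responseFilePath": "dossier-1/file-1-result.json"
}
```
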
### Standalone Usage

```python
from pyinfra.queue.manager import QueueManager
from pyinfra.queue.callback import make_download_process_upload_callback, DataProcessor
from pyinfra.config.loader import load_settings

settings = load_settings("path/to/settings")

# Your processing function: it receives a dict (JSON) or bytes (PDF) as input
# and must return a JSON-serializable object.
processing_function: DataProcessor

queue_manager = QueueManager(settings)
callback = make_download_process_upload_callback(processing_function, settings)
queue_manager.start_consuming(callback)
```
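
For illustration, a trivial `processing_function` that satisfies the documented contract (dict or bytes in,
JSON-serializable object out) might look like this (a sketch, not part of pyinfra):

```python
def processing_function(data):
    """Toy processor: report the size of the input instead of doing real work."""
    if isinstance(data, dict):
        return {"num_keys": len(data)}
    return {"num_bytes": len(data)}
```
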
### Usage in a Service

This is the recommended way to use the module. It includes the webserver, Prometheus metrics and health endpoints.
Custom endpoints can be added by adding a new route to the `app` object beforehand (see the sketch after the example
below). Settings are loaded from files specified as CLI arguments (e.g. `--settings-path path/to/settings.toml`). The
values can also be set or overridden via environment variables (e.g. `LOGGING__LEVEL=DEBUG`).

The callback can be replaced with a custom one, for example if the data to process is contained in the message itself
and not on the storage.

```python
from pyinfra.config.loader import load_settings, parse_settings_path
from pyinfra.examples import start_standard_queue_consumer
from pyinfra.queue.callback import make_download_process_upload_callback, DataProcessor

processing_function: DataProcessor

arguments = parse_settings_path()
settings = load_settings(arguments.settings_path)

callback = make_download_process_upload_callback(processing_function, settings)
start_standard_queue_consumer(callback, settings)  # optionally also pass a FastAPI app object with preconfigured routes
```
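
As mentioned above, custom endpoints can be added to a FastAPI `app` object before starting the consumer. A hedged
sketch (how the app is passed to `start_standard_queue_consumer` may differ; treat the keyword argument below as
illustrative):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/version")
def version() -> dict:
    # Hypothetical custom endpoint served alongside /health, /ready and /prometheus.
    return {"version": "x.x.x"}

start_standard_queue_consumer(callback, settings, app=app)
```
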
### AMQP Input Message

Either use the legacy format with dossierId and fileId as strings or the new format where absolute paths are used.
All headers beginning with "X-" are forwarded to the message processor and returned in the response message (e.g.
"X-TENANT-ID" is used to acquire storage information for the tenant).

```json
{
  "targetFilePath": "",
  "responseFilePath": ""
}
```

or

```json
{
  "dossierId": "",
  "fileId": "",
  "targetFileExtension": "",
  "responseFileExtension": ""
}
```
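
For reference, publishing such a message from Python could look like the following sketch. It assumes `pika` as the
AMQP client and placeholder connection values; it is not a pyinfra API, just a plain RabbitMQ publish carrying an
"X-" header:

```python
import json
import pika

# Placeholder connection details; use the values from your settings.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost", port=5672, credentials=pika.PlainCredentials("guest", "guest"))
)
channel = connection.channel()

message = {"targetFilePath": "dossier-1/file-1.pdf", "responseFilePath": "dossier-1/file-1-result.json"}
channel.basic_publish(
    exchange="",
    routing_key="service-input",  # the configured rabbitmq.input_queue
    body=json.dumps(message),
    properties=pika.BasicProperties(headers={"X-TENANT-ID": "tenant-1"}),
)
connection.close()
```
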
## Module Installation

Add the respective version of the pyinfra package to your `pyproject.toml` file. Make sure to add our GitLab registry
as a source. For now, all internal packages used by pyinfra also have to be added to the `pyproject.toml` file (namely
kn-utils). Execute `poetry lock` and `poetry install` to install the packages.
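
For example, from the root of your service repository:

```bash
poetry lock
poetry install
```
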
You can look up the latest version of the package in
the [GitLab registry](https://gitlab.knecon.com/knecon/research/pyinfra/-/packages).
For the used versions of internal dependencies, please refer to the [pyproject.toml](pyproject.toml) file.

```toml
[tool.poetry.dependencies]
pyinfra = { version = "x.x.x", source = "gitlab-research" }
kn-utils = { version = "x.x.x", source = "gitlab-research" }

[[tool.poetry.source]]
name = "gitlab-research"
url = "https://gitlab.knecon.com/api/v4/groups/19/-/packages/pypi/simple"
priority = "explicit"
```

## Scripts

### Run pyinfra locally

**Shell 1**: Start MinIO and RabbitMQ containers

```bash
cd tests && docker compose up
```

**Shell 2**: Start pyinfra with callback mock

```bash
python scripts/start_pyinfra.py
```

**Shell 3**: Upload dummy content to storage and publish a message

```bash
python scripts/send_request.py
```

## Tests

Tests require running MinIO and RabbitMQ containers, meaning you have to run `docker compose up` in the tests folder
before running the tests.
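
For example (assuming the tests are run with `pytest`; adjust to your setup):

```bash
cd tests && docker compose up -d
cd .. && pytest
```
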
## Protobuf

### Compatibility Issue Workaround

**Note**: As of 2024-07-16, the currently used `opentelemetry-exporter-otlp-proto-http` version `1.25.0` requires
a `protobuf` version < `5.x.x` and is therefore not compatible with version `5.27.0`, which we need in order to use the
compiled schemata. Until this is resolved, all services implementing `pyinfra` have to install the required version via
pip over the existing one:

```bash
pip install protobuf==5.27.2
```

Do this in your development environment and add it to the Dockerfile after the `poetry install` command.
We have to fix this as soon as possible, ideally by `opentelemetry-exporter-otlp-proto-http` being upgraded to work
with `protobuf` 5.x, but for now this is the only workaround with reasonable effort. The `opentelemetry` functionality
is not affected by this workaround.
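
A sketch of the corresponding Dockerfile lines (assuming a Poetry-based image; the surrounding instructions are
placeholders):

```dockerfile
# ... base image, dependency files, etc.
RUN poetry install
# Override protobuf after poetry install, see the compatibility note above.
RUN pip install protobuf==5.27.2
```
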
### Install Protobuf Compiler

**Linux**

1. Download the latest version of the protobuf compiler from https://github.com/protocolbuffers/protobuf/releases
2. Extract the files under `$HOME/.local` or another directory of your choice:
   ```bash
   unzip protoc-<version>-linux-x86_64.zip -d $HOME/.local
   ```
3. Ensure that the `bin` directory is in your `PATH` by adding the following line to your `.bashrc` or `.zshrc`:
   ```bash
   export PATH="$PATH:$HOME/.local/bin"
   ```

### Compile Protobuf Files

1. Ensure that the protobuf compiler is installed on your system. You can check this by running:
   ```bash
   protoc --version
   ```
2. Compile the proto files:
   ```bash
   protoc --proto_path=./config/proto --python_out=./pyinfra/proto ./config/proto/*.proto
   ```
3. Manually adjust the import statements in the generated files to match the package structure, e.g.
   `import EntryData_pb2 as EntryData__pb2` -> `import pyinfra.proto.EntryData_pb2 as EntryData__pb2`.
   This does not work automatically because the generated files are not in the same directory as the proto files
   (see the sketch below for one way to script this).
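
One way to script the adjustment from step 3 (a sketch; adjust the paths and module names to the proto files you
actually generate):

```bash
# Prefix bare "*_pb2" imports in the generated modules with the pyinfra.proto package.
sed -i 's/^import \(.*_pb2\) as /import pyinfra.proto.\1 as /' ./pyinfra/proto/*_pb2.py
```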