# Keyword-Service

Service to extract keywords from a paragraph or a whole document.

<!-- TOC -->
- [Keyword-Service](#keyword-service)
    - [API](#api)
        - [REST](#rest)
        - [RabbitMQ](#rabbitmq)
    - [Service Configuration](#service-configuration)
    - [Language](#language)
    - [Usage](#usage)
        - [Run Docker Commands](#run-docker-commands)
        - [Run locally](#run-locally)
    - [Upload models to ML Flow](#upload-models-to-ml-flow)

<!-- TOC -->
## API
### REST
The service provides endpoints to extract keywords from a text and to embed a text. For details, download the
[OpenAPI schema](docs/openapi_redoc.html) and view it in a browser.
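As a quick illustration, a request could look like the following sketch; the endpoint path (`/keywords`) and the payload field (`text`) are assumptions made for this example, so check the OpenAPI schema for the actual contract.

```python
# Hypothetical request against a locally running instance. The endpoint
# path ("/keywords") and the payload shape are illustrative assumptions;
# the OpenAPI schema defines the actual contract.
import requests

response = requests.post(
    "http://127.0.0.1:8001/keywords",  # host/port as configured in settings.toml
    json={"text": "RabbitMQ is an open-source message broker written in Erlang."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```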
### RabbitMQ
The service listens to a queue and processes the messages. This method is meant to be used for extracting keywords from
whole documents. All RabbitMQ parameters, including the queue names, are set via environment variables; refer to the
service's respective Helm chart for more information.

The input message should be a JSON object with the following structure:
```json
{
  "targetFilePath": string,
  "responseFilePath": string
}
```
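For illustration, publishing such a message could look like the sketch below; the queue name and connection parameters are placeholders, since the actual values are configured via the environment variables from the Helm chart.

```python
# Illustrative producer sketch using pika. The queue name and connection
# parameters are placeholders; the actual values are configured via
# environment variables (see the Helm chart).
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="keyword-extraction", durable=True)

message = {
    "targetFilePath": "documents/sample.json",
    "responseFilePath": "responses/sample.json",
}
channel.basic_publish(exchange="", routing_key="keyword-extraction", body=json.dumps(message))
connection.close()
```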
The service downloads the file specified in `targetFilePath`. Supported data structures for the target file are:

- simplified text data (signifier key: `paragraphs`)
- structure object data (signifier key: `structureObjects`)

As a response, the service uploads a JSON-structured file (as defined in `responseFilePath`) with the result under the
`data` key. The structure of the response file is as follows:

```javascript
{
  "targetFilePath": string,
  "responseFilePath": string,
  // and possibly further fields if present in the input message
  "data": [
    {
      "keywords": Array[string],
      "paragraphId": int,
      "embedding": Array[float] // 384 dimensions
    }
  ]
}
```
**Note** that

- the `embedding` key is optional and can be omitted. The service will not calculate the embedding if the environment
  variable `MODEL__COMPUTE_EMBEDDINGS` is set to `false`.
- the service also computes the keywords for the whole document. In this case, the `paragraphId` is set to `-1`, as
  shown in the sketch below.
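A consumer of the response file can use `paragraphId` to separate paragraph-level from document-level results; a minimal sketch (the file path is a placeholder):

```python
# Minimal sketch for reading a response file; the path is a placeholder.
import json

with open("responses/sample.json") as f:
    response = json.load(f)

for entry in response["data"]:
    scope = "document" if entry["paragraphId"] == -1 else f"paragraph {entry['paragraphId']}"
    print(scope, entry["keywords"])
    # "embedding" is only present when MODEL__COMPUTE_EMBEDDINGS is true
    if "embedding" in entry:
        assert len(entry["embedding"]) == 384
```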
## Service Configuration
The service is configured via environment variables. The following variables are available:

| Variable                                   | Description                                                                          | Default |
| ------------------------------------------ | ------------------------------------------------------------------------------------ | ------- |
| LOGGING__LEVEL                             | Logging level                                                                        | INFO    |
| MODEL__MAX_KEYWORDS_PER_PARAGRAPH          | Maximum number of keywords per paragraph                                             | 5       |
| MODEL__MAX_KEYWORDS_PER_DOCUMENT           | Maximum number of keywords per document; when set to 0, no keywords are extracted    | 0       |
| MODEL__COMPUTE_EMBEDDINGS                  | Whether to compute keyword embeddings or not                                         | true    |
| MODEL__PREPROCESSING__MIN_PARAGRAPH_LENGTH | Minimum number of characters in a paragraph to be considered for keyword extraction  | 1       |
| MODEL__POSTPROCESSING__FILTER_SUBWORDS     | Whether to filter out subwords from the keywords or not                              | true    |

**NOTE** that these variables are subject to change. For the most recent configuration, refer to the service's
respective Helm chart.
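The double-underscore names suggest nested settings, e.g. as parsed by `pydantic-settings` with a `__` delimiter; the sketch below shows how such variables might map onto a config model. This is an assumption about the pattern, not the service's actual code.

```python
# Assumed configuration pattern, NOT the service's actual code: with
# pydantic-settings, a variable like MODEL__PREPROCESSING__MIN_PARAGRAPH_LENGTH
# maps onto nested models via env_nested_delimiter="__".
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class Preprocessing(BaseModel):
    min_paragraph_length: int = 1


class Postprocessing(BaseModel):
    filter_subwords: bool = True


class KeywordModelSettings(BaseModel):
    max_keywords_per_paragraph: int = 5
    max_keywords_per_document: int = 0
    compute_embeddings: bool = True
    preprocessing: Preprocessing = Preprocessing()
    postprocessing: Postprocessing = Postprocessing()


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    model: KeywordModelSettings = KeywordModelSettings()


print(Settings().model.compute_embeddings)  # picks up MODEL__COMPUTE_EMBEDDINGS
```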
## Language
Currently, there are an English, a German, and a multi-language model for keyword extraction. The models are uploaded
to MLflow, and the model to use can be set in the Dockerfile when building the container.

Example for the German model:

```dockerfile
ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
ENV AZURE_AML_MODEL_NAME="keyword-extraction-de"
ENV AZURE_AML_MODEL_VERSION="1"
```
And an example for the English model:

```dockerfile
ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
ENV AZURE_AML_MODEL_NAME="keyword-extraction-en"
ENV AZURE_AML_MODEL_VERSION="1"
```
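These variables identify a registered model in the Azure ML workspace. Conceptually, such a model can be pulled by name and version through the MLflow client, assuming the tracking URI points at the workspace (a sketch, not the service's actual download code):

```python
# Conceptual sketch, not the service's actual code: download a registered
# model by name and version via MLflow, assuming the tracking URI already
# points at the Azure ML workspace (e.g. through the azureml-mlflow plugin).
import os

import mlflow

model_name = os.environ["AZURE_AML_MODEL_NAME"]        # e.g. "keyword-extraction-de"
model_version = os.environ["AZURE_AML_MODEL_VERSION"]  # e.g. "1"

local_path = mlflow.artifacts.download_artifacts(
    artifact_uri=f"models:/{model_name}/{model_version}"
)
print("Model downloaded to", local_path)
```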
## Usage
**Two Options:**

1. REST: send text per request to the endpoint; the endpoint returns the keywords
2. Queue: the service gets the text from a queue, the model calculates the keywords, and the keywords are saved to the queue

To test the REST endpoint, you have to set up an environment and run `poetry install`
(see https://gitlab.knecon.com/knecon/research/template-python-project for details on setting up Poetry).
Then run:

```bash
python ./src/serve.py
```
You don't need to start a queue for that; just ignore the AMQP error.
Port and host are set in settings.toml.
You can use the FastAPI docs under 127.0.0.1:8001/docs to send requests to the endpoint.

You can also test the service with Docker:
#### Run Docker Commands
```bash
docker build -t ${IMAGE_NAME} -f Dockerfile --build-arg GITLAB_USER=${GITLAB_USER} \
  --build-arg GITLAB_ACCESS_TOKEN=${GITLAB_ACCESS_TOKEN} \
  --build-arg AZURE_TENANT_ID=${AZURE_TENANT_ID} \
  --build-arg AZURE_SUBSCRIPTION_ID=${AZURE_SUBSCRIPTION_ID} \
  --build-arg AZURE_CLIENT_ID=${AZURE_CLIENT_ID} \
  --build-arg AZURE_CLIENT_SECRET=${AZURE_CLIENT_SECRET} \
  --build-arg AZURE_AML_MODEL_VERSION=${AZURE_AML_MODEL_VERSION} \
  --build-arg AZURE_AML_MODEL_NAME=${AZURE_AML_MODEL_NAME} \
  --build-arg AZURE_RESOURCE_GROUP=${AZURE_RESOURCE_GROUP} \
  --build-arg AZURE_AML_WORKSPACE=${AZURE_AML_WORKSPACE} .
```

```bash
docker run --net=host -it --rm --name ${CONTAINER_NAME} ${IMAGE_NAME}
```
#### Run locally
First you need to download the model from MLflow. This can be done with the *src/ml_flow/download_model.py* script.
This script downloads a model and copies the config and model data to the specific locations, such that the model can
be loaded.

For running/testing the keyword extraction locally, you can use the *src/tests/test_process.py* script.

The model is stored and loaded via DVC; you need the connection string, which can be found under
https://portal.azure.com/#@knecon.com/resource/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourceGroups/taas-rg/providers/Microsoft.Storage/storageAccounts/taassaracer/keys
## Upload models to ML Flow
To upload the models to MLflow, you can use the following script: *src/mlflow/upload_model.py*.
For authentication, the following environment variables need to be set:
```bash
AZURE_TENANT_ID=""
AZURE_SUBSCRIPTION_ID=""
AZURE_CLIENT_ID=""
AZURE_CLIENT_SECRET=""
```
Additional settings (resource group, experiment name, etc.) can be specified in the config
(*./src/mlflow/config/azure_config.toml*).
The *upload_model.py* script has the following parameters:
```
options:
  -h, --help            show this help message and exit
  -a AZURE_CONFIG, --azure_config AZURE_CONFIG
                        Location of the configuration file for Azure (default: src/mlflow/config/azure_config.toml)
  -b BASE_CONFIG, --base_config BASE_CONFIG
                        Location of the basic training configuration (default: src/mlflow/config/settings_de.toml)
```
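For example, an upload of the German model might look like `python src/mlflow/upload_model.py --base_config src/mlflow/config/settings_de.toml`, relying on the default Azure config.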
The base config contains all information for the models used. Examples for German and
English are placed in */src/mlflow/config/*.

Note: Multi-language model tracking does not currently work. After the upload script reports an error, you have to
manually track the model [here](https://ml.azure.com/experiments?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&tid=b44be368-e4f2-4ade-a089-cd2825458048),
where you can find the run. Adhere to the naming conventions for the model name and versions;
see [here](https://ml.azure.com/model/list?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&tid=b44be368-e4f2-4ade-a089-cd2825458048).