
# Keyword-Service
Service that extracts keywords from a paragraph or a whole document.
<!-- TOC -->
- [Keyword-Service](#keyword-service)
  - [API](#api)
    - [REST](#rest)
    - [RabbitMQ](#rabbitmq)
  - [Service Configuration](#service-configuration)
  - [Language](#language)
  - [Usage](#usage)
    - [Run Docker Commands](#run-docker-commands)
    - [Run locally](#run-locally)
  - [Upload models to ML Flow](#upload-models-to-ml-flow)
<!-- TOC -->
## API
### REST
The service provides endpoints to extract keywords from a text and to embed a text. For details, download the
[OpenAPI schema](docs/openapi_redoc.html) and view it in a browser.
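As a quick smoke test, a call could look like the sketch below. This is a minimal sketch only: the endpoint path and payload shape are assumptions, the authoritative contract is the OpenAPI schema, and host/port follow the local setup described under [Usage](#usage).
```python
import requests

# Hypothetical endpoint path and payload -- consult the OpenAPI schema for the
# actual contract.
response = requests.post(
    "http://127.0.0.1:8001/keywords",
    json={"text": "Keyword extraction condenses a paragraph into a few terms."},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expected: the extracted keywords
```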
### RabbitMQ
The service listens to a queue and processes the messages. This method is meant for extracting keywords from
whole documents. All RabbitMQ parameters, including the queue names, are set via environment variables; refer to the
service's Helm chart for more information.
The input message should be a JSON object with the following structure:
```json
{
"targetFilePath": string,
"responseFilePath": string
}
```
The service downloads the file specified in `targetFilePath`. Supported data structures for the target file are:
- simplified text data (signifier key: `paragraphs`)
- structured object data (signifier key: `structureObjects`)
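For illustration, such a message could be published with `pika`. This is a sketch under assumptions: the connection parameters and queue name below are placeholders; the real values come from the environment variables in the Helm chart.
```python
import json
import os

import pika

# Placeholder connection parameters and queue name -- the real values are
# configured via environment variables (see the Helm chart).
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host=os.environ.get("RABBITMQ_HOST", "localhost"))
)
channel = connection.channel()

message = {
    "targetFilePath": "documents/report-123.json",
    "responseFilePath": "responses/report-123.json",
}
channel.basic_publish(
    exchange="",
    routing_key=os.environ.get("KEYWORD_QUEUE", "keyword-extraction"),
    body=json.dumps(message).encode("utf-8"),
)
connection.close()
```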
As a response, the service uploads a JSON-structured file (as defined in `responseFilePath`) with the result under the
`data` key. The structure of the response file is as follows:
```javascript
{
  "targetFilePath": string,
  "responseFilePath": string,
  // and possibly further fields if present in the input message
  "data": [
    {
      "keywords": Array[string],
      "paragraphId": int,
      "embedding": Array[float] // 384 dimensions
    }
  ]
}
```
**Note** that
- the `embedding` key is optional: the service does not calculate embeddings if the environment
  variable `MODEL__COMPUTE_EMBEDDINGS` is set to `false`.
- the service also computes keywords for the whole document; in this case, the `paragraphId` is set to `-1`.
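To illustrate the response shape, here is a minimal sketch that reads a downloaded response file (the file path is a placeholder):
```python
import json

with open("responses/report-123.json", encoding="utf-8") as fh:
    response = json.load(fh)

for entry in response["data"]:
    if entry["paragraphId"] == -1:
        print("document keywords:", entry["keywords"])
    else:
        print(f"paragraph {entry['paragraphId']}:", entry["keywords"])
    # "embedding" is only present when MODEL__COMPUTE_EMBEDDINGS is true
    if "embedding" in entry:
        assert len(entry["embedding"]) == 384
```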
## Service Configuration
The service is configured via environment variables. The following variables are available:
| Variable | Description | Default |
| ------------------------------------------ | ----------------------------------------------------------------------------------- | ------- |
| LOGGING__LEVEL | Logging level | INFO |
| MODEL__MAX_KEYWORDS_PER_PARAGRAPH | Maximum number of keywords per paragraph | 5 |
| MODEL__MAX_KEYWORDS_PER_DOCUMENT           | Maximum number of keywords per document; when set to 0, no keywords are extracted    | 0       |
| MODEL__COMPUTE_EMBEDDINGS | Whether to compute keyword embeddings or not | true |
| MODEL__PREPROCESSING__MIN_PARAGRAPH_LENGTH | Minimum number of characters in a paragraph to be considered for keyword extraction | 1 |
| MODEL__POSTPROCESSING__FILTER_SUBWORDS | Whether to filter out subwords from the keywords or not | true |
**NOTE** that these variables are subject to change. For the most recent configuration, refer to the service's
Helm chart.
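The double underscores in the variable names suggest a nested configuration layout. Purely as an illustration (an assumption about the layout, not necessarily how the service parses its settings), such variables map naturally onto `pydantic-settings` with a nested env delimiter:
```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class Logging(BaseModel):
    level: str = "INFO"


class Preprocessing(BaseModel):
    min_paragraph_length: int = 1


class Postprocessing(BaseModel):
    filter_subwords: bool = True


class ModelSettings(BaseModel):
    max_keywords_per_paragraph: int = 5
    max_keywords_per_document: int = 0  # 0 disables document-level keywords
    compute_embeddings: bool = True
    preprocessing: Preprocessing = Preprocessing()
    postprocessing: Postprocessing = Postprocessing()


class Settings(BaseSettings):
    # "__" maps e.g. MODEL__COMPUTE_EMBEDDINGS onto settings.model.compute_embeddings
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    logging: Logging = Logging()
    model: ModelSettings = ModelSettings()
```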
## Language
Currently, there is an English, a German, and a multi-language model for keyword extraction. The models are uploaded
to MLflow and can be set in the Dockerfile when building the container.
Example for the German model:
```dockerfile
ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
ENV AZURE_AML_MODEL_NAME="keyword-extraction-de"
ENV AZURE_AML_MODEL_VERSION="1"
```
Example for the English model:
```dockerfile
ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
ENV AZURE_AML_MODEL_NAME="keyword-extraction-en"
ENV AZURE_AML_MODEL_VERSION="1"
```
## Usage
**Two options:**
1. REST: send the text to the endpoint in a request; the endpoint returns the keywords.
2. Queue: the service reads a message from the queue, the model calculates the keywords, and the result is uploaded to the response file (see [RabbitMQ](#rabbitmq)).
To test the REST endpoint, set up an environment and run `poetry install`
(see https://gitlab.knecon.com/knecon/research/template-python-project for details on setting up Poetry).
Then run
```
python ./src/serve.py
```
You don't need to start a queue for this; just ignore the AMQP error.
Port and host are set in *settings.toml*.
You can use the FastAPI docs at http://127.0.0.1:8001/docs to send requests to the endpoint.
You can also test the service with Docker:
#### Run Docker Commands
```bash
docker build -t ${IMAGE_NAME} -f Dockerfile --build-arg GITLAB_USER=${GITLAB_USER} \
--build-arg GITLAB_ACCESS_TOKEN=${GITLAB_ACCESS_TOKEN} \
--build-arg AZURE_TENANT_ID=${AZURE_TENANT_ID} \
--build-arg AZURE_SUBSCRIPTION_ID=${AZURE_SUBSCRIPTION_ID} \
--build-arg AZURE_CLIENT_ID=${AZURE_CLIENT_ID} \
--build-arg AZURE_CLIENT_SECRET=${AZURE_CLIENT_SECRET} \
--build-arg AZURE_AML_MODEL_VERSION=${AZURE_AML_MODEL_VERSION} \
--build-arg AZURE_AML_MODEL_NAME=${AZURE_AML_MODEL_NAME} \
--build-arg AZURE_RESOURCE_GROUP=${AZURE_RESOURCE_GROUP} \
--build-arg AZURE_AML_WORKSPACE=${AZURE_AML_WORKSPACE} .
```
```bash
docker run --net=host -it --rm --name ${CONTAINER_NAME} ${IMAGE_NAME}
```
#### Run locally
First, you need to download the model from MLflow. This can be done with the *src/ml_flow/download_model.py* script.
The script downloads a model and copies the config and model data to the specific locations so that the model can
be loaded.
To run or test the keyword extraction locally, you can use the *src/tests/test_process.py* script.
The model is stored and loaded via DVC; you need the connection string from
https://portal.azure.com/#@knecon.com/resource/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourceGroups/taas-rg/providers/Microsoft.Storage/storageAccounts/taassaracer/keys
## Upload models to ML Flow
To upload the models to MLflow, you can use the following script: *src/mlflow/upload_model.py*.
For authentication, the following environment variables need to be set:
```
AZURE_TENANT_ID=""
AZURE_SUBSCRIPTION_ID=""
AZURE_CLIENT_ID=""
AZURE_CLIENT_SECRET=""
```
Additional settings (resource group, experiment name, etc.) can be specified in the config
(*./src/mlflow/config/azure_config.toml*).
The *upload_model.py* script has the following parameters:
```
options:
  -h, --help            show this help message and exit
  -a AZURE_CONFIG, --azure_config AZURE_CONFIG
                        Location of the configuration file for Azure (default: src/mlflow/config/azure_config.toml)
  -b BASE_CONFIG, --base_config BASE_CONFIG
                        Location of the basic training configuration (default: src/mlflow/config/settings_de.toml)
```
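For example, to upload the German model with the default config locations:
```
python src/mlflow/upload_model.py \
    --azure_config src/mlflow/config/azure_config.toml \
    --base_config src/mlflow/config/settings_de.toml
```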
The base config contains all information about the models used. Examples for German and
English are located in */src/mlflow/config/*.
Note: multi-language model tracking does not work for now. After the upload script reports an error, you have to
manually track the model
[here](https://ml.azure.com/experiments?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&tid=b44be368-e4f2-4ade-a089-cd2825458048),
where you can find the run. Adhere to the naming conventions for model names and versions;
see [here](https://ml.azure.com/model/list?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&tid=b44be368-e4f2-4ade-a089-cd2825458048).