Keyword-Service

A service that extracts keywords from a single paragraph or from a whole document.

API

REST

The service provides endpoints to extract keywords from a text and to embed a text. For details, download the OpenAPI schema and view it in a browser.
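
The schema can be fetched directly from a running instance. A minimal sketch, assuming the service keeps FastAPI's default schema route (/openapi.json is not named in this README) and the local host/port from the Usage section below:

import json
import requests

# Assumption: the service exposes FastAPI's default schema route /openapi.json.
# Host and port are taken from the local setup described in the Usage section.
schema = requests.get("http://127.0.0.1:8001/openapi.json", timeout=10).json()

with open("openapi.json", "w", encoding="utf-8") as fh:
    json.dump(schema, fh, indent=2)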

RabbitMQ

The service listens to a queue and processes incoming messages. This method is meant for extracting keywords from whole documents. All RabbitMQ parameters, including the queue names, are set via environment variables; refer to the service's Helm chart for more information.

The input message should be a JSON object with the following structure:

{
  "targetFilePath": string,
  "responseFilePath": string
}
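
A minimal sketch of publishing such a message with pika; the queue name, host, and file paths are placeholders, since the real values come from the environment variables mentioned above:

import json
import pika

# Placeholders: the actual host, credentials, and queue name are configured
# via the environment variables / Helm chart described above.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

message = {
    "targetFilePath": "documents/example-document.json",           # file to process
    "responseFilePath": "results/example-document.keywords.json",  # where the result is written
}

channel.basic_publish(exchange="", routing_key="keyword-extraction", body=json.dumps(message))
connection.close()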

The service downloads the file specified in targetFilePath. Supported data structures for the target file are (an illustrative sketch follows this list):

  • simplified text data (signifier key: paragraphs)
  • structure object data (signifier key: structureObjects)
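
Illustrative only: this README specifies just the top-level signifier keys, so the shape of the individual entries below is an assumption, not the actual schema:

# Hypothetical sketch of the two target file layouts; only the top-level keys
# ("paragraphs" / "structureObjects") are taken from this README.
simplified_text_data = {
    "paragraphs": [
        {"paragraphId": 0, "text": "First paragraph of the document ..."},
        {"paragraphId": 1, "text": "Second paragraph ..."},
    ]
}

structure_object_data = {
    "structureObjects": [
        {"paragraphId": 0, "text": "First paragraph ...", "type": "paragraph"},  # assumed fields
    ]
}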

In response, the service uploads a JSON file to the location given in responseFilePath, with the result stored under the data key. The structure of the response file is as follows:

{
  "targetFilePath": string,
  "responseFilePath": string,
  // ... any further fields present in the input message are passed through
  "data": [
    {
      "keywords": Array[string],
      "paragraphId": int,
      "embedding": Array[float]  // 384 dimensions
    }
  ]
}

Note that

  • the embedding key is optional and may be omitted; the service does not calculate embeddings if the environment variable MODEL__COMPUTE_EMBEDDINGS is set to false.
  • the service also computes keywords for the whole document; in this case, paragraphId is set to -1 (see the reading sketch below).
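
A minimal sketch of how a consumer might read such a response file and separate the document-level keywords (paragraphId -1) from the per-paragraph ones; the file path is a placeholder and the download of the file itself is omitted:

import json

# Placeholder path; in practice this is the file the service uploaded to responseFilePath.
with open("results/example-document.keywords.json", encoding="utf-8") as fh:
    response = json.load(fh)

document_keywords = []
paragraph_keywords = {}
for entry in response["data"]:
    if entry["paragraphId"] == -1:
        # paragraphId -1 marks the keywords computed for the whole document
        document_keywords = entry["keywords"]
    else:
        paragraph_keywords[entry["paragraphId"]] = entry["keywords"]
    # entry.get("embedding") holds the 384-dimensional vector, if embeddings were computed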

Service Configuration

The service is configured via environment variables. The following variables are available:

  • LOGGING__LEVEL: Logging level (default: INFO)
  • MODEL__MAX_KEYWORDS_PER_PARAGRAPH: Maximum number of keywords per paragraph (default: 5)
  • MODEL__MAX_KEYWORDS_PER_DOCUMENT: Maximum number of keywords per document; when set to 0, no keywords are extracted (default: 0)
  • MODEL__COMPUTE_EMBEDDINGS: Whether to compute keyword embeddings (default: true)
  • MODEL__PREPROCESSING__MIN_PARAGRAPH_LENGTH: Minimum number of characters a paragraph must contain to be considered for keyword extraction (default: 1)
  • MODEL__POSTPROCESSING__FILTER_SUBWORDS: Whether to filter out subwords from the keywords (default: true)

Note that these variables are subject to change; for the most recent configuration, refer to the service's Helm chart.
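
The double underscore in the variable names indicates nested configuration sections (LOGGING, MODEL, MODEL.PREPROCESSING, ...). This README does not state which settings library the service uses, so the following is only a hedged illustration of how such variables are commonly mapped, here with pydantic-settings:

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class ModelSettings(BaseModel):
    max_keywords_per_paragraph: int = 5
    max_keywords_per_document: int = 0
    compute_embeddings: bool = True

class Settings(BaseSettings):
    # "__" splits an environment variable into section and field,
    # e.g. MODEL__COMPUTE_EMBEDDINGS -> settings.model.compute_embeddings
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    model: ModelSettings = ModelSettings()

# With MODEL__MAX_KEYWORDS_PER_PARAGRAPH=10 set in the environment,
# Settings().model.max_keywords_per_paragraph evaluates to 10.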

Language

Currently, there are an English, a German, and a multi-language model for keyword extraction. The models are uploaded to MLflow, and the model to use can be set in the Dockerfile when building the container:

Example for the German model:

ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
ENV AZURE_AML_MODEL_NAME="keyword-extraction-de"
ENV AZURE_AML_MODEL_VERSION="1"

Example for the English model:

ENV AZURE_RESOURCE_GROUP="mriedl"
ENV AZURE_AML_WORKSPACE="azureml-ws"
# Model name assumed by analogy with the German example; verify the exact name in MLflow.
ENV AZURE_AML_MODEL_NAME="keyword-extraction-en"
ENV AZURE_AML_MODEL_VERSION="1"

Usage

Two Options:

  1. REST: send the text in a request to the endpoint; the endpoint returns the keywords
  2. Queue: the service gets the text from the queue, the model calculates the keywords, and the result is saved as described in the RabbitMQ section above

To test the REST endpoint, you have to set up an environment and run poetry install (see https://gitlab.knecon.com/knecon/research/template-python-project for details on setting up Poetry). Then run

python ./src/serve.py 

You don't need to start a queue for that; just ignore the AMQP error. Port and host are set in settings.toml. You can use the FastAPI docs UI under 127.0.0.1:8001/docs to send requests to the endpoint.
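
Once serve.py is running, the endpoint can also be exercised from Python. The endpoint path and payload below are assumptions for illustration; the authoritative contract is the OpenAPI schema / docs UI mentioned above:

import requests

# The path "/keywords" and the request body are illustrative assumptions;
# check 127.0.0.1:8001/docs for the actual endpoint names and schemas.
response = requests.post(
    "http://127.0.0.1:8001/keywords",
    json={"text": "Keyword extraction reduces a paragraph to a handful of descriptive terms."},
    timeout=30,
)
print(response.status_code, response.json())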

You can also test the service with Docker:

Run Docker Commands

docker build -t ${IMAGE_NAME} -f Dockerfile --build-arg GITLAB_USER=${GITLAB_USER} \
                                            --build-arg GITLAB_ACCESS_TOKEN=${GITLAB_ACCESS_TOKEN} \
                                            --build-arg AZURE_TENANT_ID=${AZURE_TENANT_ID} \
                                            --build-arg AZURE_SUBSCRIPTION_ID=${AZURE_SUBSCRIPTION_ID} \
                                            --build-arg AZURE_CLIENT_ID=${AZURE_CLIENT_ID} \
                                            --build-arg AZURE_CLIENT_SECRET=${AZURE_CLIENT_SECRET} \
                                            --build-arg AZURE_AML_MODEL_VERSION=${AZURE_AML_MODEL_VERSION} \
                                            --build-arg AZURE_AML_MODEL_NAME=${AZURE_AML_MODEL_NAME} \
                                            --build-arg AZURE_RESOURCE_GROUP=${AZURE_RESOURCE_GROUP} \
                                            --build-arg AZURE_AML_WORKSPACE=${AZURE_AML_WORKSPACE} .
docker run --net=host -it --rm --name ${CONTAINER_NAME} ${IMAGE_NAME}

Run locally

First you need to download the model from MLflow. This can be done with the "src/ml_flow/download_model.py" script. The script downloads a model and copies the config and model data to the expected locations so that the model can be loaded.

For running/testing the keyword extraction locally, you can use the src/tests/test_process.py script.

The model is stored and loaded via DVC; you need the connection string, which can be found under https://portal.azure.com/#@knecon.com/resource/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourceGroups/taas-rg/providers/Microsoft.Storage/storageAccounts/taassaracer/keys

Upload models to MLflow

To upload the models to MLflow, you can use the following script: src/mlflow/upload_model.py. For authentication, the following environment variables need to be set:

#AZURE_TENANT_ID=""
#AZURE_SUBSCRIPTION_ID=""
#AZURE_CLIENT_ID=""
#AZURE_CLIENT_SECRET=""

Additional settings (resource group, experiment name, etc.) can be specified in the config file (./src/mlflow/config/azure_config.toml). The upload_model.py script has the following parameters:

options:
  -h, --help            show this help message and exit
  -a AZURE_CONFIG, --azure_config AZURE_CONFIG
                        Location of the configuration file for Azure (default: src/mlflow/config/azure_config.toml)
  -b BASE_CONFIG, --base_config BASE_CONFIG
                        Location of the basic training configuration (default: src/mlflow/config/settings_de.toml)
  

The base config contains all information about the models used. Examples for German and English are placed in /src/mlflow/config/.

Note: Multi-language model tracking does not currently work. After the upload script reports an error, you have to track the model manually here, where you can find the run. Adhere to the naming conventions for the model name and versions; see here.