{
"body": "<section id=\"keyword-service\">\n<h1>Keyword-Service<a class=\"headerlink\" href=\"#keyword-service\" title=\"Link to this heading\">#</a></h1>\n<p>Service to extract keywords from a paragraph or a whole document.</p>\n<!-- TOC --><ul class=\"simple\">\n<li><p><a class=\"reference external\" href=\"#keyword-service\">Keyword-Service</a></p>\n<ul>\n<li><p><a class=\"reference external\" href=\"#api\">API</a></p>\n<ul>\n<li><p><a class=\"reference external\" href=\"#rest\">REST</a></p></li>\n<li><p><a class=\"reference external\" href=\"#rabbitmq\">RabbitMQ</a></p></li>\n</ul>\n</li>\n<li><p><a class=\"reference external\" href=\"#service-configuration\">Service Configuration</a></p></li>\n<li><p><a class=\"reference external\" href=\"#language\">Language</a></p></li>\n<li><p><a class=\"reference external\" href=\"#usage\">Usage</a></p>\n<ul>\n<li><p><a class=\"reference external\" href=\"#run-docker-commands\">Run Docker Commands</a></p></li>\n<li><p><a class=\"reference external\" href=\"#run-locally\">Run locally</a></p></li>\n</ul>\n</li>\n</ul>\n</li>\n<li><p><a class=\"reference external\" href=\"#upload-models-to-ml-flow\">Upload models to ML Flow</a></p></li>\n</ul>\n<!-- TOC --><section id=\"api\">\n<h2>API<a class=\"headerlink\" href=\"#api\" title=\"Link to this heading\">#</a></h2>\n<section id=\"rest\">\n<h3>REST<a class=\"headerlink\" href=\"#rest\" title=\"Link to this heading\">#</a></h3>\n<p>The service provides endpoints to extract keywords from a text and to embed a text. For details, download the\n<a class=\"reference external\" href=\"docs/openapi_redoc.html\">OpenAPI schema</a> and view it in a browser.</p>\n</section>\n<section id=\"rabbitmq\">\n<h3>RabbitMQ<a class=\"headerlink\" href=\"#rabbitmq\" title=\"Link to this heading\">#</a></h3>\n<p>The service listens to a queue and processes the messages. This method is meant to be used for extracting keywords from\nwhole documents. 
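</p>
<p>As an illustration, a valid input message can be built and serialized in Python. This is a hedged sketch, not part of the service: the helper name and the file paths below are made-up examples, and publishing would additionally need an AMQP client such as pika.</p>

```python
import json

def build_message(target_file_path, response_file_path):
    # Both fields are required by the service; values are plain path strings.
    if not target_file_path or not response_file_path:
        raise ValueError('both targetFilePath and responseFilePath are required')
    return json.dumps({
        'targetFilePath': target_file_path,
        'responseFilePath': response_file_path,
    })

# The resulting JSON string is the message body published to the queue,
# e.g. with pika: channel.basic_publish(exchange='', routing_key=queue, body=body)
# (queue name and paths here are assumptions, not service defaults)
body = build_message('documents/report.json', 'results/report.json')
```

<p>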
All RabbitMQ parameters, including the queue names, are set via environment variables; refer to the\nservice's respective HELM chart for more information.</p>\n<p>The input message should be a JSON object with the following structure:</p>\n<div class=\"highlight-json notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"p\">{</span>\n<span class=\"w\"> </span><span class=\"nt\">&quot;targetFilePath&quot;</span><span class=\"p\">:</span><span class=\"w\"> </span>string<span class=\"p\">,</span>\n<span class=\"w\"> </span><span class=\"nt\">&quot;responseFilePath&quot;</span><span class=\"p\">:</span><span class=\"w\"> </span>string\n<span class=\"p\">}</span>\n</pre></div>\n</div>\n<p>The service downloads the file specified in <code class=\"docutils literal notranslate\"><span class=\"pre\">targetFilePath</span></code>. Supported data structures for the target file are:</p>\n<ul class=\"simple\">\n<li><p>simplified text data (signifier key: <code class=\"docutils literal notranslate\"><span class=\"pre\">paragraphs</span></code>)</p></li>\n<li><p>structure object data (signifier key: <code class=\"docutils literal notranslate\"><span class=\"pre\">structureObjects</span></code>)</p></li>\n</ul>\n<p>As a response, the service uploads a JSON-structured file (as defined in <code class=\"docutils literal notranslate\"><span class=\"pre\">responseFilePath</span></code>) with the result under the\n<code class=\"docutils literal notranslate\"><span class=\"pre\">data</span></code> key. 
The structure of the response file is as follows:</p>\n<div class=\"highlight-javascript notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"p\">{</span>\n<span class=\"w\">  </span><span class=\"s2\">&quot;targetFilePath&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"nx\">string</span><span class=\"p\">,</span>\n<span class=\"w\">  </span><span class=\"s2\">&quot;responseFilePath&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"nx\">string</span><span class=\"p\">,</span>\n<span class=\"w\">  </span><span class=\"c1\">// plus any further fields present in the input message</span>\n<span class=\"w\">  </span><span class=\"s2\">&quot;data&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"p\">[</span>\n<span class=\"w\">    </span><span class=\"p\">{</span>\n<span class=\"w\">      </span><span class=\"s2\">&quot;keywords&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"nb\">Array</span><span class=\"p\">[</span><span class=\"nx\">string</span><span class=\"p\">],</span>\n<span class=\"w\">      </span><span class=\"s2\">&quot;paragraphId&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"kr\">int</span><span class=\"p\">,</span>\n<span class=\"w\">      </span><span class=\"s2\">&quot;embedding&quot;</span><span class=\"o\">:</span><span class=\"w\"> </span><span class=\"nb\">Array</span><span class=\"p\">[</span><span class=\"kr\">float</span><span class=\"p\">]</span><span class=\"w\"> </span><span class=\"c1\">// 384 dimensions</span>\n<span class=\"w\">    </span><span class=\"p\">}</span>\n<span class=\"w\">  </span><span class=\"p\">]</span>\n<span class=\"p\">}</span>\n</pre></div>\n</div>\n<p><strong>Note</strong> that</p>\n<ul class=\"simple\">\n<li><p>the <code class=\"docutils literal notranslate\"><span class=\"pre\">embedding</span></code> key is optional and can be omitted. 
The service will not calculate the embedding if the environment\nvariable <code class=\"docutils literal notranslate\"><span class=\"pre\">MODEL__COMPUTE_EMBEDDINGS</span></code> is set to <code class=\"docutils literal notranslate\"><span class=\"pre\">false</span></code>.</p></li>\n<li><p>the service also computes the keywords for the whole document. In this case, the <code class=\"docutils literal notranslate\"><span class=\"pre\">paragraphId</span></code> is set to <code class=\"docutils literal notranslate\"><span class=\"pre\">-1</span></code>.</p></li>\n</ul>\n</section>\n</section>\n<section id=\"service-configuration\">\n<h2>Service Configuration<a class=\"headerlink\" href=\"#service-configuration\" title=\"Link to this heading\">#</a></h2>\n<p>The service is configured via environment variables. The following variables are available:</p>\n<table class=\"docutils align-default\">\n<thead>\n<tr class=\"row-odd\"><th class=\"head\"><p>Variable</p></th>\n<th class=\"head\"><p>Description</p></th>\n<th class=\"head\"><p>Default</p></th>\n</tr>\n</thead>\n<tbody>\n<tr class=\"row-even\"><td><p>LOGGING__LEVEL</p></td>\n<td><p>Logging level</p></td>\n<td><p>INFO</p></td>\n</tr>\n<tr class=\"row-odd\"><td><p>MODEL__MAX_KEYWORDS_PER_PARAGRAPH</p></td>\n<td><p>Maximum number of keywords per paragraph</p></td>\n<td><p>5</p></td>\n</tr>\n<tr class=\"row-even\"><td><p>MODEL__MAX_KEYWORDS_PER_DOCUMENT</p></td>\n<td><p>Maximum number of keywords per document; when set to 0, no document-level keywords are extracted</p></td>\n<td><p>0</p></td>\n</tr>\n<tr class=\"row-odd\"><td><p>MODEL__COMPUTE_EMBEDDINGS</p></td>\n<td><p>Whether to compute keyword embeddings or not</p></td>\n<td><p>true</p></td>\n</tr>\n<tr class=\"row-even\"><td><p>MODEL__PREPROCESSING__MIN_PARAGRAPH_LENGTH</p></td>\n<td><p>Minimum number of characters in a paragraph to be considered for keyword extraction</p></td>\n<td><p>1</p></td>\n</tr>\n<tr class=\"row-odd\"><td><p>MODEL__POSTPROCESSING__FILTER_SUBWORDS</p></td>\n<td><p>Whether to filter out subwords from the keywords or not</p></td>\n<td><p>true</p></td>\n</tr>\n</tbody>\n</table>\n<p><strong>NOTE</strong> that these variables are subject to change. 
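</p>
<p>The double-underscore names above suggest a nested configuration layout. The helper below is a purely illustrative sketch of that mapping; it is an assumption about the convention, not the service's actual code, which may use a settings library such as pydantic-settings or Dynaconf instead.</p>

```python
def nested_settings(environ):
    # Split names like MODEL__COMPUTE_EMBEDDINGS at the double underscore
    # and build a nested dict: {'model': {'compute_embeddings': ...}}.
    settings = {}
    for key, value in environ.items():
        node = settings
        parts = key.lower().split('__')
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return settings

example = nested_settings({
    'LOGGING__LEVEL': 'INFO',
    'MODEL__MAX_KEYWORDS_PER_PARAGRAPH': '5',
    'MODEL__COMPUTE_EMBEDDINGS': 'true',
})
# example == {'logging': {'level': 'INFO'},
#             'model': {'max_keywords_per_paragraph': '5',
#                       'compute_embeddings': 'true'}}
```

<p>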
For the most recent configuration, refer to the service's respective\nHELM chart.</p>\n</section>\n<section id=\"language\">\n<h2>Language<a class=\"headerlink\" href=\"#language\" title=\"Link to this heading\">#</a></h2>\n<p>Currently, there are an English, a German, and a multi-language model for keyword extraction. The models are uploaded to\nMLflow, and the model to use can\nbe set in the Dockerfile when building the container:</p>\n<p>Example for the German model:</p>\n<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"n\">ENV</span> <span class=\"n\">AZURE_RESOURCE_GROUP</span><span class=\"o\">=</span><span class=\"s2\">&quot;mriedl&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_WORKSPACE</span><span class=\"o\">=</span><span class=\"s2\">&quot;azureml-ws&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_MODEL_NAME</span><span class=\"o\">=</span><span class=\"s2\">&quot;keyword-extraction-de&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_MODEL_VERSION</span><span class=\"o\">=</span><span class=\"s2\">&quot;1&quot;</span>\n</pre></div>\n</div>\n<p>Example for the English model:</p>\n<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"n\">ENV</span> <span class=\"n\">AZURE_RESOURCE_GROUP</span><span class=\"o\">=</span><span class=\"s2\">&quot;mriedl&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_WORKSPACE</span><span class=\"o\">=</span><span class=\"s2\">&quot;azureml-ws&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_MODEL_NAME</span><span class=\"o\">=</span><span class=\"s2\">&quot;keyword-extraction-en&quot;</span>\n<span class=\"n\">ENV</span> <span class=\"n\">AZURE_AML_MODEL_VERSION</span><span class=\"o\">=</span><span class=\"s2\">&quot;1&quot;</span>\n</pre></div>\n</div>\n</section>\n<section id=\"usage\">\n<h2>Usage<a class=\"headerlink\" href=\"#usage\" title=\"Link to this heading\">#</a></h2>\n<p><strong>Two 
Options:</strong></p>\n<ol class=\"simple\">\n<li><p>REST: send the text in a request to the endpoint; the endpoint returns the keywords</p></li>\n<li><p>Queue: the service reads the text from a queue, the model calculates the keywords, and the result is written to the response file</p></li>\n</ol>\n<p>To test the REST endpoint, set up an environment and run poetry install (\nsee https://gitlab.knecon.com/knecon/research/template-python-project for details on setting up Poetry).\nThen run</p>\n<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"n\">python</span> <span class=\"o\">./</span><span class=\"n\">src</span><span class=\"o\">/</span><span class=\"n\">serve</span><span class=\"o\">.</span><span class=\"n\">py</span> \n</pre></div>\n</div>\n<p>You don\u2019t need to start a queue for this; just ignore the AMQP error.\nPort and host are set in settings.toml.\nYou can use the FastAPI docs UI at 127.0.0.1:8001/docs to send requests to the endpoint.</p>\n<p>You can also test the service with Docker:</p>\n<section id=\"run-docker-commands\">\n<h3>Run Docker Commands<a class=\"headerlink\" href=\"#run-docker-commands\" title=\"Link to this heading\">#</a></h3>\n<div class=\"highlight-bash notranslate\"><div class=\"highlight\"><pre><span></span>docker<span class=\"w\"> </span>build<span class=\"w\"> </span>-t<span class=\"w\"> </span><span class=\"si\">${</span><span class=\"nv\">IMAGE_NAME</span><span class=\"si\">}</span><span class=\"w\"> </span>-f<span class=\"w\"> </span>Dockerfile<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">GITLAB_USER</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">GITLAB_USER</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">GITLAB_ACCESS_TOKEN</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">GITLAB_ACCESS_TOKEN</span><span 
class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_TENANT_ID</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_TENANT_ID</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_SUBSCRIPTION_ID</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_SUBSCRIPTION_ID</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_CLIENT_ID</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_CLIENT_ID</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_CLIENT_SECRET</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_CLIENT_SECRET</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_AML_MODEL_VERSION</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_AML_MODEL_VERSION</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_AML_MODEL_NAME</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_AML_MODEL_NAME</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_RESOURCE_GROUP</span><span class=\"o\">=</span><span class=\"si\">${</span><span 
class=\"nv\">AZURE_RESOURCE_GROUP</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"se\">\\</span>\n<span class=\"w\"> </span>--build-arg<span class=\"w\"> </span><span class=\"nv\">AZURE_AML_WORKSPACE</span><span class=\"o\">=</span><span class=\"si\">${</span><span class=\"nv\">AZURE_AML_WORKSPACE</span><span class=\"si\">}</span>\n</pre></div>\n</div>\n<div class=\"highlight-bash notranslate\"><div class=\"highlight\"><pre><span></span>docker<span class=\"w\"> </span>run<span class=\"w\"> </span>--net<span class=\"o\">=</span>host<span class=\"w\"> </span>-it<span class=\"w\"> </span>--rm<span class=\"w\"> </span>--name<span class=\"w\"> </span><span class=\"si\">${</span><span class=\"nv\">CONTAINER_NAME</span><span class=\"si\">}</span><span class=\"w\"> </span><span class=\"si\">${</span><span class=\"nv\">IMAGE_NAME</span><span class=\"si\">}</span>\n</pre></div>\n</div>\n</section>\n<section id=\"run-locally\">\n<h3>Run locally<a class=\"headerlink\" href=\"#run-locally\" title=\"Link to this heading\">#</a></h3>\n<p>First you need to download the model from mlflow. 
This can be done with the <em>\u201csrc/ml_flow/download_model.py\u201d</em> script.\nThis script downloads a model and copies the config and model data to the expected locations so that the model can\nbe loaded.</p>\n<p>For running or testing the keyword extraction locally, you can use the <em>src/tests/test_process.py</em> script.</p>\n<p>The model is stored and loaded via DVC; you need the connection string from\nhttps://portal.azure.com/#@knecon.com/resource/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourceGroups/taas-rg/providers/Microsoft.Storage/storageAccounts/taassaracer/keys</p>\n</section>\n</section>\n</section>\n<section id=\"upload-models-to-ml-flow\">\n<h1>Upload models to ML Flow<a class=\"headerlink\" href=\"#upload-models-to-ml-flow\" title=\"Link to this heading\">#</a></h1>\n<p>To upload the models to mlflow, you can use the following script: <em>src/mlflow/upload_model.py</em>.\nFor authentication, the following environment variables need to be set:</p>\n<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"c1\">#AZURE_TENANT_ID=&quot;&quot;</span>\n<span class=\"c1\">#AZURE_SUBSCRIPTION_ID=&quot;&quot;</span>\n<span class=\"c1\">#AZURE_CLIENT_ID=&quot;&quot;</span>\n<span class=\"c1\">#AZURE_CLIENT_SECRET=&quot;&quot;</span>\n</pre></div>\n</div>\n<p>Additional settings (resource group, experiment name, etc.) 
can be specified in the config (\n<em>./src/mlflow/config/azure_config.toml</em>).\nThe <em>upload_model.py</em> has the following parameters:</p>\n<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"n\">options</span><span class=\"p\">:</span>\n <span class=\"o\">-</span><span class=\"n\">h</span><span class=\"p\">,</span> <span class=\"o\">--</span><span class=\"n\">help</span> <span class=\"n\">show</span> <span class=\"n\">this</span> <span class=\"n\">help</span> <span class=\"n\">message</span> <span class=\"ow\">and</span> <span class=\"n\">exit</span>\n <span class=\"o\">-</span><span class=\"n\">a</span> <span class=\"n\">AZURE_CONFIG</span><span class=\"p\">,</span> <span class=\"o\">--</span><span class=\"n\">azure_config</span> <span class=\"n\">AZURE_CONFIG</span>\n <span class=\"n\">Location</span> <span class=\"n\">of</span> <span class=\"n\">the</span> <span class=\"n\">configuration</span> <span class=\"n\">file</span> <span class=\"k\">for</span> <span class=\"n\">Azure</span> <span class=\"p\">(</span><span class=\"n\">default</span><span class=\"p\">:</span> <span class=\"n\">src</span><span class=\"o\">/</span><span class=\"n\">mlflow</span><span class=\"o\">/</span><span class=\"n\">config</span><span class=\"o\">/</span><span class=\"n\">azure_config</span><span class=\"o\">.</span><span class=\"n\">toml</span><span class=\"p\">)</span>\n <span class=\"o\">-</span><span class=\"n\">b</span> <span class=\"n\">BASE_CONFIG</span><span class=\"p\">,</span> <span class=\"o\">--</span><span class=\"n\">base_config</span> <span class=\"n\">BASE_CONFIG</span>\n <span class=\"n\">Location</span> <span class=\"n\">of</span> <span class=\"n\">the</span> <span class=\"n\">basic</span> <span class=\"n\">training</span> <span class=\"n\">configuration</span> <span class=\"p\">(</span><span class=\"n\">default</span><span class=\"p\">:</span> <span class=\"n\">src</span><span class=\"o\">/</span><span 
class=\"n\">mlflow</span><span class=\"o\">/</span><span class=\"n\">config</span><span class=\"o\">/</span><span class=\"n\">settings_de</span><span class=\"o\">.</span><span class=\"n\">toml</span><span class=\"p\">)</span>\n \n</pre></div>\n</div>\n<p>The base config contains all information for the models used. Examples for German and\nEnglish are placed in <em>/src/mlflow/config/</em>.</p>\n<p>Note: multi-language model tracking does not work at the moment. If the upload script reports an error, you have to\ntrack the\nmodel manually <a class=\"reference external\" href=\"https://ml.azure.com/experiments?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&amp;tid=b44be368-e4f2-4ade-a089-cd2825458048\">here</a>,\nwhere you can find the run. Adhere to the naming conventions for the model name and versions;\nsee <a class=\"reference external\" href=\"https://ml.azure.com/model/list?wsid=/subscriptions/4b9531fc-c5e4-4b11-8492-0cc173c1f97d/resourcegroups/fforesight-rg/providers/Microsoft.MachineLearningServices/workspaces/ff-aml-main&amp;tid=b44be368-e4f2-4ade-a089-cd2825458048\">here</a>.</p>\n</section>\n",
"title": "Keyword-Service",
"sourcename": "README.md.txt",
"current_page_name": "README",
"toc": "<ul>\n<li><a class=\"reference internal\" href=\"#\">Keyword-Service</a><ul>\n<li><a class=\"reference internal\" href=\"#api\">API</a><ul>\n<li><a class=\"reference internal\" href=\"#rest\">REST</a></li>\n<li><a class=\"reference internal\" href=\"#rabbitmq\">RabbitMQ</a></li>\n</ul>\n</li>\n<li><a class=\"reference internal\" href=\"#service-configuration\">Service Configuration</a></li>\n<li><a class=\"reference internal\" href=\"#language\">Language</a></li>\n<li><a class=\"reference internal\" href=\"#usage\">Usage</a><ul>\n<li><a class=\"reference internal\" href=\"#run-docker-commands\">Run Docker Commands</a></li>\n<li><a class=\"reference internal\" href=\"#run-locally\">Run locally</a></li>\n</ul>\n</li>\n</ul>\n</li>\n<li><a class=\"reference internal\" href=\"#upload-models-to-ml-flow\">Upload models to ML Flow</a></li>\n</ul>\n",
"page_source_suffix": ".md"
}