# LLM OVMS Microservice

The LLM OVMS microservice uses [OpenVINO Model Server](https://github.com/openvinotoolkit/model_server). It can generate text efficiently on Intel CPUs using a set of optimization techniques such as continuous batching, paged attention, prefix caching, speculative decoding, and many others.

---

## Table of Contents

1. [Start OVMS Microservice](#start-ovms-microservice)
2. [Start OPEA LLM Microservice](#start-opea-llm-microservice)
3. [Consume Microservice](#consume-microservice)
4. [Tips](#tips)

---

## Start OVMS Microservice

### Prepare Models

To start the OVMS service, you need to export models from the Hugging Face Hub to the OpenVINO IR format. This step optionally includes quantization, which speeds up service startup and avoids repeated downloads.

```bash
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/requirements.txt
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/0/demos/common/export_models/export_model.py -o export_model.py
mkdir models
python export_model.py text_generation --source_model Qwen/Qwen2-7B-Instruct --weight-format int8 --config_file_path models/config_llm.json --model_repository_path models --target_device CPU
```

Change the `source_model` as needed.

### Start the OVMS container

Replace `your_port` with the desired value to start the service.

```bash
your_port=8090
docker run -p $your_port:8000 -v ./models:/models --name ovms-llm-serving \
  openvino/model_server:2025.0 --port 8000 --config_path /models/config_llm.json
```

### Test the OVMS container

OVMS exposes a REST API compatible with the OpenAI API. Both the `completions` and `chat/completions` endpoints are supported. Run the following command to check if the service is up and running.

```bash
curl -s http://localhost:8090/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen2-7B-Instruct",
  "max_tokens": 30,
  "temperature": 0,
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What are the 3 main tourist attractions in Paris?"
    }
  ]
}'
```

---

## Start OPEA LLM Microservice

### Building the image

```bash
cd ../../../../../
docker build -t opea/llm-textgen-ovms:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/src/text-generation/Dockerfile .
```

To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.
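Before choosing an option, you can optionally confirm that the image built successfully by listing it by repository name (the tag `opea/llm-textgen-ovms:latest` matches the build command above):

```bash
# List the locally built image; an empty result means the build did not complete
docker images opea/llm-textgen-ovms
```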
### Option A: Run Docker with CLI

```bash
export LLM_ENDPOINT=http://localhost:8090
export MODEL_ID=Qwen/Qwen2-7B-Instruct
docker run -d --name="llm-ovms-server" -p 9000:9000 \
  -e MODEL_ID=${MODEL_ID} \
  -e LLM_COMPONENT_NAME=OpeaTextGenOVMS \
  -e OVMS_LLM_ENDPOINT=${LLM_ENDPOINT} \
  opea/llm-textgen-ovms:latest
```

### Option B: Run Docker with Docker Compose

```bash
export service_name="textgen-ovms"
cd comps/llms/deployment/docker_compose
docker compose -f compose_text-generation.yaml up ${service_name} -d
```

---

## Consume Microservice

### Check Service Status

```bash
curl http://localhost:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

### Consume LLM Service

Non-streaming request:

```bash
curl http://localhost:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json'
```

Streaming request:

```bash
curl http://localhost:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is Deep Learning?", "stream": true}' \
  -H 'Content-Type: application/json'
```

---

## Tips

1. Port mapping: Ensure the ports are correctly mapped to avoid conflicts with other services.
2. Model selection: Choose a model appropriate for your use case, such as "Qwen/Qwen2-7B-Instruct". It must be exported to the models repository and set in the `MODEL_ID` environment variable in the deployment of the OPEA API wrapper.
3. Models repository volume: The `-v ./models:/models` flag ensures the models directory is correctly mounted.
4. Select the correct configuration JSON file: The models repository can host multiple models. Choose the models to be served by selecting the right configuration file, `config_llm.json` in the example above.
5. Upload the models to a persistent volume claim in Kubernetes: The models repository with its configuration JSON file will be mounted in the OVMS containers when deployed via the [helm chart](../../../third_parties/ovms/deployment/kubernetes/README.md).
6. Learn more about the [OVMS chat/completions API](https://docs.openvino.ai/2025/model-server/ovms_docs_rest_api_chat.html).
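Related to tip 4: a quick way to check which servables the selected configuration file actually loads is OVMS's config status endpoint. This is a minimal sketch assuming the REST port mapping `8090` used above; the response format may vary between OVMS releases.

```bash
# Query the OVMS config status endpoint to list the servables
# loaded from the configuration file selected at startup.
curl -s http://localhost:8090/v1/config
```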