LLM Native Microservice
The LLM Native microservice uses optimum-habana for model initialization and warm-up, focusing solely on large language models (LLMs). It runs inference directly with PyTorch rather than through serving frameworks such as TGI or vLLM, and supports only non-streaming responses. This streamlined approach optimizes performance on Habana hardware.
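For readers unfamiliar with optimum-habana, the following is a minimal sketch of what direct PyTorch initialization, warm-up, and inference on Gaudi typically looks like. It is illustrative only, not the microservice's actual implementation, and the default model name is an assumption mirroring the environment variable described below.
import os
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

# Patch transformers with Gaudi-optimized generation code paths.
adapt_transformers_to_gaudi()

# The model name mirrors the LLM_NATIVE_MODEL variable described below (assumed default).
model_name = os.getenv("LLM_NATIVE_MODEL", "Qwen/Qwen2-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("hpu")
model.eval()

# Warm-up: run one short generation so graph compilation happens before real traffic arrives.
inputs = tokenizer("Hello", return_tensors="pt").to("hpu")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)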
🚀1. Start Microservice
If you start the LLM microservice with Docker, the docker_compose_llm.yaml file will automatically launch the Native LLM service in a container.
1.1 Set Up Environment Variables
To start the Native LLM service, you need to set up the following environment variables first. Both Qwen and Falcon3 models are supported; users can switch models by changing LLM_NATIVE_MODEL below.
export LLM_NATIVE_MODEL="Qwen/Qwen2-7B-Instruct"
export HUGGINGFACEHUB_API_TOKEN="your_huggingface_token"
1.2 Build Docker Image
cd ../../../../../
docker build -t opea/llm-native:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/src/text-generation/Dockerfile .
To start a Docker container, you have two options:
A. Run Docker with CLI
B. Run Docker with Docker Compose
You can choose either one as needed.
1.3 Run Docker with CLI (Option A)
docker run -d --runtime=habana --name="llm-native-server" -p 9000:9000 -e https_proxy=$https_proxy -e http_proxy=$http_proxy -e TOKENIZERS_PARALLELISM=false -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host -e LLM_NATIVE_MODEL=${LLM_NATIVE_MODEL} opea/llm-native:latest
1.4 Run Docker with Docker Compose (Option B)
docker compose -f docker_compose_llm.yaml up -d
🚀2. Consume LLM Service
2.1 Check Service Status
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
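Equivalently, the health check can be probed from Python. This is only a convenience sketch that assumes the service is reachable at localhost:9000 and that the requests package is installed.
import requests

# Assumes the service runs on localhost:9000; substitute your host as needed.
resp = requests.get("http://localhost:9000/v1/health_check", timeout=10)
print(resp.status_code, resp.text)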
2.2 Consume LLM Service
curl http://${your_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json' 
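The same request can be issued from a Python client. The sketch below mirrors the curl payload above and assumes the service is on localhost:9000 and that the requests package is installed.
import requests

# Same payload as the curl example above.
payload = {"messages": "What is Deep Learning?"}
resp = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json=payload,
    timeout=120,
)
# The Native LLM service is non-streaming, so the full answer arrives in a single response body.
print(resp.json())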