LLM Native Microservice
The LLM Native microservice uses optimum-habana for model initialization and warm-up, and focuses solely on large language models (LLMs). It runs inference directly with PyTorch, without serving frameworks such as TGI or vLLM, and supports only non-streaming output. This streamlined approach optimizes performance on Habana (Intel Gaudi) hardware.
🚀1. Start Microservice
1.1 Set Up Environment Variables
To start the Native LLM service, you first need to set the following environment variables.
For the LLM model, Qwen, Falcon3, and Phi4 models are supported. Users can select a different model by changing the LLM_MODEL_ID below.
export LLM_MODEL_ID="Qwen/Qwen2-7B-Instruct"
export HF_TOKEN="your_huggingface_token"
export TEXTGEN_PORT=10512
export LLM_COMPONENT_NAME="OpeaTextGenNative"
export host_ip=${host_ip}
Note: If you want to run "microsoft/Phi-4-multimodal-instruct", download the model weights manually, place them at /path/to/Phi-4-multimodal-instruct locally, and then set the following environment variables.
export LLM_MODEL_ID="/path/to/Phi-4-multimodal-instruct"
export LLM_COMPONENT_NAME="OpeaTextGenNativePhi4Multimodal"
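One possible way to fetch the weights is the huggingface-cli tool from the huggingface_hub package (shown only as an illustration; any method that places the full model repository at the chosen path works):
# Illustrative download command; adjust the target path to match LLM_MODEL_ID above.
huggingface-cli download microsoft/Phi-4-multimodal-instruct --local-dir /path/to/Phi-4-multimodal-instruct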
1.2 Build Docker Image
## For `Qwen` and `Falcon`
dockerfile_path="comps/llms/src/text-generation/Dockerfile.intel_hpu"
export image_name="opea/llm-textgen-gaudi:latest"
## For `Phi4`
# dockerfile_path="comps/llms/src/text-generation/Dockerfile.intel_hpu_phi4"
# export image_name="opea/llm-textgen-phi4-gaudi:latest"
cd ../../../../../
docker build -t $image_name --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f $dockerfile_path .
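Once the build completes, you can optionally confirm the image is available locally (a simple sanity check, not a required step):
# Lists the freshly built image if the build succeeded.
docker images $image_name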
To start a Docker container, you have two options:
A. Run Docker with CLI
B. Run Docker with Docker Compose
You can choose either one as needed.
1.3 Run Docker with CLI (Option A)
docker run -d --runtime=habana --name="llm-native-server" -p $TEXTGEN_PORT:9000 \
  -e https_proxy=$https_proxy -e http_proxy=$http_proxy \
  -e TOKENIZERS_PARALLELISM=false -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e HF_TOKEN=$HF_TOKEN \
  -e LLM_MODEL_ID=${LLM_MODEL_ID} -e LLM_COMPONENT_NAME=$LLM_COMPONENT_NAME \
  --cap-add=sys_nice --ipc=host \
  $image_name
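Model download and warm-up can take several minutes on the first start. You can follow the container logs to see when the service is ready (an optional convenience step):
# Stream the container logs; press Ctrl+C to stop following.
docker logs -f llm-native-server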
1.4 Run Docker with Docker Compose (Option B)
export service_name="textgen-native-gaudi"
# export service_name="textgen-native-phi4-gaudi" # For Phi-4-mini-instruct
# export service_name="textgen-native-phi4-multimodal-gaudi" # For Phi-4-multimodal-instruct
cd comps/llms/deployment/docker_compose
docker compose -f compose_text-generation.yaml up ${service_name} -d
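You can verify that the service came up, for example by listing the compose services and tailing the logs (illustrative commands, run from comps/llms/deployment/docker_compose):
# Show the status of the deployed service.
docker compose -f compose_text-generation.yaml ps
# Follow the service logs until warm-up finishes; press Ctrl+C to stop.
docker compose -f compose_text-generation.yaml logs -f ${service_name}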
🚀2. Consume LLM Service
2.1 Check Service Status
curl http://${host_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
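Note that 9000 is the container-side port; if you published a different host port (for example $TEXTGEN_PORT in the CLI command above), use that port from the host instead. If you want to block until the service is ready, a small polling loop works (illustrative sketch):
# Poll the health endpoint every 5 seconds until it returns success.
until curl -sf -o /dev/null http://${host_ip}:9000/v1/health_check; do
  echo "waiting for the LLM service..."
  sleep 5
done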
2.2 Consume LLM Service
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is Deep Learning?", "max_tokens":17}' \
  -H 'Content-Type: application/json'
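The service returns a JSON payload. If you have jq installed (optional, not required by the service), piping the response through it makes the output easier to read:
# Same request as above, with the JSON response pretty-printed by jq.
curl -s http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is Deep Learning?", "max_tokens":17}' \
  -H 'Content-Type: application/json' | jq .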
If you run a multimodal model such as Phi-4-multimodal-instruct, you can also try image or audio input.
# Image input
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is shown in this image?", "image_path":"/path/to/image", "max_tokens":17}' \
  -H 'Content-Type: application/json'
# Audio input
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"Based on the attached audio, generate a comprehensive text transcription of the spoken content.", "audio_path":"/path/to/audio", "max_tokens":17}' \
  -H 'Content-Type: application/json'
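Since the microservice runs inside a container, image_path and audio_path are most likely resolved inside that container, so the files need to be visible there. One simple way (illustrative only; bind-mounting a host directory at container start works just as well) is to copy the file in with docker cp and reference the in-container path in the request:
# Hypothetical file name and container name; adjust to your own media file and deployment.
docker cp ./example.png llm-native-server:/tmp/example.png
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"messages":"What is shown in this image?", "image_path":"/tmp/example.png", "max_tokens":17}' \
  -H 'Content-Type: application/json'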