LVM Microservice with vLLM on Gaudi

This service provides high-throughput, low-latency LVM (Large Vision Model) serving, accelerated by vLLM and optimized for Intel Gaudi HPUs.


Table of Contents

  1. Prerequisites

  2. Start Microservice

  3. Consume LVM Service


Prerequisites

Build vLLM Gaudi Docker Image

You must build the custom vllm-gaudi Docker image locally before starting the microservice.

git clone https://github.com/HabanaAI/vllm-fork.git
cd ./vllm-fork/
# This commit hash pins a known-working version; check the repository for newer ones.
git checkout f78aeb9da0712561163eddd353e3b6097cd69bac
docker build -f Dockerfile.hpu -t opea/vllm-gaudi:latest \
  --shm-size=128g \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  .
cd ..
rm -rf vllm-fork
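
The image can be built on any Docker host, but Gaudi HPUs are required at runtime. If you are building on the target machine, hl-smi (installed as part of the Gaudi driver stack) offers a quick check that the devices are visible:

hl-smi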

Start Microservice

Build LVM Docker Image

Build the generic LVM microservice Docker image:

# Return to the root of the repository that contains comps/ (the Dockerfile path below is relative to it)
cd ../../../
docker build -t opea/lvm:latest \
  --build-arg https_proxy=$https_proxy \
  --build-arg http_proxy=$http_proxy \
  -f comps/lvms/src/Dockerfile .
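
With both images built, you can optionally confirm they are present locally before deploying:

docker images | grep -E 'opea/(vllm-gaudi|lvm)'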

Run with Docker Compose

Deploy the vLLM service and the LVM microservice using Docker Compose.

  1. Export the required environment variables:

    export ip_address=$(hostname -I | awk '{print $1}')
    export LVM_PORT=9399
    export VLLM_PORT=11507
    export LVM_ENDPOINT="http://$ip_address:$VLLM_PORT"
    
    # Option 1: for LLaVA model
    export LLM_MODEL_ID=llava-hf/llava-1.5-7b-hf
    export CHAT_TEMPLATE=examples/template_llava.jinja
    
    # Option 2: for UI-TARS model
    # export LLM_MODEL_ID=bytedance-research/UI-TARS-7B-DPO
    # export TP_SIZE=1    # change to 4 or 8 if using UI-TARS-72B-DPO
    # export CHAT_TEMPLATE=None
    
    # Skip warmup for faster server start on Gaudi (may increase initial inference time)
    export VLLM_SKIP_WARMUP=true
    
  2. Navigate to the Docker Compose directory and start the services:

    cd comps/lvms/deployment/docker_compose/
    docker compose up vllm-gaudi-service lvm-vllm-gaudi -d
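
Optionally, confirm that both containers are running and wait for the vLLM server to finish loading the model; the first start downloads the model weights, which can take a while. The vLLM server's /health endpoint returns an empty 200 response once it is ready (this assumes the compose file publishes $VLLM_PORT on the host):

docker compose ps
docker compose logs vllm-gaudi-service | tail -n 20
curl http://localhost:$VLLM_PORT/health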
    

Consume LVM Service

Once the services are running, you can send requests to the LVM API.

Use the LVM Service API

Send a POST request containing a base64-encoded image and a text prompt.

curl http://localhost:9399/v1/lvm \
  -X POST \
  -d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}' \
  -H 'Content-Type: application/json'
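
To query with your own image, base64-encode it first. A minimal sketch, assuming a local file ./image.png and that the environment variables from the deployment step are still set (base64 -w 0 disables line wrapping on Linux; on macOS use base64 -i ./image.png instead):

# Encode the image without line wraps so it embeds cleanly in JSON
IMAGE_B64=$(base64 -w 0 ./image.png)
curl http://localhost:$LVM_PORT/v1/lvm \
  -X POST \
  -H 'Content-Type: application/json' \
  -d "{\"image\": \"$IMAGE_B64\", \"prompt\": \"Describe this image.\"}"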