Single node on-prem deployment on Gaudi AI Accelerator

This section covers single-node on-prem deployment of the CodeGen example. It will show how to deploy an end-to-end CodeGen solution with the Qwen2.5-Coder-32B-Instruct model running on Intel® Gaudi® AI Accelerators. To quickly learn about OPEA and set up the required hardware and software, follow the instructions in the Getting Started section.

Overview

The CodeGen use case uses a single LLM microservice, with model serving handled by vLLM or TGI.

This solution is designed to demonstrate the use of the Qwen2.5-Coder-32B-Instruct model for code generation on Intel® Gaudi® AI Accelerators. The steps will involve setting up Docker containers, taking text input as the prompt, and generating code. Although multiple versions of the UI can be deployed, this tutorial will focus solely on the default version.

Prerequisites

To run the UI in a web browser external to the host machine, such as a laptop, the following port(s) need to be forwarded when using SSH to log in to the host machine:

  • 7778: CodeGen megaservice port

This port is used by BACKEND_SERVICE_ENDPOINT, which is defined in set_env.sh inside the docker_compose folder for this example. Specifically, for CodeGen, append the following to the ssh command:

-L 7778:localhost:7778
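
For example, the complete SSH command might look like the following, where the username and host are placeholders to be replaced with actual values:

ssh -L 7778:localhost:7778 <user>@<host_ip_or_hostname>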

Set up a workspace and clone the GenAIExamples GitHub repo.

export WORKSPACE=<Path>
cd $WORKSPACE
git clone https://github.com/opea-project/GenAIExamples.git

Optional: It is recommended to use a stable release version by setting RELEASE_VERSION to a version number only (e.g. 1.0, 1.1) and checking out that version using its tag. Otherwise, the main branch with the latest updates will be used by default.

export RELEASE_VERSION=<Release_Version> #  Set desired release version - number only
cd GenAIExamples
git checkout tags/v${RELEASE_VERSION}
cd ..
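
Optionally, confirm which tag is checked out:

git -C GenAIExamples describe --tags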

Set up a HuggingFace account and generate a user access token. The Qwen2.5-Coder-32B-Instruct model does not need special access, but the token can be used with other models requiring access.

Set the HUGGINGFACEHUB_API_TOKEN environment variable to the value of the Hugging Face token by executing the following command:

export HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
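
Optionally, validate the token before continuing. One hedged way, assuming outbound network access to huggingface.co, is to query the Hugging Face whoami endpoint; a valid token returns the account details as JSON:

curl -s -H "Authorization: Bearer ${HUGGINGFACEHUB_API_TOKEN}" https://huggingface.co/api/whoami-v2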

host_ip is not required to be set manually. It will be set in the set_env.sh script later.
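
For reference only, set_env.sh typically derives host_ip from the primary network interface with a command similar to the following; running it manually is not required:

export host_ip=$(hostname -I | awk '{print $1}')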

For machines behind a firewall, set up the proxy environment variables:

export no_proxy=${your_no_proxy},$host_ip
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}

Use Case Setup

CodeGen will utilize the following GenAIComps services and associated tools. The tools and models listed in the table can be configured via environment variables in either the set_env.sh script or the compose.yaml file.

| Use Case Components | Tools     | Model                           | Service Type      |
|---------------------|-----------|---------------------------------|-------------------|
| LLM                 | vLLM, TGI | Qwen/Qwen2.5-Coder-32B-Instruct | OPEA Microservice |
| UI                  | NA        | NA                              | Gateway Service   |

Set the necessary environment variables to set up the use case. To swap out models, modify set_env.sh before running it. For example, the environment variable LLM_MODEL_ID can be changed to another model by specifying the HuggingFace model card ID.

To run the UI in a web browser on a laptop, modify BACKEND_SERVICE_ENDPOINT inside set_env.sh to use localhost or 127.0.0.1 instead of host_ip so that the backend properly receives data from the UI.
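
For example, assuming the /v1/codegen endpoint path used by the megaservice examples later in this guide, the modified line in set_env.sh may look similar to:

export BACKEND_SERVICE_ENDPOINT="http://localhost:7778/v1/codegen"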

Run the set_env.sh script.

cd $WORKSPACE/GenAIExamples/CodeGen/docker_compose
source ./set_env.sh
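
As a quick sanity check, print a few of the exported variables to confirm the script ran as expected:

echo $LLM_MODEL_ID
echo $host_ip
echo $BACKEND_SERVICE_ENDPOINT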

Deploy the Use Case

Navigate to the docker compose directory for this hardware platform.

cd $WORKSPACE/GenAIExamples/CodeGen/docker_compose/intel/hpu/gaudi

Run docker compose with the provided YAML file to start all the services mentioned above as containers. Either vLLM or TGI can serve the model for CodeGen; choose one of the profiles below.

# Deploy with vLLM as the model server:
docker compose --profile codegen-gaudi-vllm up -d

# Or deploy with TGI as the model server:
docker compose --profile codegen-gaudi-tgi up -d
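
On the first run, the model server downloads the model weights, which can take several minutes. Below is a minimal sketch for waiting until the model server reports healthy, assuming the vLLM profile was used (the container name vllm-server matches the sample output later in this guide; adjust it for the TGI profile):

until [ "$(docker inspect --format '{{.State.Health.Status}}' vllm-server 2>/dev/null)" = "healthy" ]; do
  echo "Waiting for the model server to become healthy..."
  sleep 10
done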

Check Env Variables

After running docker compose, check for warning messages about environment variables that are NOT set, and address them if needed.

WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.

Check that all the containers launched via docker compose are running, i.e. each container's STATUS is Up and, in some cases, Healthy.

Run the following command to see this information:

docker ps -a

Sample output:

CONTAINER ID   IMAGE                                                   COMMAND                  CREATED         STATUS                   PORTS                                                                                      NAMES
0040b340a392   opea/codegen-gradio-ui:latest                           "python codegen_ui_g…"   4 minutes ago   Up 3 minutes             0.0.0.0:5173->5173/tcp, [::]:5173->5173/tcp                                                codegen-gaudi-ui-server
3d2c7deacf5b   opea/codegen:latest                                     "python codegen.py"      4 minutes ago   Up 3 minutes             0.0.0.0:7778->7778/tcp, [::]:7778->7778/tcp                                                codegen-gaudi-backend-server
ad59907292fe   opea/dataprep:latest                                    "sh -c 'python $( [ "   4 minutes ago   Up 4 minutes (healthy)   0.0.0.0:6007->5000/tcp, [::]:6007->5000/tcp                                                dataprep-redis-server
2cb4e0a6562e   opea/retriever:latest                                   "python opea_retriev…"   4 minutes ago   Up 4 minutes             0.0.0.0:7000->7000/tcp, [::]:7000->7000/tcp                                                retriever-redis
f787f774890b   opea/llm-textgen:latest                                 "bash entrypoint.sh"     4 minutes ago   Up About a minute        0.0.0.0:9000->9000/tcp, [::]:9000->9000/tcp                                                llm-codegen-vllm-server
5880b86091a5   opea/embedding:latest                                   "sh -c 'python $( [ …"   4 minutes ago   Up 4 minutes             0.0.0.0:6000->6000/tcp, [::]:6000->6000/tcp                                                tei-embedding-server
cd16e3c72f17   opea/llm-textgen:latest                                 "bash entrypoint.sh"     4 minutes ago   Up 4 minutes                                                                                                        llm-textgen-server
cd412bca7245   redis/redis-stack:7.2.0-v9                              "/entrypoint.sh"         4 minutes ago   Up 4 minutes             0.0.0.0:6379->6379/tcp, [::]:6379->6379/tcp, 0.0.0.0:8001->8001/tcp, [::]:8001->8001/tcp   redis-vector-db
8d4e77afc067   opea/vllm:latest                                        "python3 -m vllm.ent…"   4 minutes ago   Up 4 minutes (healthy)   0.0.0.0:8028->80/tcp, [::]:8028->80/tcp                                                    vllm-server
f7c1cb49b96b   ghcr.io/huggingface/text-embeddings-inference:cpu-1.5   "/bin/sh -c 'apt-get…"   4 minutes ago   Up 4 minutes (healthy)   0.0.0.0:8090->80/tcp, [::]:8090->80/tcp                                                    tei-embedding-serving

Each docker container’s log can also be checked using:

docker logs <CONTAINER_ID OR CONTAINER_NAME>
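
For example, to watch the model server logs while the model is loading (vllm-server is the container name from the sample output above; substitute the TGI container name if that profile was deployed):

docker logs --tail 50 -f vllm-server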

Validate Microservices

This section describes the various methods for interacting with the deployed microservices.

vLLM or TGI Service

curl http://${host_ip}:8028/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}], "max_tokens":32}'

Here is sample output:

{"generated_text":"\n\nIO iflow diagram:\n\n!\[IO flow diagram(s)\]\(TodoList.iflow.svg\)\n\n### TDD Kata walkthrough\n\n1. Start with a user story. We will add story tests later. In this case, we'll choose a story about adding a TODO:\n    ```ruby\n    as a user,\n    i want to add a todo,\n    so that i can get a todo list.\n\n    conformance:\n    - a new todo is added to the list\n    - if the todo text is empty, raise an exception\n    ```\n\n1. Write the first test:\n    ```ruby\n    feature Testing the addition of a todo to the list\n\n    given a todo list empty list\n    when a user adds a todo\n    the todo should be added to the list\n\n    inputs:\n    when_values: [[\"A\"]]\n\n    output validations:\n    - todo_list contains { text:\"A\" }\n    ```\n\n1. Write the first step implementation in any programming language you like. In this case, we will choose Ruby:\n    ```ruby\n    def add_"}

LLM Microservice

curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"query":"Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception.","max_tokens":256,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"stream":true}'

The output code is printed one character at a time. It is too long to show here, but the last item will be:

data: [DONE]
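
Each streamed line is prefixed with data:. As a minimal post-processing sketch, assuming sed is available on the host and using an arbitrary short prompt, the prefixes and the final [DONE] marker can be stripped like this:

curl -sN http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"query":"Write a Python function that reverses a string.","max_tokens":128,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"stream":true}' \
  | sed -e 's/^data: //' -e '/^\[DONE\]$/d'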

Dataprep Microservice

The following is a template only. Replace the filename placeholders with desired files.

curl http://${host_ip}:6007/v1/dataprep/ingest \
-X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./file1.pdf" \
-F "files=@./file2.txt" \
-F "index_name=my_API_document"

CodeGen Megaservice

Default:

curl http://${host_ip}:7778/v1/codegen -H "Content-Type: application/json" -d '{
     "messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."
     }'

The output code is printed one character at a time. It is too long to show here, but the last item will be:

data: [DONE]

The CodeGen Megaservice can also be used with RAG and agents enabled. The index_name should match the one used when ingesting documents with the Dataprep microservice:

curl http://${host_ip}:7778/v1/codegen \
  -H "Content-Type: application/json" \
  -d '{"agents_flag": "True", "index_name": "my_API_document", "messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'

Launch UI

Gradio UI

To access the frontend, open the following URL in a web browser: http://${host_ip}:5173. By default, the UI runs on port 5173 internally. A different host port can be used to access the frontend by modifying the port mapping in the compose.yaml file as shown below:

  codegen-gaudi-ui-server:
    image: ${REGISTRY:-opea}/codegen-gradio-ui:${TAG:-latest}
    ...
    ports:
      - "YOUR_HOST_PORT:5173" # Change YOUR_HOST_PORT to the desired port

After making this change, restart the containers for the change to take effect.
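
For example, if the vLLM profile was used for deployment, the UI container can be recreated with the new port mapping as follows (substitute the TGI profile name if that profile was used):

docker compose --profile codegen-gaudi-vllm up -d codegen-gaudi-ui-server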

Stop the Services

Navigate to the docker compose directory for this hardware platform.

cd $WORKSPACE/GenAIExamples/CodeGen/docker_compose/intel/hpu/gaudi

To stop and remove all the containers, use the command matching the profile used for deployment:

# If deployed with the vLLM profile:
docker compose --profile codegen-gaudi-vllm down

# If deployed with the TGI profile:
docker compose --profile codegen-gaudi-tgi down