Single node on-prem deployment with TGI on Gaudi AI Accelerator

This section covers a single-node, on-premises deployment of the AudioQnA example built from OPEA GenAIComps and served with TGI. The solution demonstrates building a voice chat service using TGI deployed on the Intel® Gaudi® AI Accelerator. To quickly learn about OPEA in just 5 minutes and set up the required hardware and software, please follow the instructions in the Getting Started section.

Overview

There are several ways to set up an AudioQnA use case. In this tutorial, we will walk through how to enable the following microservices from OPEA GenAIComps to deploy a single-node TGI megaservice solution.

  1. Automatic Speech Recognition (ASR) Service

  2. Large Language Models (LLM) Service

  3. Text-to-Speech (TTS) Service

The solution aims to show how to use ASR, TGI, and TTS on the Gaudi AI Accelerator. We will go through how to set up Docker containers to start the microservices and the megaservice. The solution will then use a sample audio file in WAV format.
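If you later want to test with your own recording, convert it to WAV first. A minimal sketch, assuming ffmpeg is installed; the file names are placeholders:

# Convert a recording to 16 kHz mono WAV (typical input format for speech models)
ffmpeg -i my_question.mp3 -ar 16000 -ac 1 my_question.wav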

Prerequisites

The first step is to clone the GenAIExamples and GenAIComps projects. GenAIComps provides the fundamental components used to build the examples found in GenAIExamples and deploy them as microservices. Set an environment variable for the desired release version with the number only (i.e. 1.0, 1.1, etc.) and check out using the tag with that version.

# Set workspace
export WORKSPACE=<path>
cd $WORKSPACE

# Set desired release version - number only
export RELEASE_VERSION=<insert-release-version>

# GenAIComps
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
git checkout tags/v${RELEASE_VERSION}
cd ..

# GenAIExamples
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples
git checkout tags/v${RELEASE_VERSION}
cd ..

The example requires you to set the following environment variables so the microservices are deployed on endpoints with the correct host IP and ports.

export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN=<your HF token>

export LLM_MODEL_ID=Intel/neural-chat-7b-v3-3

export MEGA_SERVICE_HOST_IP=${host_ip}
export WHISPER_SERVER_HOST_IP=${host_ip}
export SPEECHT5_SERVER_HOST_IP=${host_ip}
export LLM_SERVER_HOST_IP=${host_ip}
export GPT_SOVITS_SERVER_HOST_IP=${host_ip}

export WHISPER_SERVER_PORT=7066
export SPEECHT5_SERVER_PORT=7055
export GPT_SOVITS_SERVER_PORT=9880
export LLM_SERVER_PORT=3006

export BACKEND_SERVICE_ENDPOINT=http://${host_ip}:3008/v1/audioqna
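Optionally, confirm the variables are set before moving on by inspecting the environment:

# Print the AudioQnA-related variables exported above
env | grep -E 'host_ip|HOST_IP|SERVER_PORT|SERVICE_ENDPOINT|LLM_MODEL_ID'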

Make sure to set up proxies if you are behind a firewall:

export no_proxy=${your_no_proxy},$host_ip
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}

Prepare (Building / Pulling) Docker images

This step involves either building or pulling three required Docker images. Each image serves a specific purpose in the AudioQnA architecture.

If you decide to pull the Docker images rather than build them locally, you can proceed to the deployment step, where all the necessary images will be pulled from Docker Hub.
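If you prefer to pre-pull the images explicitly, the sketch below assumes they are published under the opea namespace on Docker Hub with the same names and version tags used in this tutorial:

# Optional: pre-pull images instead of building (names/tags assumed to match this tutorial)
docker pull opea/whisper-gaudi:${RELEASE_VERSION}
docker pull opea/speecht5-gaudi:${RELEASE_VERSION}
docker pull opea/audioqna:${RELEASE_VERSION}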

Follow the steps below to build the docker images from within the GenAIComps folder. Note: For RELEASE_VERSIONS older than 1.0, you will need to add a ‘v’ in front of ${RELEASE_VERSION} to reference the correct image on Docker Hub.

cd $WORKSPACE/GenAIComps

Build ASR Image

First, build the Automatic Speech Recognition service image:

docker build -t opea/whisper-gaudi:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/src/integrations/dependency/whisper/Dockerfile.intel_hpu .

Build LLM Image

A Gaudi AI Accelerator optimized image published by Hugging Face will be used for the TGI service, so there is no LLM image to build: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-gaudi
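Optionally, pull this image ahead of time so the first docker compose up does not have to download it:

docker pull ghcr.io/huggingface/text-generation-inference:2.4.0-intel-gaudi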

Build TTS Image

Build the Text-to-Speech service image:

docker build -t opea/speecht5-gaudi:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/tts/src/integrations/dependency/speecht5/Dockerfile.intel_hpu .

Build MegaService Image

The Megaservice is a pipeline that channels data through different microservices, each performing varied tasks. We define the different microservices and the flow of data between them in the audioqna.py file.

Build the megaservice image for this use case.

cd $WORKSPACE/GenAIExamples/AudioQnA/
docker build --no-cache -t opea/audioqna:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .

Sanity Check

Before proceeding, verify that you have all required Docker images by running docker images. You should see the following images:

  • opea/whisper-gaudi:${RELEASE_VERSION}

  • opea/speecht5-gaudi:${RELEASE_VERSION}

  • opea/audioqna:${RELEASE_VERSION}
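For example, you can filter the output of docker images to the images used in this tutorial:

docker images | grep -E 'opea/(whisper|speecht5|audioqna)'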

Use Case Setup

The use case uses the following combination of GenAIComps microservices and tools.

| Use Case Components | Tools    | Model                     | Service Type      |
|---------------------|----------|---------------------------|-------------------|
| LLM                 | TGI      | Intel/neural-chat-7b-v3-3 | OPEA Microservice |
| ASR                 | Whisper  | NA                        | OPEA Microservice |
| TTS                 | SpeechT5 | NA                        | OPEA Microservice |

Tools and models mentioned in the table are configurable either through environment variables or the compose.yaml file.

Set the necessary environment variables to set up the use case by running the set_env.sh script. Here is where the environment variable LLM_MODEL_ID is set, and you can change it to another model by specifying the Hugging Face model card ID.

Run the set_env.sh script.

cd $WORKSPACE/GenAIExamples/AudioQnA/docker_compose
source ./set_env.sh
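For example, to try a different model, export LLM_MODEL_ID after sourcing the script. The model ID below is only an illustration; any Hugging Face model card ID supported by TGI on Gaudi should work:

# Override the default LLM (placeholder model ID)
export LLM_MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct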

Deploy the use case

In this tutorial, we will deploy via docker compose with the provided YAML file. Running docker compose should start all the above-mentioned services as containers.

export MODEL_CACHE=./data

cd $WORKSPACE/GenAIExamples/AudioQnA/docker_compose/intel/hpu/gaudi/
docker compose up -d

# multilang tts (optional)
docker compose -f compose_multilang.yaml up -d

Note: add the following environment variables in the compose YAML file if you meet issues downloading models:

HF_ENDPOINT: https://hf-mirror.com
HF_HUB_ENABLE_HF_TRANSFER: false
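The first startup can take several minutes while models are downloaded and warmed up. One way to wait for the TGI service to become ready is to poll its health endpoint (a sketch, assuming TGI's standard /health route on the mapped port 3006):

# Poll TGI until it reports healthy (Ctrl+C to stop early)
until curl -sf http://${host_ip}:3006/health; do
  echo "waiting for tgi-service..."
  sleep 10
done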

Validate microservice

Check Env Variables

Check the startup log with docker compose -f ./compose.yaml logs. Warning messages will print out any variables that are NOT set.

GenAIExamples/AudioQnA/docker_compose/intel/hpu/gaudi/$ sudo -E docker compose -f ./compose.yaml logs
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string. 
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string. 
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string. 
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string. 
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.

Check the container status

Check that all the containers launched via docker compose have started. For example, the AudioQnA example starts 5 Docker containers (services); check that these containers are all running, i.e., every container's STATUS is Up.

To do a quick sanity check, try docker ps -a to see if all the containers are running.

| CONTAINER ID | IMAGE                                                             | COMMAND                   | CREATED         | STATUS                            | PORTS                                      | NAMES                           |
|--------------|-------------------------------------------------------------------|---------------------------|----------------|------------------------------------|---------------------------------------------|---------------------------------|
| 83c27b0968f3 | opea/audioqna-ui:${RELEASE_VERSION}                               | `"docker-entrypoint.s…"`  | 37 minutes ago | Up 37 minutes                      | 0.0.0.0:5173->5173/tcp, [::]:5173->5173/tcp | audioqna-gaudi-ui-server        |
| 0bb32d7b586f | opea/audioqna:${RELEASE_VERSION}                                  | `"python audioqna.py"`    | 37 minutes ago | Up 37 minutes                      | 0.0.0.0:3008->8888/tcp, [::]:3008->8888/tcp | audioqna-gaudi-backend-server   |
| 5eab9c5d7199 | ghcr.io/huggingface/tgi-gaudi                                     | `"text-generation-lau…"`  | 37 minutes ago | Up 37 minutes (unhealthy)          | 0.0.0.0:3006->80/tcp, [::]:3006->80/tcp     | tgi-service                     |
| bbde822725d4 | opea/whisper-gaudi:${RELEASE_VERSION}                             | `"python whisper_serv…"`  | 37 minutes ago | Up 37 minutes                      | 0.0.0.0:7066->7066/tcp, [::]:7066->7066/tcp | whisper-service                 |
| 1290ccd09182 | opea/speecht5-gaudi:${RELEASE_VERSION}                            | `"python speecht5_ser…"`  | 37 minutes ago | Up 37 minutes                      | 0.0.0.0:7055->7055/tcp, [::]:7055->7055/tcp | speecht5-service                |

Interacting with AudioQnA deployment

In this section, you will walk through the different ways to interact with the deployed microservices.

Whisper Service

# whisper service
wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav
curl http://${host_ip}:7066/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F file="@./sample.wav" \
  -F model="openai/whisper-small"

Whisper service generates text for the input audio file. Here is the expected result from Whisper:

{"text":"who is pat gelsinger"}

TGI Service

# tgi service
curl http://${host_ip}:3006/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'

TGI service handles the core language model operations. Here is the expected result from TGI:

{"generated_text":"\n\nDeep learning is a subset of machine learning and broadly defined as techniques to"}

Speecht5 Service

# speecht5 service
curl http://${host_ip}:7055/v1/audio/speech -XPOST -d '{"input": "Who are you?"}' -H 'Content-Type: application/json' --output speech.mp3

The SpeechT5 service generates an audio file from the given sentence. The expected output is an audio file that says “Who are you?”.

MegaService

The AudioQnA megaservice orchestrates the entire conversation process. Test it with an empty audio input:

# if you are using speecht5 as the tts service, voice can be "default" or "male"
# if you are using gpt-sovits for the tts service, you can set the reference audio following https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/integrations/dependency/gpt-sovits/README.md
curl http://${host_ip}:3008/v1/audioqna \
  -X POST \
  -d '{"audio": "UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA", "max_tokens":64, "voice":"default"}' \
  -H 'Content-Type: application/json' | sed 's/^"//;s/"$//' | base64 -d > output.wav

The expected output is a meaningful audio file.
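To send your own question instead, base64-encode a WAV recording and substitute it into the same request format. A sketch using the sample.wav downloaded earlier:

# Encode the sample audio and send it through the full pipeline
base64_audio=$(base64 -w 0 sample.wav)
curl http://${host_ip}:3008/v1/audioqna \
  -X POST \
  -d "{\"audio\": \"${base64_audio}\", \"max_tokens\":64, \"voice\":\"default\"}" \
  -H 'Content-Type: application/json' | sed 's/^"//;s/"$//' | base64 -d > answer.wav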

Check the docker container logs

Following is an example of debugging using Docker logs:

Check the log of the container using:

docker logs <CONTAINER ID> -t
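You can also reference a container by name, as shown in the docker ps output above. For example, to follow the last lines of the TGI logs (stop with Ctrl+C):

docker logs tgi-service -t --tail 50 -f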

View the docker input parameters in $WORKSPACE/GenAIExamples/AudioQnA/docker_compose/intel/hpu/gaudi/compose.yaml

Stop the services

Once you are done with the entire pipeline and wish to stop and remove all the containers, use the command below:

docker compose -f compose.yaml down
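If you started the optional multilingual TTS variant, bring it down with its own compose file:

docker compose -f compose_multilang.yaml down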