Example SearchQnA deployments on AMD GPU (ROCm)

This document outlines the deployment process for a SearchQnA application utilizing the GenAIComps microservice pipeline on AMD GPU (ROCm).

This example includes the following sections:

  • SearchQnA Quick Start Deployment

  • SearchQnA Docker Compose Files

  • Launch the UI

SearchQnA Quick Start Deployment

This section describes how to quickly deploy and test the SearchQnA service manually on AMD GPU (ROCm). The basic steps are:

  1. Access the Code

  2. Generate a HuggingFace Access Token

  3. Configure the Deployment Environment

  4. Deploy the Services Using Docker Compose

  5. Check the Deployment Status

  6. Test the Pipeline

  7. Cleanup the Deployment

Access the Code

Clone the GenAIExamples repository and access the SearchQnA AMD GPU (ROCm) Docker Compose files and supporting scripts:

git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/SearchQnA/docker_compose/amd/gpu/rocm

Then check out a released version, such as v1.2:

git checkout v1.2

Generate a HuggingFace Access Token

Some HuggingFace resources require an access token. Developers can create one by first signing up on HuggingFace and then generating a user access token.
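
As an illustration (the exact variable name consumed by the environment scripts is an assumption; check set_env.sh for the name it expects), the token can be exported in the shell before sourcing the setup script:

# Hypothetical: export the HuggingFace token so the setup script can pick it up.
# The variable name is an assumption; verify it against set_env.sh.
export HUGGINGFACEHUB_API_TOKEN="<your-access-token>"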

Configure the Deployment Environment

To set up environment variables for deploying the SearchQnA services, source the appropriate script in this directory:

# with TGI:
source ./set_env.sh
# with vLLM:
source ./set_env_vllm.sh

The set_env.sh script will prompt for the required and optional environment variables used to configure the SearchQnA services based on TGI. The set_env_vllm.sh script will prompt for the same variables for the vLLM-based deployment. If a value is not entered, the script will use a default. It will also generate a .env file defining the desired configuration. Consult the section on SearchQnA service configuration for information on how service-specific configuration parameters affect deployments.
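
As a sanity check, the generated configuration can be reviewed before deployment; if a value needs changing, it can be exported in the shell before re-sourcing the script (whether the script honours pre-set values is an assumption, so verify against set_env.sh):

# Review the configuration written by the setup script.
cat .env

# Hypothetical override: pre-set a value before re-running the script.
# Whether set_env.sh preserves pre-set values is an assumption; check the script.
export host_ip=192.168.1.10
source ./set_env.sh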

Deploy the Services Using Docker Compose

To deploy the SearchQnA services, execute the docker compose up command with the appropriate arguments. For a default deployment, execute:

# with TGI:
docker compose -f compose.yaml up -d
# with vLLM:
docker compose -f compose_vllm.yaml up -d

Note: developers should build the Docker images from source when:

  • Developing off the git main branch (since the container ports in the repo may differ from those in the published Docker images).

  • Unable to download the published Docker images.

  • A specific version of a Docker image is needed.

Please refer to the table below to build different microservices from source:

| Microservice  | Deployment Guide          |
| ------------- | ------------------------- |
| Reranking     | Reranking build guide     |
| vLLM          | vLLM build guide          |
| LLM-TextGen   | LLM-TextGen build guide   |
| Web-Retriever | Web-Retriever build guide |
| Embedding     | Embedding build guide     |
| MegaService   | MegaService build guide   |
| UI            | Basic UI build guide      |
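
As a rough illustration only (the repository layout, Dockerfile path, and image tag below are assumptions and change between releases; the linked build guides are authoritative), building one of the component images from source generally follows this pattern:

# Hypothetical sketch: build a component image from the GenAIComps sources.
# The Dockerfile path and image tag are assumptions; check the build guide for
# the actual values expected by the compose file.
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
docker build -t opea/embedding:latest -f comps/embeddings/src/Dockerfile .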

Check the Deployment Status

After running Docker Compose, the list of containers can be checked using the following command:

docker ps -a
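
For a more focused view of just the container names, status, and ports (standard docker flags; the exact container names depend on the compose file used):

# Show only container names, status, and exposed ports for a quicker health check.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"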

For the default deployment, the following containers should have started:

Test the Pipeline

Once the SearchQnA services are running, test the pipeline using the following command:

    DATA='{"messages": "What is the latest news from the AI world? '\
    'Give me a summary.","stream": "True"}'

    curl http://${host_ip}:3008/v1/searchqna \
    -H "Content-Type: application/json" \
    -d "$DATA"

Note: The value of host_ip was set by the set_env.sh script and can be found in the .env file.
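
For example (assuming the variable appears under that name in the generated file), it can be looked up with:

# Look up the host IP recorded by the setup script (variable name is an assumption).
grep -i host_ip .env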

Check the response from the service. It should be a stream of JSON lines similar to:

data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":",","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" with","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" calls","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" for","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" more","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" regulation","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" and","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":" trans","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":"parency","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":null,"index":0,"logprobs":null,"text":".","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: {"id":"cmpl-f095893d094a4e9989423c2364f00bc1","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"text":"","stop_reason":null}],"created":1742960360,"model":"Intel/neural-chat-7b-v3-3","object":"text_completion","system_fingerprint":null,"usage":null}
data: [DONE]

A response text similar to the one above indicates that the service verification was successful.
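
To read the streamed tokens as continuous text rather than raw SSE lines, the data: payloads can be post-processed, for example with sed and jq (a convenience sketch, not part of the example itself; it assumes the response format shown above):

# Strip the SSE "data: " prefix, drop the [DONE] sentinel, and join the token texts.
curl -sN http://${host_ip}:3008/v1/searchqna \
  -H "Content-Type: application/json" \
  -d "$DATA" \
  | sed -n 's/^data: //p' \
  | grep -v '^\[DONE\]$' \
  | jq -r '.choices[0].text' \
  | tr -d '\n'; echo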

Cleanup the Deployment

To stop the containers associated with the deployment, execute the following command:

# with TGI:
docker compose -f compose.yaml down
# with vLLM:
docker compose -f compose_vllm.yaml down

All SearchQnA containers will be stopped and removed when the down command completes.
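
If any named volumes created by the deployment should also be removed (standard Docker Compose behaviour; whether this deployment creates named volumes depends on the compose file), the -v flag can be added:

# Also remove volumes created by the deployment (optional).
# Use compose_vllm.yaml instead for the vLLM deployment.
docker compose -f compose.yaml down -v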

SearchQnA Docker Compose Files

When deploying the SearchQnA pipeline on AMD GPUs (ROCm), different large language model serving frameworks can be selected. The table below outlines the available configurations included in the application.

| File              | Description |
| ----------------- | ----------- |
| compose.yaml      | Default compose file using TGI as the serving framework |
| compose_vllm.yaml | The LLM serving framework is vLLM. All other configurations remain the same as the default |
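
To see which services a given compose file will launch before deploying it (a standard Docker Compose command; the service names come from the compose file itself):

# List the service names defined in the vLLM variant of the deployment.
docker compose -f compose_vllm.yaml config --services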

Launch the UI

Access the UI at http://${EXTERNAL_HOST_IP}:${SEARCH_FRONTEND_SERVICE_PORT}. A page should open when navigating to this address.

(Screenshot: UI start page)

The appearance of such a page indicates that the service is operational and responsive, allowing functional UI testing to proceed.
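
As a quick check from the terminal that the frontend is being served (the port variable comes from the generated .env; an HTTP 200 response indicates the UI is up):

# Request only the response headers from the UI frontend.
curl -I http://${EXTERNAL_HOST_IP}:${SEARCH_FRONTEND_SERVICE_PORT}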

Enter a task for the service in the “Enter prompt here” field, for example “What is Deep Learning?”, and press Enter. A page with the result of the task should then open:

(Screenshot: UI result page)

A correct result displayed on the page indicates that the UI service has been successfully verified.