# Single node on-prem deployment with TGI on Gaudi AI Accelerator

This section covers single-node on-prem deployment of the DocSum
example using OPEA components (GenAIComps) with the TGI serving framework. It shows how
to build an end-to-end DocSum solution with the Intel/neural-chat-7b-v3-3 model,
deployed on Intel® Gaudi AI Accelerators. To quickly learn about OPEA in just 5 minutes
and set up the required hardware and software, please follow the instructions in the
[Getting Started](../../../getting-started/README.md) section.

## Overview

The DocSum example uses an LLM and an ASR microservice. In this tutorial, we
will walk through how to enable it with OPEA GenAIComps and deploy it on
a single node.

The solution uses the Intel/neural-chat-7b-v3-3 model on the
Gaudi AI Accelerator. We will go through how to set up docker containers to start
the microservices and megaservice. The solution then takes a document (.txt, .doc, .pdf), audio, or
video file as input and generates a summary. It is deployed with a UI, with 3 options to
choose from:

1. Gradio-Based UI
2. Svelte-Based UI
3. React-Based UI

Use the Gradio UI if you will be working with multimedia documents, .doc, or .pdf files.
Below is the list of content we will be covering in this tutorial:

1. Prerequisites
2. Prepare (Building / Pulling) Docker images
3. Use case setup
4. Deploy the use case
5. Interacting with DocSum deployment

## Prerequisites

The first step is to clone the GenAIExamples and GenAIComps projects. GenAIComps provides
the fundamental components used to build the examples you find in
GenAIExamples and deploy them as microservices. Set an environment
variable for the desired release version with the **number only**
(e.g. 1.0, 1.1, etc.) and check out the tag for that version.

```bash
# Set workspace and navigate into it
export WORKSPACE=<path>
cd $WORKSPACE

# Set desired release version - number only
export RELEASE_VERSION=<insert-release-version>

# GenAIComps
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
git checkout tags/v${RELEASE_VERSION}
cd ..

# GenAIExamples
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples
git checkout tags/v${RELEASE_VERSION}
cd ..
```

The example requires the `host_ip` environment variable so that the microservices
are exposed on the correct host endpoints and ports. Set the `host_ip` env variable:
```bash
export host_ip=$(hostname -I | awk '{print $1}')
```

Make sure to set the proxy environment variables if you are behind a firewall.
```bash
export no_proxy=${your_no_proxy},$host_ip
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
```

## Prepare (Building / Pulling) Docker images

This step involves building/pulling the relevant docker
images with a step-by-step process, along with a sanity check at the end. For
DocSum, the following docker images are needed: the llm-docsum and whisper
microservice images, the DocSum megaservice image, and a UI image. The Gradio UI
is the default, while the Svelte and React UIs are optional. In total,
there are **4 required docker images** and 2 optional docker images.

### Build/Pull Microservice image

::::::{tab-set}

:::::{tab-item} Pull
:sync: Pull

If you decide to pull the docker images and not build them locally,
you can proceed to the next step, where all the necessary images will
be pulled from Docker Hub.
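
If you prefer to pre-pull the images manually instead of letting docker compose
pull them at deploy time, you can do so explicitly. This is only a sketch; the
repository names below mirror the images used later in this tutorial, and the exact
names on Docker Hub may differ slightly between releases.

```bash
# Pre-pull the DocSum images from Docker Hub (names assumed from this tutorial; adjust for your release)
docker pull opea/whisper:${RELEASE_VERSION}
docker pull opea/llm-docsum:${RELEASE_VERSION}
docker pull opea/docsum:${RELEASE_VERSION}
docker pull opea/docsum-gradio-ui:${RELEASE_VERSION}
```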

:::::
:::::{tab-item} Build
:sync: Build

Follow the steps below to build the docker images from within the `GenAIComps` folder.
**Note:** For RELEASE_VERSIONS older than 1.0, you will need to add a 'v' in front 
of ${RELEASE_VERSION} to reference the correct image on Docker Hub.

```bash
cd $WORKSPACE/GenAIComps
```

#### Build Whisper Service

```bash
docker build -t opea/whisper:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/src/integrations/dependency/whisper/Dockerfile .
```

### Build Mega Service images

The Megaservice is a pipeline that channels data through different
microservices, each performing a different task. The LLM and whisper microservices, and the
flow of data between them, are defined in the `docsum.py` file. You can also add or
remove microservices and customize the megaservice to suit your needs.

Build the megaservice image for this use case:

```bash
cd $WORKSPACE/GenAIExamples/DocSum
```

```bash
docker build -t opea/docsum:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
```

### Build the UI Image

There are 3 UI options. Below are instructions to build each.

*Gradio UI*

```bash
cd $WORKSPACE/GenAIExamples/DocSum/ui
docker build -t opea/docsum-gradio-ui:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f docker/Dockerfile.gradio .
```

*Svelte UI (Optional)*

```bash
cd $WORKSPACE/GenAIExamples/DocSum/ui
docker build -t opea/docsum-ui:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f docker/Dockerfile .
```

*React UI (Optional)*

Build this image if you want a React-based frontend.

```bash
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/docsum"
docker build -t opea/docsum-react-ui:${RELEASE_VERSION} --build-arg BACKEND_SERVICE_ENDPOINT=$BACKEND_SERVICE_ENDPOINT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy  -f ./docker/Dockerfile.react .
```

### Sanity Check
Before moving on to the next step, verify that the following docker images are present by running the `docker images` command (a quick check is shown below the list).
The tags correspond to the value you set for the `RELEASE_VERSION` environment variable.

* `opea/whisper:${RELEASE_VERSION}`
* `opea/docsum:${RELEASE_VERSION}`
* `opea/docsum-gradio-ui:${RELEASE_VERSION}`
* `opea/docsum-ui:${RELEASE_VERSION}` (optional)
* `opea/docsum-react-ui:${RELEASE_VERSION}` (optional)
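
A quick way to verify that the required images exist with the expected tag is to
filter the `docker images` output, for example:

```bash
# Each required image should be listed with ${RELEASE_VERSION} as its tag
docker images | grep 'opea/'
```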

:::::
::::::

## Use Case Setup

The use case will use the following combination of GenAIComps and tools.

|Use Case Components  | Tools        | Model                       | Service Type      |
|----------------     |--------------|-----------------------------|------------------ |
|LLM                  |   TGI        | Intel/neural-chat-7b-v3-3   | OPEA Microservice |
|ASR                  |   Whisper    | openai/whisper-small        | OPEA Microservice |
|UI                   |              | NA                          | Gateway Service   |

Tools and models mentioned in the table are configurable either through the
environment variables or `compose.yaml` file.

Set the necessary environment variables to set up the use case by running the `set_env.sh` script.
This is where the environment variable `LLM_MODEL_ID` is set, and you can change it to another model
by specifying the HuggingFace model card ID.

**Note:** If you wish to run the UI on a web browser on your laptop, you will need to modify `BACKEND_SERVICE_ENDPOINT` to use `localhost` or `127.0.0.1` instead of `host_ip` inside `set_env.sh` for the backend to properly receive data from the UI. Additionally, you will need to port-forward the port used for `BACKEND_SERVICE_ENDPOINT`. Specifically, for DocSum, append the following to your ssh command: 

```bash
-L 8888:localhost:8888
```
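
For example, a complete SSH command could look like the following (the user and host
names are placeholders for your own environment):

```bash
# Forward local port 8888 to port 8888 on the remote machine running DocSum
ssh -L 8888:localhost:8888 user@gaudi-host
```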

Run the `set_env.sh` script.
```bash
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose
source ./set_env.sh
```
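
For example, to try a different model you could override `LLM_MODEL_ID` after sourcing
the script. The model ID below is only an illustration; any HuggingFace model card ID
supported by TGI on Gaudi can be used:

```bash
# Hypothetical override: replace with the HuggingFace model card ID you want to serve
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
```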

## Deploy the Use Case

In this tutorial, we deploy via docker compose with the provided
YAML file. The docker compose command below starts all the
above-mentioned services as containers.

```bash
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi
docker compose -f compose.yaml up -d
```
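
The first startup can take several minutes because the model is downloaded and loaded
onto the Gaudi device. As a sketch, you can wait for the TGI container's health check
to pass before sending requests (the container name is taken from the `docker ps`
output shown below):

```bash
# Poll the TGI container until its Docker health check reports "healthy"
until [ "$(docker inspect -f '{{.State.Health.Status}}' tgi-gaudi-server 2>/dev/null)" = "healthy" ]; do
  echo "Waiting for tgi-gaudi-server to become healthy..."
  sleep 10
done
```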


### Checks to Ensure the Services are Running
#### Check Startup and Env Variables
Check the startup log by running `docker compose logs` to ensure there are no errors.
Warning messages are printed for any environment variables that are **NOT** set.

Here are some sample messages if proxy environment variables are not set:

    WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
    WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.

#### Check the Container Status
Check if all the containers launched via docker compose have started.

The DocSum example starts 5 docker containers. Check that these docker
containers are all running, i.e., the `STATUS` of all the containers is `Up`.
You can do this with the `docker ps -a` command.

```bash
CONTAINER ID   IMAGE                                                           COMMAND                  CREATED             STATUS                       PORTS                                       NAMES
8ec82528bcbb   opea/docsum-gradio-ui:${RELEASE_VERSION}                                    "python docsum_ui_gr…"   About an hour ago   Up About an hour             0.0.0.0:5173->5173/tcp, :::5173->5173/tcp   docsum-gaudi-ui-server
e22344ed80d5   opea/docsum:${RELEASE_VERSION}                                              "python docsum.py"       About an hour ago   Up About an hour             0.0.0.0:8888->8888/tcp, :::8888->8888/tcp   docsum-gaudi-backend-server
bbb3c05a2878   opea/llm-docsum:${RELEASE_VERSION}                                          "bash entrypoint.sh"     About an hour ago   Up About an hour             0.0.0.0:9000->9000/tcp, :::9000->9000/tcp   llm-docsum-gaudi-server
d20a8896d2a0   ghcr.io/huggingface/tgi-gaudi:2.3.1                             "text-generation-lau…"   About an hour ago   Up About an hour (healthy)   0.0.0.0:8008->80/tcp, :::8008->80/tcp       tgi-gaudi-server
8213029b6b26   opea/whisper:${RELEASE_VERSION}                                             "python whisper_serv…"   About an hour ago   Up About an hour             0.0.0.0:7066->7066/tcp, :::7066->7066/tcp   whisper-server
```

## Interacting with the DocSum Deployment

This section will walk you through the different ways to interact with
the microservices deployed. After a couple of minutes, rerun `docker ps -a` 
to ensure all the docker containers are still up and running. Then proceed 
to validate each microservice and megaservice. 
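
As a quick sketch, you can list just the container names and statuses using docker's
built-in formatting:

```bash
# Show only the names and statuses of the running containers
docker ps --format 'table {{.Names}}\t{{.Status}}'
```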

### TGI Service

```bash
curl http://${host_ip}:8008/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```

Here is the output:

```bash
{"generated_text":"\nDeep learning is a sub-discipline of machine learning. Machine learning is"}

```
### LLM Microservice

```bash
curl http://${host_ip}:9000/v1/docsum \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5."}' \
  -H 'Content-Type: application/json'
```

The output is the summary of the input given to this microservice.

### Whisper Microservice

```bash
curl http://${host_ip}:7066/v1/asr \
  -X POST \
  -d '{"audio":"UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}' \
  -H 'Content-Type: application/json'
```

Here is the output:
```bash
 {"asr_result":"you"}
```
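
The `audio` field is a base64-encoded audio file. As a sketch, you can encode a local
WAV file yourself; the file path is a placeholder and `base64 -w 0` assumes GNU coreutils:

```bash
# Encode a local audio file as a single-line base64 string and send it to the ASR service
AUDIO_B64=$(base64 -w 0 /path/to/sample.wav)
curl http://${host_ip}:7066/v1/asr \
  -X POST \
  -d "{\"audio\":\"${AUDIO_B64}\"}" \
  -H 'Content-Type: application/json'
```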

### MegaService

You can upload documents (.txt, .doc, .pdf), audio, and video to get a summary of the content.

::::::{tab-set}
:::::{tab-item} Text
:sync: Text

The megaservice accepts input files in .txt, .pdf, or .doc format, or plain text in the `messages` parameter.
```bash
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=text" \
  -F "messages=Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5." \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "stream=true"
```

The output will be the summarization of the text content. We can also upload files and modify the other parameters such as the streaming mode and language.

```bash
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=text" \
  -F "messages=" \
  -F "files=@/path to your file (.txt, .docx, .pdf)" \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "stream=true"
```
:::::

:::::{tab-item} Audio
:sync: Audio

Audio uploads are not supported through the curl command; use the UI to upload audio files. Alternatively, you can pass the base64-encoded string of the audio file as follows:

```bash
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=audio" \
  -F "messages=UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA" \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "stream=true"
```
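
As a sketch, the base64 payload for a real audio file can be generated in the shell
first (the file path is a placeholder and `base64 -w 0` assumes GNU coreutils):

```bash
# Generate the base64 string from a local audio file and pass it in the messages field
AUDIO_B64=$(base64 -w 0 /path/to/your/audio.wav)
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=audio" \
  -F "messages=${AUDIO_B64}" \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "stream=true"
```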
:::::

:::::{tab-item} Video
:sync: Video

Video uploads are not supported through the curl command; use the UI to upload video files. Alternatively, you can pass the base64-encoded string of the video file as the value of the `messages` parameter, as shown here:

```bash
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=video" \
  -F "messages=convert your video to base64 data type" \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "stream=true"
```
:::::

::::::

When dealing with content longer than the maximum input context of the model being used, you can use different summarization strategies: `auto`, `stuff`, `truncate`, `map_reduce`, or `refine`. Select the strategy that best fits based on factors such as the model's context size and the number of input tokens.

1. `auto`: Checks the input token length. If it exceeds `MAX_INPUT_TOKENS`, the `summary_type` is automatically set to `refine` mode; otherwise, it is set to `stuff` mode.

2. `stuff`: The LLM generates a summary based on the complete input text. In this case, carefully set `MAX_INPUT_TOKENS` and `MAX_TOTAL_TOKENS` according to your model and device memory; otherwise, the request may exceed the LLM context limit and raise an error.

3. `truncate`: Truncates the input text and keeps only the first chunk, whose length is min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS).

4. `map_reduce`: Splits the input into multiple chunks, maps each chunk to an individual summary, and then consolidates those summaries into a single global summary. `stream=True` is not allowed in this mode. The default chunk_size is min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS).

5. `refine`: Splits the input into multiple chunks, generates a summary for the first one, combines it with the second, and loops over every remaining chunk to produce the final summary. The default chunk_size is min(MAX_TOTAL_TOKENS - 2 * input.max_tokens - 128, MAX_INPUT_TOKENS).

Define the summarization strategy by providing one of the 5 values above for the `summary_type` parameter, as shown below:
```bash
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=text" \
  -F "messages=" \
  -F "max_tokens=32" \
  -F "files=@/path to your file (.txt, .docx, .pdf)" \
  -F "language=en" \
  -F "summary_type=One of the above 5 types"
```
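
For example, a request that summarizes inline text with the `refine` strategy could
look like the following (the message text is only an illustration):

```bash
# Summarize plain text using the refine strategy
curl http://${host_ip}:8888/v1/docsum \
  -H "Content-Type: multipart/form-data" \
  -F "type=text" \
  -F "messages=Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embeddings and sequence classification models." \
  -F "max_tokens=32" \
  -F "language=en" \
  -F "summary_type=refine"
```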

## Launch UI
### Gradio UI
To access the frontend, open the following URL in your browser: http://{host_ip}:5173. By default, the UI runs on port 5173 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
```yaml
  docsum-gaudi-ui-server:
    image: ${REGISTRY:-opea}/docsum-gradio-ui:${TAG:-latest}
    ...
    ports:
      - "5173:5173"
```
### Svelte UI (Optional)
To access the Svelte-based frontend, modify the UI service in the `compose.yaml` file. Replace `docsum-gradio-ui` service with the `docsum-ui` service as per the config below: 
```yaml
docsum-ui:
  image: ${REGISTRY:-opea}/docsum-ui:${TAG:-latest}
  container_name: docsum-gaudi-ui-server
  depends_on:
    - docsum-gaudi-backend-server
  ports:
    - "5173:5173"
  environment:
    - no_proxy=${no_proxy}
    - https_proxy=${https_proxy}
    - http_proxy=${http_proxy}
    - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
    - DOC_BASE_URL=${BACKEND_SERVICE_ENDPOINT}
  ipc: host
  restart: always
```
Open the following URL in your browser: http://{host_ip}:5173 to access the UI.

### React-Based UI (Optional)
To access the React-based frontend, modify the UI service in the `compose.yaml` file. Replace `docsum-gradio-ui` service with the `docsum-react-ui` service as per the config below:
```yaml
docsum-gaudi-react-ui-server:
  image: ${REGISTRY:-opea}/docsum-react-ui:${TAG:-latest}
  container_name: docsum-gaudi-react-ui-server
  depends_on:
    - docsum-gaudi-backend-server
  ports:
    - "5174:80"
  environment:
    - no_proxy=${no_proxy}
    - https_proxy=${https_proxy}
    - http_proxy=${http_proxy}
  ipc: host
  restart: always
```


Once the services are up, open the following URL in your browser: http://{host_ip}:5174. By default, the UI runs on port 80 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the `compose.yaml` file as shown below:
```yaml
  docsum-gaudi-react-ui-server:
    image: ${REGISTRY:-opea}/docsum-react-ui:${TAG:-latest}
    ...
    ports:
    - "80:80"
```

## Check Docker Container Logs

You can check the log of a container by running this command:

```bash
docker logs <CONTAINER ID> -t
```

You can also check the overall logs with the following command, where the
`compose.yaml` is the megaservice docker-compose configuration file.

Assuming you are still in the directory `$WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi`,
run the following command to check the logs:
```bash
docker compose -f compose.yaml logs
```
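
To follow the logs of a single service, pass the service name from `compose.yaml`; for
example, for the TGI service:

```bash
# Stream logs from only the TGI service defined in compose.yaml
docker compose -f compose.yaml logs -f tgi-gaudi-server
```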

View the docker compose parameters for the TGI service in `$WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi/compose.yaml`:

```yaml
tgi-gaudi-server:
    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
    container_name: tgi-gaudi-server
    ports:
      - ${LLM_ENDPOINT_PORT:-8008}:80
    volumes:
      - "${DATA_PATH:-data}:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
      USE_FLASH_ATTENTION: true
      FLASH_ATTENTION_RECOMPUTE: true
      host_ip: ${host_ip}
      LLM_ENDPOINT_PORT: ${LLM_ENDPOINT_PORT}
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://${host_ip}:${LLM_ENDPOINT_PORT}/health || exit 1"]
      interval: 10s
      timeout: 10s
      retries: 100
    command: --model-id ${LLM_MODEL_ID} --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
```

The `--model-id` argument is set to `${LLM_MODEL_ID}`. Ensure the environment variable `LLM_MODEL_ID`
is set and spelled correctly. Whenever this value is changed, restart the containers to use
the newly selected model.
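
As a sketch, switching models and recreating the affected containers could look like
the following (the model ID shown is only an illustration):

```bash
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi
# Hypothetical alternative model; use any HuggingFace model card ID supported by TGI on Gaudi
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"
docker compose -f compose.yaml up -d
```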


## Stop the services

Once you are done with the entire pipeline and wish to stop and remove all the containers, run the command below from the same directory used to deploy (`$WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi`):
```bash
docker compose down
```