Single node on-prem deployment with TGI on Intel® Xeon® Scalable processor¶
This section covers a single-node on-prem deployment of the DocSum example using OPEA components (GenAIComps) with the TGI service. We will showcase how to build an end-to-end DocSum solution with the Intel/neural-chat-7b-v3-3 model, deployed on Intel® Xeon® Scalable processors. To quickly learn about OPEA in just 5 minutes and set up the required hardware and software, please follow the instructions in the Getting Started section.
Overview¶
The DocSum use case uses the LLM and ASR microservices. In this tutorial, we will walk through the steps to enable it with components from OPEA GenAIComps and deploy it on a single node.
The solution aims to show how to use the Intel/neural-chat-7b-v3-3 model on Intel® Xeon® Scalable processors. We will go through how to set up docker containers to start the microservices and megaservice. The solution then takes a document (.txt, .doc, .pdf), audio, or video file as input and generates a summary. It is deployed with a UI with 3 modes to choose from:
Gradio-Based UI
Svelte-Based UI
React-Based UI
If you need to work with multimedia files or .doc/.pdf documents, it is suggested that you use the Gradio UI.
Below is the list of topics covered in this tutorial:
Prerequisites
Prepare (Building / Pulling) Docker images
Use case setup
Deploy the use case
Interacting with DocSum deployment
Prerequisites¶
The first step is to clone the GenAIExamples and GenAIComps projects. GenAIComps provides the fundamental components used to build the examples found in GenAIExamples and deploy them as microservices. Set an environment variable for the desired release version (number only, e.g., 1.0, 1.1) and check out the tag with that version.
# Set workspace and navigate into it
export WORKSPACE=<path>
cd $WORKSPACE
# Set desired release version - number only
export RELEASE_VERSION=<insert-release-version>
# GenAIComps
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
git checkout tags/v${RELEASE_VERSION}
cd ..
# GenAIExamples
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples
git checkout tags/v${RELEASE_VERSION}
cd ..
The example requires you to set the host_ip environment variable so the microservices can be reached on the ports exposed by the host. Set the host_ip env variable:
export host_ip=$(hostname -I | awk '{print $1}')
Make sure to set up proxies if you are behind a firewall:
export no_proxy=${your_no_proxy},$host_ip
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
Prepare (Building / Pulling) Docker images¶
This step involves building/pulling the relevant docker images step by step, with a sanity check at the end. For DocSum, the following docker images are needed: llm-docsum and whisper. Additionally, you will need to build docker images for the DocSum megaservice and the UI (the Svelte and React UIs are optional). In total, there are 4 required docker images and 2 optional docker images.
Build/Pull Microservice image¶
If you decide to pull the docker images instead of building them locally, you can proceed to the next step; all the necessary images will be pulled from Docker Hub.
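For reference, here is a minimal sketch of pulling prebuilt images from Docker Hub (illustrative; it assumes the images are published under the opea organization with tags matching RELEASE_VERSION):
# Pull prebuilt DocSum images (illustrative tags)
docker pull opea/whisper:${RELEASE_VERSION}
docker pull opea/llm-docsum:${RELEASE_VERSION}
docker pull opea/docsum:${RELEASE_VERSION}
docker pull opea/docsum-gradio-ui:${RELEASE_VERSION}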
Follow the steps below to build the docker images from within the GenAIComps
folder.
Note: For RELEASE_VERSIONs older than 1.0, you will need to add a 'v' in front of ${RELEASE_VERSION} to reference the correct image on Docker Hub.
cd $WORKSPACE/GenAIComps
Build Whisper Service
docker build -t opea/whisper:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/asr/src/integrations/dependency/whisper/Dockerfile .
Build Mega Service images
The megaservice is a pipeline that channels data through different microservices, each performing a different task. The LLM and Whisper microservices, and the flow of data between them, are defined in the docsum.py file. You can also add or remove microservices and customize the megaservice to suit your needs.
Build the megaservice image for this use case.
cd $WORKSPACE/GenAIExamples/DocSum
docker build -t opea/docsum:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
Build the UI Image
You can build the UI in one of 3 modes:
Gradio UI
cd $WORKSPACE/GenAIExamples/DocSum/ui
docker build -t opea/docsum-gradio-ui:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f docker/Dockerfile.gradio .
Svelte UI (Optional)
cd $WORKSPACE/GenAIExamples/DocSum/ui
docker build -t opea/docsum-ui:${RELEASE_VERSION} --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f docker/Dockerfile .
React UI (Optional): build this if you want a React-based frontend.
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/docsum"
docker build -t opea/docsum-react-ui:${RELEASE_VERSION} --build-arg BACKEND_SERVICE_ENDPOINT=$BACKEND_SERVICE_ENDPOINT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile.react .
Sanity Check
Before moving on to the next step, check that you have the following set of docker images by running the command docker images. The tags are based on what you set the environment variable RELEASE_VERSION to.
opea/whisper:${RELEASE_VERSION}
opea/docsum:${RELEASE_VERSION}
opea/docsum-gradio-ui:${RELEASE_VERSION}
opea/docsum-ui:${RELEASE_VERSION} (optional)
opea/docsum-react-ui:${RELEASE_VERSION} (optional)
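As a quick check, you can filter the output for the DocSum-related images (a minimal sketch; the grep pattern is illustrative):
# List only the DocSum-related images (illustrative filter)
docker images | grep -E "whisper|docsum"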
Use Case Setup¶
The use case will use the following combination of GenAIComps and tools.
| Use Case Components | Tools | Model | Service Type |
|---|---|---|---|
| LLM | TGI | Intel/neural-chat-7b-v3-3 | OPEA Microservice |
| ASR | Whisper | openai/whisper-small | OPEA Microservice |
| UI | NA | NA | Gateway Service |
Tools and models mentioned in the table are configurable either through environment variables or the compose.yaml file.
Set the necessary environment variables for the use case by running the set_env.sh script. This is where the environment variable LLM_MODEL_ID is set, and you can change it to another model by specifying the HuggingFace model card ID.
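For example, to use a different model you can override LLM_MODEL_ID with another HuggingFace model card ID (a sketch; the model shown is only an illustration, and depending on how set_env.sh assigns the variable you may need to export it after sourcing the script or edit the script directly):
# Illustrative override; the model must be supported by TGI
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.2"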
Note: If you wish to run the UI in a web browser on your laptop, you will need to modify BACKEND_SERVICE_ENDPOINT to use localhost or 127.0.0.1 instead of host_ip inside set_env.sh for the backend to properly receive data from the UI. Additionally, you will need to port-forward the port used for BACKEND_SERVICE_ENDPOINT. Specifically, for DocSum, append the following to your ssh command:
-L 8888:localhost:8888
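For example, a complete ssh command with the port forward might look like this (the user and host names are placeholders):
# Forward the DocSum backend port while connecting to the remote machine (placeholder user/host)
ssh -L 8888:localhost:8888 user@remote-host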
Run the set_env.sh script:
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose
source ./set_env.sh
Deploy the Use Case¶
In this tutorial, we will deploy via docker compose using the provided YAML file. Docker compose will start all the above-mentioned services as containers.
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/cpu/xeon
docker compose up -d
Checks to Ensure the Services are Running¶
Check Startup and Env Variables¶
Check the startup log by running docker compose logs to ensure there are no errors. Warning messages are printed for variables that are NOT set. Here are some sample messages if the proxy environment variables are not set:
WARN[0000] The "no_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "https_proxy" variable is not set. Defaulting to a blank string.
WARN[0000] The "http_proxy" variable is not set. Defaulting to a blank string.
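To quickly scan the startup logs for errors, you can filter the output (a sketch; the grep pattern is illustrative):
# Scan startup logs for errors (illustrative filter)
docker compose logs | grep -iE "error|fail"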
Check the Container Status¶
Check if all the containers launched via docker compose have started. The DocSum example starts 5 docker containers. Check that these containers are all running, i.e., the STATUS of each container is Up. You can do this with the docker ps -a command.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8ec82528bcbb opea/docsum-gradio-ui:latest "python docsum_ui_gr…" About an hour ago Up About an hour 0.0.0.0:5173->5173/tcp, :::5173->5173/tcp docsum-xeon-ui-server
e22344ed80d5 opea/docsum:latest "python docsum.py" About an hour ago Up About an hour 0.0.0.0:8888->8888/tcp, :::8888->8888/tcp docsum-xeon-backend-server
bbb3c05a2878 opea/llm-docsum:latest "bash entrypoint.sh" About an hour ago Up About an hour 0.0.0.0:9000->9000/tcp, :::9000->9000/tcp llm-docsum-server
d20a8896d2a0 ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu "text-generation-lau…" About an hour ago Up About an hour (healthy) 0.0.0.0:8008->80/tcp, :::8008->80/tcp tgi-server
8213029b6b26 opea/whisper:latest "python whisper_serv…" About an hour ago Up About an hour 0.0.0.0:7066->7066/tcp, :::7066->7066/tcp whisper-server
Interacting with DocSum Deployment¶
This section will walk you through the different ways to interact with
the microservices deployed. After a couple of minutes, rerun docker ps -a
to ensure all the docker containers are still up and running. Then proceed
to validate each microservice and megaservice.
TGI Service¶
curl http://${host_ip}:8008/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
-H 'Content-Type: application/json'
Here is the output:
{"generated_text":"\nDeep learning is a sub-discipline of machine learning. Machine learning is"}
LLM Microservice¶
curl http://${host_ip}:9000/v1/docsum \
-X POST \
-d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5."}' \
-H 'Content-Type: application/json'
The output is the summary of the input given to this microservice.
Whisper Microservice¶
curl http://${host_ip}:7066/v1/asr \
-X POST \
-d '{"audio":"UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}' \
-H 'Content-Type: application/json'
Here is the output:
{"asr_result":"you"}
MegaService¶
You can upload documents (.txt, .doc, .pdf), audio, and video to get a summary of the content. The megaservice accepts input files in .txt, .pdf, or .doc format, or plain text passed in the messages parameter.
curl http://${host_ip}:8888/v1/docsum \
-H "Content-Type: multipart/form-data" \
-F "type=text" \
-F "messages=Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5." \
-F "max_tokens=32" \
-F "language=en" \
-F "stream=true"
The output will be the summarization of the text content. We can also upload files and modify other parameters such as the streaming mode and language.
curl http://${host_ip}:8888/v1/docsum \
-H "Content-Type: multipart/form-data" \
-F "type=text" \
-F "messages=" \
-F "files=@/path to your file (.txt, .docx, .pdf)" \
-F "max_tokens=32" \
-F "language=en" \
-F "stream=true"
Uploading audio files directly through the curl command is not supported; use the UI to upload them. Alternatively, you can pass a base64 string of the audio file as follows:
curl http://${host_ip}:8888/v1/docsum \
-H "Content-Type: multipart/form-data" \
-F "type=audio" \
-F "messages=UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA" \
-F "max_tokens=32" \
-F "language=en" \
-F "stream=true"
Uploading video files directly through the curl command is not supported; use the UI to upload them. Alternatively, you can pass a base64 string of the video file as the value for the messages parameter, as shown here:
curl http://${host_ip}:8888/v1/docsum \
-H "Content-Type: multipart/form-data" \
-F "type=video" \
-F "messages=convert your video to base64 data type" \
-F "max_tokens=32" \
-F "language=en" \
-F "stream=true"
When dealing with longer content to be summarized, we can use different summarization strategies: auto, stuff, truncate, map_reduce, or refine. Select the strategy that best fits based on factors such as the model's context size and the number of input tokens.
Auto: The input token length is checked; if it exceeds MAX_INPUT_TOKENS, summary_type is automatically set to refine mode, otherwise it is set to stuff mode.
Stuff: The LLM generates a summary based on the complete input text. In this case, carefully set MAX_INPUT_TOKENS and MAX_TOTAL_TOKENS according to your model and device memory; otherwise long inputs may exceed the LLM context limit and raise an error.
Truncate: The input text is truncated and only the first chunk is kept, whose length is min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS).
Map_reduce: The input is split into multiple chunks, each chunk is mapped to an individual summary, and those summaries are then consolidated into a single global summary. stream=True is not allowed in this mode. The default chunk_size is min(MAX_TOTAL_TOKENS - input.max_tokens - 50, MAX_INPUT_TOKENS).
Refine: The input is split into multiple chunks; a summary is generated for the first chunk, combined with the second, and the process loops over every remaining chunk to produce the final summary. The default chunk_size is min(MAX_TOTAL_TOKENS - 2 * input.max_tokens - 128, MAX_INPUT_TOKENS).
Define the summary_type by providing one of the 5 values discussed above, as shown below:
curl http://${host_ip}:8888/v1/docsum \
-H "Content-Type: multipart/form-data" \
-F "type=text" \
-F "messages=" \
-F "max_tokens=32" \
-F "files=@/path to your file (.txt, .docx, .pdf)" \
-F "language=en" \
-F "summary_type=One of the above 5 types"
Launch UI¶
Gradio UI¶
To access the frontend, open the following URL in your browser: http://{host_ip}:5173. By default, the UI runs on port 5173 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the compose.yaml
file as shown below:
docsum-xeon-ui-server:
image: ${REGISTRY:-opea}/docsum-ui:${TAG:-latest}
...
ports:
- "5173:5173"
Svelte UI (Optional)¶
To access the Svelte-based frontend, modify the UI service in the compose.yaml
file. Replace docsum-gradio-ui
service with the docsum-ui
service as per the config below:
docsum-ui:
image: ${REGISTRY:-opea}/docsum-ui:${TAG:-latest}
container_name: docsum-xeon-ui-server
depends_on:
- docsum-xeon-backend-server
ports:
- "5173:5173"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
- BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
- DOC_BASE_URL=${BACKEND_SERVICE_ENDPOINT}
ipc: host
restart: always
React-Based UI (Optional)¶
To access the React-based frontend, modify the UI service in the compose.yaml
file. Replace docsum-gradio-ui
service with the docsum-react-ui
service as per the config below:
docsum-xeon-react-ui-server:
image: ${REGISTRY:-opea}/docsum-react-ui:${TAG:-latest}
container_name: docsum-xeon-react-ui-server
depends_on:
- docsum-xeon-backend-server
ports:
- "5174:80"
environment:
- no_proxy=${no_proxy}
- https_proxy=${https_proxy}
- http_proxy=${http_proxy}
ipc: host
restart: always
Once the services are up, open the following URL in your browser: http://{host_ip}:5174. By default, the UI runs on port 80 internally. If you prefer to use a different host port to access the frontend, you can modify the port mapping in the compose.yaml
file as shown below:
docsum-xeon-react-ui-server:
image: ${REGISTRY:-opea}/docsum-react-ui:${TAG:-latest}
...
ports:
- "80:80"
Check Docker Container Logs¶
You can check the log of a container by running this command:
docker logs <CONTAINER ID> -t
You can also check the overall logs with the following command, where compose.yaml is the megaservice docker-compose configuration file. Assuming you are still in the directory $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/cpu/xeon, run the following command to check the logs:
docker compose -f compose.yaml logs
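To follow the logs of a single service, for example the TGI server, a sketch (the service name matches the compose.yaml shown below):
# Follow the logs of one service only
docker compose -f compose.yaml logs -f tgi-server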
View the docker input parameters in $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/cpu/xeon/compose.yaml
tgi-server:
image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
container_name: tgi-server
ports:
- ${LLM_ENDPOINT_PORT:-8008}:80
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
host_ip: ${host_ip}
LLM_ENDPOINT_PORT: ${LLM_ENDPOINT_PORT}
healthcheck:
test: ["CMD-SHELL", "curl -f http://${host_ip}:${LLM_ENDPOINT_PORT}/health || exit 1"]
interval: 10s
timeout: 10s
retries: 100
volumes:
- "./data:/data"
shm_size: 1g
command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
The input --model-id is ${LLM_MODEL_ID}. Ensure the environment variable LLM_MODEL_ID is set and spelled correctly. Whenever it is changed, restart the containers to use the newly selected model.
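For example, a minimal sketch of switching models and restarting the pipeline (the model ID is illustrative; depending on how set_env.sh assigns LLM_MODEL_ID, export the override after sourcing the script or edit the script directly):
# Switch to a different model (illustrative ID) and restart the containers
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.2"
cd $WORKSPACE/GenAIExamples/DocSum/docker_compose/intel/cpu/xeon
docker compose down
docker compose up -d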
Stop the services¶
Once you are done with the entire pipeline and wish to stop and remove all the containers, use the command below:
docker compose down