ChatQnA Benchmark Results¶
Overview¶
ChatQnA is deployed on a single node that uses Intel Xeon (ICX) cores for the head node and is equipped with 8x Gaudi2 accelerator cards. The deployment is based on the OPEA v1.3 release Helm charts and images, using vLLM as the inference serving platform.
Methodology¶
Tests scale the number of concurrent users from 1 to 256, and each user sends 4 queries. For each query, the average end-to-end (E2E) latency, the average time to first token (TTFT), and the average time per output token (TPOT) are measured.
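To make the metric definitions concrete, the sketch below shows one way a single query could be timed against a streaming ChatQnA endpoint. The endpoint URL, payload shape, and token framing are illustrative assumptions; this is not the exact harness used to produce the numbers in this report.

```python
# Minimal sketch of the per-query metrics, assuming an OpenAI-style streaming
# endpoint exposed by the ChatQnA mega-service. URL and payload are assumptions.
import time
import requests

def measure_one_query(url: str, prompt: str) -> dict:
    payload = {"messages": prompt, "stream": True, "max_tokens": 1024}
    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0

    with requests.post(url, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_lines():
            if not chunk:
                continue
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now          # timestamp of the first streamed token
            n_chunks += 1
    end = time.perf_counter()

    if first_token_at is None:
        raise RuntimeError("no tokens received from the endpoint")

    e2e_ms = (end - start) * 1000                      # end-to-end latency
    ttft_ms = (first_token_at - start) * 1000          # time to first token
    # TPOT: average gap between successive output tokens after the first one
    tpot_ms = ((end - first_token_at) / max(n_chunks - 1, 1)) * 1000
    return {"e2e_ms": e2e_ms, "ttft_ms": ttft_ms, "tpot_ms": tpot_ms}
```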
Hardware and Software Configuration¶
| Category | Details |
|---|---|
| System Summary | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. |
| Framework | LangChain, vLLM, Habana framework |
| Orchestration | k8s/docker |
| Containers and Virtualization | Kubernetes v1.29.9 |
| Drivers | Habana driver 1.20.1-366eb9c |
| VM vCPU, Memory | 160 vCPUs, 1 TB memory |
| OPEA Release Version | v1.3 |
| Dataset | pubmed_10.txt |
| Embedding Model | BAAI/bge-base-en-v1.5 |
| Database | Redis |
| LLM Model | meta-llama/Llama-3.1-8B-Instruct |
| Precision | bf16 |
| Output Length | 1024 tokens |
| Command Line Parameters | `python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob` |
| Batch Size | 256 |
Benchmark Results¶
| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) |
|---|---|---|---|
| 256 | 35,034.7 | 1,042.8 | 33.1 |
| 128 | 20,996.0 | 529.8 | 19.9 |
| 64 | 16,602.1 | 404.9 | 15.8 |
| 32 | 14,646.5 | 260.1 | 14.0 |
| 16 | 13,669.3 | 193.7 | 13.1 |
| 8 | 13,275.2 | 157.3 | 12.8 |
| 4 | 13,038.8 | 127.7 | 12.5 |
| 2 | 13,059.0 | 129.4 | 12.6 |
| 1 | 12,906.5 | 126.8 | 12.5 |
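As a rough sanity check on the table, E2E latency should be close to TTFT plus (output length − 1) × TPOT when the full 1024-token output is generated for every response. The decomposition below is illustrative arithmetic under that assumption, not an additional measurement.

```python
# Rough consistency check: E2E ≈ TTFT + (output_tokens - 1) * TPOT,
# assuming each response generates the full 1024 output tokens.
ttft_ms, tpot_ms, output_tokens = 126.8, 12.5, 1024   # 1-user row
approx_e2e_ms = ttft_ms + (output_tokens - 1) * tpot_ms
print(f"{approx_e2e_ms:.1f} ms")   # ~12914.3 ms vs. the measured 12906.5 ms
```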
Benchmark Config Yaml¶
```yaml
deploy:
  device: gaudi
  version: 1.3.0
  modelUseHostPath: /home/sdp/opea_benchmark/model
  HUGGINGFACEHUB_API_TOKEN: xxx
  node: [1]
  namespace: default
  timeout: 1000  # timeout in seconds for services to be ready, default 30 minutes
  interval: 5  # interval in seconds between service ready checks, default 5 seconds

  services:
    backend:
      resources:
        enabled: False
        cores_per_instance: "16"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    teirerank:
      enabled: False
      model_id: ""
      resources:
        enabled: False
        cards_per_instance: 1
      replicaCount: [1, 1, 1, 1]

    tei:
      model_id: ""
      resources:
        enabled: False
        cores_per_instance: "80"
        memory_capacity: "20000Mi"
      replicaCount: [1, 2, 4, 8]

    llm:
      engine: vllm
      model_id: "meta-llama/Llama-3.1-8B-Instruct"  # mandatory
      replicaCount:
        with_teirerank: [7, 15, 31, 63]  # When teirerank.enabled is True
        without_teirerank: [8, 16, 32, 64]  # When teirerank.enabled is False
      resources:
        enabled: False
        cards_per_instance: 1
      model_params:
        vllm:  # vLLM specific parameters
          batch_params:
            enabled: True
            max_num_seqs: [256]
          token_params:
            enabled: False
            max_input_length: ""
            max_total_tokens: ""
            max_batch_total_tokens: ""
            max_batch_prefill_tokens: ""
        tgi:  # TGI specific parameters
          batch_params:
            enabled: True
            max_batch_size: [1, 2, 4, 8]  # Each value triggers an LLM service upgrade
          token_params:
            enabled: False
            max_input_length: "1280"
            max_total_tokens: "2048"
            max_batch_total_tokens: "65536"
            max_batch_prefill_tokens: "4096"

    data-prep:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    retriever-usvc:
      resources:
        enabled: False
        cores_per_instance: "8"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    redis-vector-db:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # http request behavior related fields
  user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024]
  concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256]
  load_shape_type: "constant"  # "constant" or "poisson"
  poisson_arrival_rate: 1.0  # only used when load_shape_type is "poisson"
  warmup_iterations: 10
  seed: 1024

  # workload, all of the test cases will run for benchmark
  bench_target: [chatqna_qlist_pubmed]
  dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"]
  prompt: [10]

  llm:
    # specify the llm output token size
    max_token_size: [1024]
```
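For reference, the request-shape fields pair up positionally: each concurrency level is run with 4x as many total queries, which matches the 4-queries-per-user methodology above. The short sketch below (assuming the config is saved locally as `benchmark_chatqna.yaml` and that PyYAML is installed) simply prints that pairing.

```python
# Minimal sketch: read the benchmark config and show how concurrency levels
# pair positionally with total user queries. File path is an assumption.
import yaml

with open("benchmark_chatqna.yaml") as f:
    cfg = yaml.safe_load(f)

bench = cfg["benchmark"]
for concurrency, queries in zip(bench["concurrency"], bench["user_queries"]):
    print(f"{concurrency:>4} concurrent users -> {queries:>5} total queries "
          f"({queries // concurrency} per user)")
```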