# ChatQnA Benchmark Results

## Overview

ChatQnA is deployed on a single node that uses an Intel Xeon (Ice Lake, ICX) host as the head node and 8x Gaudi2 accelerator cards. The deployment is based on the OPEA v1.3 release Helm charts and images, using vLLM as the inference serving engine.

## Methodology

The tests scale the number of concurrent users from 1 to 256, with each user sending 4 queries. For each concurrency level, the benchmark reports the average end-to-end (E2E) latency per query, the average time to first token (TTFT), and the average time per output token (TPOT).
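
For reference, the sketch below illustrates how these three averages are typically derived from per-request timings of a streaming response. It is a minimal illustration, not the benchmark tool's actual measurement code, and the `RequestTiming` structure is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestTiming:
    """Per-request timings collected by a load generator (hypothetical structure)."""
    ttft_ms: float        # time from request start to the first streamed token
    e2e_ms: float         # time from request start to the last token
    output_tokens: int    # number of tokens generated for this request

def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate per-request timings into the averages reported below."""
    # TPOT excludes the first token: remaining generation time / remaining tokens.
    tpot = [
        (t.e2e_ms - t.ttft_ms) / (t.output_tokens - 1)
        for t in timings
        if t.output_tokens > 1
    ]
    return {
        "E2E Latency Avg (ms)": mean(t.e2e_ms for t in timings),
        "TTFT Avg (ms)": mean(t.ttft_ms for t in timings),
        "TPOT Avg (ms)": mean(tpot),
    }
```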

## Hardware and Software Configuration

| Category | Details |
| -------- | ------- |
| System Summary | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. |
| Framework | LangChain, vLLM, Habana framework |
| Orchestration | k8s/docker |
| Containers and Virtualization | Kubernetes v1.29.9 |
| Drivers | Habana driver 1.20.1-366eb9c |
| VM vCPU, Memory | 160 vCPUs, 1 TB memory |
| OPEA Release Version | v1.3 |
| Dataset | pubmed_10.txt |
| Embedding Model | BAAI/bge-base-en-v1.5 |
| Database | Redis |
| LLM Model | meta-llama/Llama-3.1-8B-Instruct |
| Precision | bf16 |
| Output Length | 1024 |
| Command Line Parameters | `python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob` |
| Batch Size | 256 |

## Benchmark Results

| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) |
| ----- | -------------------- | ------------- | ------------- |
| 256 | 35,034.7 | 1,042.8 | 33.1 |
| 128 | 20,996.0 | 529.8 | 19.9 |
| 64 | 16,602.1 | 404.9 | 15.8 |
| 32 | 14,646.5 | 260.1 | 14.0 |
| 16 | 13,669.3 | 193.7 | 13.1 |
| 8 | 13,275.2 | 157.3 | 12.8 |
| 4 | 13,038.8 | 127.7 | 12.5 |
| 2 | 13,059.0 | 129.4 | 12.6 |
| 1 | 12,906.5 | 126.8 | 12.5 |
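
As a rough sanity check on the table, end-to-end latency should be approximately the time to first token plus the per-token time for the remaining output tokens, since the output length is fixed at 1024. The snippet below applies that relationship to the single-user row; it is a back-of-the-envelope check, not part of the benchmark tooling.

```python
# Rough consistency check: E2E ≈ TTFT + TPOT * (output_tokens - 1).
ttft_ms = 126.8          # single-user row from the table above
tpot_ms = 12.5
output_tokens = 1024     # fixed output length used in this benchmark

estimated_e2e_ms = ttft_ms + tpot_ms * (output_tokens - 1)
print(f"estimated E2E: {estimated_e2e_ms:,.1f} ms (measured: 12,906.5 ms)")
# Prints ~12,914 ms; the small gap comes from rounding in the reported TPOT.
```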

## Benchmark Config Yaml

```yaml
deploy:
  device: gaudi
  version: 1.3.0
  modelUseHostPath: /home/sdp/opea_benchmark/model
  HUGGINGFACEHUB_API_TOKEN: xxx
  node: [1]
  namespace: default
  timeout: 1000 # timeout in seconds for services to be ready, default 30 minutes
  interval: 5 # interval in seconds between service ready checks, default 5 seconds

  services:
    backend:
      resources:
        enabled: False
        cores_per_instance: "16"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    teirerank:
      enabled: False
      model_id: ""
      resources:
        enabled: False
        cards_per_instance: 1
      replicaCount: [1, 1, 1, 1]

    tei:
      model_id: ""
      resources:
        enabled: False
        cores_per_instance: "80"
        memory_capacity: "20000Mi"
      replicaCount: [1, 2, 4, 8]

    llm:
      engine: vllm
      model_id: "meta-llama/Llama-3.1-8B-Instruct" # mandatory
      replicaCount:
        with_teirerank: [7, 15, 31, 63] # When teirerank.enabled is True
        without_teirerank: [8, 16, 32, 64] # When teirerank.enabled is False
      resources:
        enabled: False
        cards_per_instance: 1
      model_params:
        vllm: # VLLM specific parameters
          batch_params:
            enabled: True
            max_num_seqs: [256]
          token_params:
            enabled: False
            max_input_length: ""
            max_total_tokens: ""
            max_batch_total_tokens: ""
            max_batch_prefill_tokens: ""
        tgi: # TGI specific parameters
          batch_params:
            enabled: True
            max_batch_size: [1, 2, 4, 8] # Each value triggers an LLM service upgrade
          token_params:
            enabled: False
            max_input_length: "1280"
            max_total_tokens: "2048"
            max_batch_total_tokens: "65536"
            max_batch_prefill_tokens: "4096"

    data-prep:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    retriever-usvc:
      resources:
        enabled: False
        cores_per_instance: "8"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    redis-vector-db:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # http request behavior related fields
  user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024]
  concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256]
  load_shape_type: "constant" # "constant" or "poisson"
  poisson_arrival_rate: 1.0 # only used when load_shape_type is "poisson"
  warmup_iterations: 10
  seed: 1024

  # workload, all of the test cases will run for benchmark
  bench_target: [chatqna_qlist_pubmed]
  dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"]
  prompt: [10]

  llm:
    # specify the llm output token size
    max_token_size: [1024]
```
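
The YAML above drives both the deployment and the load generation. The snippet below is a minimal sketch of loading and inspecting the config with PyYAML; it assumes (this document does not confirm it) that each per-service `replicaCount` list provides one value per entry in `deploy.node`, and it is not part of `deploy_and_benchmark.py`.

```python
import yaml  # pip install pyyaml

# Load the benchmark config used in this run.
with open("ChatQnA/benchmark_chatqna.yaml") as f:
    cfg = yaml.safe_load(f)

deploy = cfg["deploy"]
node_counts = deploy["node"]                      # e.g. [1]
llm = deploy["services"]["llm"]
rerank_enabled = deploy["services"]["teirerank"]["enabled"]
key = "with_teirerank" if rerank_enabled else "without_teirerank"

# Assumption: replicaCount lists align index-wise with the deploy.node list.
for i, nodes in enumerate(node_counts):
    print(f"{nodes} node(s): engine={llm['engine']}, "
          f"model={llm['model_id']}, vLLM replicas={llm['replicaCount'][key][i]}")
```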