# ChatQnA Benchmark Results

## Overview

ChatQnA is deployed on a single node that uses an Intel Xeon (Ice Lake, ICX) host as the head node and 8x Gaudi2 accelerator cards. The deployment is based on the OPEA v1.3 release Helm charts and images, using vLLM as the inference serving engine.

## Methodology

The tests scale the number of concurrent users from 1 to 256, with each user sending 4 queries. For each concurrency level, the benchmark reports the average end-to-end (E2E) latency per query, the average time to first token (TTFT), and the average time per output token (TPOT).
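
For reference, the sketch below illustrates how these three averages are typically derived from per-request timings of a streaming response. It is a minimal illustration, not the benchmark tool's actual measurement code, and the `RequestTiming` structure is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestTiming:
    """Per-request timings collected by a load generator (hypothetical structure)."""
    ttft_ms: float        # time from request start to the first streamed token
    e2e_ms: float         # time from request start to the last token
    output_tokens: int    # number of tokens generated for this request

def summarize(timings: list[RequestTiming]) -> dict[str, float]:
    """Aggregate per-request timings into the averages reported below."""
    # TPOT excludes the first token: remaining generation time / remaining tokens.
    tpot = [
        (t.e2e_ms - t.ttft_ms) / (t.output_tokens - 1)
        for t in timings
        if t.output_tokens > 1
    ]
    return {
        "E2E Latency Avg (ms)": mean(t.e2e_ms for t in timings),
        "TTFT Avg (ms)": mean(t.ttft_ms for t in timings),
        "TPOT Avg (ms)": mean(tpot),
    }
```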

## Hardware and Software Configuration

| Category | Details |
| -------- | ------- |
| System Summary | 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 40 cores, 270W TDP, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB 0 [0], DSA 0 [0], IAA 0 [0], QAT 0 [0], Total Memory 1024GB (32x32GB DDR4 3200 MT/s [3200 MT/s]), BIOS ETM02, microcode 0xd0003b9, 8x Habana Labs Ltd., 4x MT28800 Family [ConnectX-5 Ex], 4x 7T INTEL SSDPF2KX076TZ, 2x 894.3G SAMSUNG MZ1L2960HCJR-00A07, Ubuntu 22.04.3 LTS, 5.15.0-92-generic. Software: WORKLOAD+VERSION, COMPILER, LIBRARIES, OTHER_SW. |
| Framework | LangChain, vLLM, Habana framework |
| Orchestration | k8s/docker |
| Containers and Virtualization | Kubernetes v1.29.9 |
| Drivers | Habana driver 1.20.1-366eb9c |
| VM vCPU, Memory | 160 vCPUs, 1 TB memory |
| OPEA Release Version | v1.3 |
| Dataset | pubmed_10.txt |
| Embedding Model | BAAI/bge-base-en-v1.5 |
| Database | Redis |
| LLM Model | meta-llama/Llama-3.1-8B-Instruct |
| Precision | bf16 |
| Output Length | 1024 |
| Command Line Parameters | `python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode oob` |
| Batch Size | 256 |

## Benchmark Results

| Users | E2E Latency Avg (ms) | TTFT Avg (ms) | TPOT Avg (ms) |
| ----- | -------------------- | ------------- | ------------- |
| 256 | 35,034.7 | 1,042.8 | 33.1 |
| 128 | 20,996.0 | 529.8 | 19.9 |
| 64 | 16,602.1 | 404.9 | 15.8 |
| 32 | 14,646.5 | 260.1 | 14.0 |
| 16 | 13,669.3 | 193.7 | 13.1 |
| 8 | 13,275.2 | 157.3 | 12.8 |
| 4 | 13,038.8 | 127.7 | 12.5 |
| 2 | 13,059.0 | 129.4 | 12.6 |
| 1 | 12,906.5 | 126.8 | 12.5 |
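
As a rough sanity check on the table, end-to-end latency should be approximately the time to first token plus the per-token time for the remaining output tokens, since the output length is fixed at 1024. The snippet below applies that relationship to the single-user row; it is a back-of-the-envelope check, not part of the benchmark tooling.

```python
# Rough consistency check: E2E ≈ TTFT + TPOT * (output_tokens - 1).
ttft_ms = 126.8          # single-user row from the table above
tpot_ms = 12.5
output_tokens = 1024     # fixed output length used in this benchmark

estimated_e2e_ms = ttft_ms + tpot_ms * (output_tokens - 1)
print(f"estimated E2E: {estimated_e2e_ms:,.1f} ms (measured: 12,906.5 ms)")
# Prints ~12,914 ms; the small gap comes from rounding in the reported TPOT.
```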

## Benchmark Config Yaml

```yaml
deploy:
  device: gaudi
  version: 1.3.0
  modelUseHostPath: /home/sdp/opea_benchmark/model
  HUGGINGFACEHUB_API_TOKEN: xxx
  node: [1]
  namespace: default
  timeout: 1000 # timeout in seconds for services to be ready, default 30 minutes
  interval: 5 # interval in seconds between service ready checks, default 5 seconds

  services:
    backend:
      resources:
        enabled: False
        cores_per_instance: "16"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    teirerank:
      enabled: False
      model_id: ""
      resources:
        enabled: False
        cards_per_instance: 1
      replicaCount: [1, 1, 1, 1]

    tei:
      model_id: ""
      resources:
        enabled: False
        cores_per_instance: "80"
        memory_capacity: "20000Mi"
      replicaCount: [1, 2, 4, 8]

    llm:
      engine: vllm
      model_id: "meta-llama/Llama-3.1-8B-Instruct" # mandatory
      replicaCount:
        with_teirerank: [7, 15, 31, 63] # When teirerank.enabled is True
        without_teirerank: [8, 16, 32, 64] # When teirerank.enabled is False
      resources:
        enabled: False
        cards_per_instance: 1
      model_params:
        vllm: # VLLM specific parameters
          batch_params:
            enabled: True
            max_num_seqs: [256]
          token_params:
            enabled: False
            max_input_length: ""
            max_total_tokens: ""
            max_batch_total_tokens: ""
            max_batch_prefill_tokens: ""
        tgi: # TGI specific parameters
          batch_params:
            enabled: True
            max_batch_size: [1, 2, 4, 8] # Each value triggers an LLM service upgrade
          token_params:
            enabled: False
            max_input_length: "1280"
            max_total_tokens: "2048"
            max_batch_total_tokens: "65536"
            max_batch_prefill_tokens: "4096"

    data-prep:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    retriever-usvc:
      resources:
        enabled: False
        cores_per_instance: "8"
        memory_capacity: "8000Mi"
      replicaCount: [1, 2, 4, 8]

    redis-vector-db:
      resources:
        enabled: False
        cores_per_instance: ""
        memory_capacity: ""
      replicaCount: [1, 1, 1, 1]

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # http request behavior related fields
  user_queries: [4, 8, 16, 32, 64, 128, 256, 512, 1024]
  concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256]
  load_shape_type: "constant" # "constant" or "poisson"
  poisson_arrival_rate: 1.0 # only used when load_shape_type is "poisson"
  warmup_iterations: 10
  seed: 1024

  # workload, all of the test cases will run for benchmark
  bench_target: [chatqna_qlist_pubmed]
  dataset: ["/home/sdp/opea_benchmark/pubmed_10.txt"]
  prompt: [10]

  llm:
    # specify the llm output token size
    max_token_size: [1024]
```
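
The YAML above drives both the deployment and the load generation. The snippet below is a minimal sketch of loading and inspecting the config with PyYAML; it assumes (this document does not confirm it) that each per-service `replicaCount` list provides one value per entry in `deploy.node`, and it is not part of `deploy_and_benchmark.py`.

```python
import yaml  # pip install pyyaml

# Load the benchmark config used in this run.
with open("ChatQnA/benchmark_chatqna.yaml") as f:
    cfg = yaml.safe_load(f)

deploy = cfg["deploy"]
node_counts = deploy["node"]                      # e.g. [1]
llm = deploy["services"]["llm"]
rerank_enabled = deploy["services"]["teirerank"]["enabled"]
key = "with_teirerank" if rerank_enabled else "without_teirerank"

# Assumption: replicaCount lists align index-wise with the deploy.node list.
for i, nodes in enumerate(node_counts):
    print(f"{nodes} node(s): engine={llm['engine']}, "
          f"model={llm['model_id']}, vLLM replicas={llm['replicaCount'][key][i]}")
```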