Purpose

This RFC describes the behavior of the unified benchmark script for GenAIExamples users.

In v1.1, the benchmark scripts are maintained per example, which leads to duplicated code and a poor user experience.

That motivates us to improve the tooling to provide a unified entry point for performance benchmarking.

Original benchmark script layout

GenAIExamples/
├── ChatQnA/
│   ├── benchmark/
│   │   ├── benchmark.sh # each example has its own script
│   │   └── deploy.py
│   ├── kubernetes/
│   │   ├── charts.yaml
│   │   └── ...
│   ├── docker-compose/
│   │   └── compose.yaml
│   └── chatqna.py
└── ...

Proposed benchmark script layout

GenAIExamples/
├── deploy_and_benchmark.py # main entry of GenAIExamples 
├── ChatQnA/
│   ├── chatqna.yaml  # default deploy and benchmark config for deploy_and_benchmark.py
│   ├── kubernetes/
│   │   ├── charts.yaml
│   │   └── ...
│   ├── docker-compose/
│   │   └── compose.yaml
│   └── chatqna.py
└── ...

Design

The pseudo code of deploy_and_benchmark.py is listed below for your reference.

# deploy_and_benchmark.py
# below is the pseudo code to demonstrate its behavior
#
# def main(yaml_file):
#   # extract all deployment combinations from the chatqna.yaml deploy section
#   deploy_traverse_list = extract_deploy_cfg(yaml_file)
#   # for example, deploy_traverse_list = [{'node': 2, 'device': 'gaudi', 'cards_per_node': 8, ...},
#   #                                      {'node': 4, 'device': 'gaudi', 'cards_per_node': 8, ...},
#   #                                      ...]
#
#   benchmark_traverse_list = extract_benchmark_cfg(yaml_file)
#   # for example, benchmark_traverse_list = [{'concurrency': 128, 'total_query_num': 2048, ...},
#   #                                         {'concurrency': 128, 'total_query_num': 4096, ...},
#   #                                         ...]
#   for deploy_cfg in deploy_traverse_list:
#     start_k8s_service(deploy_cfg)
#     for benchmark_cfg in benchmark_traverse_list:
#       if service_ready():
#         ingest_dataset(benchmark_cfg.dataset)
#         send_http_request(benchmark_cfg)  # will call stresscli.py in GenAIEval
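The deploy section of the YAML contains both scalar and list-valued fields (node, cards_per_node, instance_num, ...), and each combination of list values yields one deployment to test. Below is a minimal, hypothetical sketch of how extract_deploy_cfg could expand top-level fields into that Cartesian product (the helper name comes from the pseudo code above; handling of nested component configs such as embedding or llm is omitted):

```python
import itertools

def extract_deploy_cfg(deploy_section):
    """Expand list-valued fields into one config dict per combination.

    Scalar values are treated as single-element lists so that
    itertools.product() can be applied uniformly across all fields.
    This is a sketch for top-level fields only.
    """
    keys = list(deploy_section.keys())
    value_lists = [v if isinstance(v, list) else [v]
                   for v in deploy_section.values()]
    return [dict(zip(keys, combo))
            for combo in itertools.product(*value_lists)]

# 2 devices x 3 node counts x 1 fixed cards_per_node = 6 combinations
deploy = {"device": ["xeon", "gaudi"], "node": [1, 2, 4], "cards_per_node": 8}
for cfg in extract_deploy_cfg(deploy):
    print(cfg)
```

Each resulting dict is then passed to start_k8s_service() as one deploy_cfg in the outer loop of the pseudo code.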

Taking ChatQnA as an example, the configurable fields are listed below.

# chatqna.yaml
#
# usage:
#  1)    deploy_and_benchmark.py --workload chatqna [overridden parameters]
#  2) or deploy_and_benchmark.py ./ChatQnA/chatqna.yaml [overridden parameters]
#
#  for example, deploy_and_benchmark.py ./ChatQnA/chatqna.yaml --node=2
#
deploy:
   # hardware related config
   device:         [xeon, gaudi, ...] # AMD and other h/ws could be extended into here
   node:           [1, 2, 4]
   cards_per_node: [4, 8]

   # components related config; by default for OOB, if overridden then for the tuned version
   embedding:
      model_id:              bge_large_v1.5
      instance_num:          [2, 4, 8]
      cores_per_instance:    4
      memory_capacity:       20 # unit: G
   retrieval:
      instance_num:       [2, 4, 8]
      cores_per_instance: 4
      memory_capacity:    20 # unit: G
   rerank:
      enable:             True
      model_id:           bge_rerank_v1.5
      instance_num:       1
      cards_per_instance: 1     # if cpu is specified, this field is ignored and will check cores_per_instance field
   llm:
      model_id:           llama2-7b
      instance_num:       7
      cards_per_instance: 1     # if cpu is specified, this field is ignored and will check cores_per_instance field
      # serving related config, dynamic batching
      max_batch_size:     [1, 2, 8, 16, 32]  # the query number to construct a single batch in serving
      max_latency:        20     # time to wait before combining incoming requests into a batch, unit milliseconds

benchmark:
   # http request behavior related fields
   concurrency:               [1, 2, 4]
   total_query_num:           [2048, 4096]
   duration:                  [5, 10] # unit minutes
   query_num_per_concurrency: [4, 8, 16]
   poisson:                   True
   poisson_arrival_rate:      1.0
   warmup_iterations:         10
   seed:                      1024

   # dataset related fields
   dataset:                [dummy_english, dummy_chinese, pub_med100, ...] # predefined keywords for supported dataset
   user_query:             [dummy_english_qlist, dummy_chinese_qlist, pub_med100_qlist, ...]
   query_token_size:       128                   # if specified, means fixed query token size will be sent out
   data_ratio:             [10%, 20%, ..., 100%] # optional, ratio from query dataset 

   # advanced settings in each component which will impact perf
   data_prep:                  # not target this time
      chunk_size:              [1024]
      chunk_overlap:           [1000]
   retriever:                  # not target this time
      algo:                    IVF
      fetch_k:                 2
      k:                       1
   rerank:
      top_n:                   2
   llm:
      max_token_size:          1024   # specify the output token size
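As shown in the usage comment, any top-level field in chatqna.yaml can be overridden on the command line (e.g. --node=2). A minimal sketch of how such overrides could be merged into the loaded config, assuming a hypothetical apply_overrides helper (not part of the actual script) that replaces whole fields rather than appending to lists:

```python
def apply_overrides(cfg, overrides):
    """Return a copy of cfg with CLI-supplied values replacing YAML fields.

    An override replaces the whole field, so --node=2 collapses the
    traversal list node: [1, 2, 4] down to the single value 2.
    The original config dict is left untouched.
    """
    merged = dict(cfg)
    for key, value in overrides.items():
        merged[key] = value
    return merged

cfg = {"node": [1, 2, 4], "cards_per_node": [4, 8]}
print(apply_overrides(cfg, {"node": 2}))
```

With this behavior, overriding a list-valued field prunes the deployment combinations to only those matching the supplied value.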