ChatQnA Benchmarking

Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

  • To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.

  • To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.

  • To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.

Table of Contents

  • Prerequisites
  • Data Preparation
  • Running Deploy and Benchmark Tests
  • Test Modes
  • Troubleshooting

Prerequisites

Before running the benchmarks, ensure you have:

  1. Kubernetes Environment: a running Kubernetes cluster to which the Helm charts can be deployed

  2. Configuration YAML
    The configuration file (e.g., ./ChatQnA/benchmark_chatqna.yaml) consists of two main sections: deployment and benchmarking. Fields marked with a # mandatory comment must be filled in with valid values, such as HUGGINGFACEHUB_API_TOKEN. All other fields can either be customized to your needs or left empty (“”) to use the default values from the Helm charts.

    Default Models:

    • LLM: meta-llama/Meta-Llama-3-8B-Instruct (Required: must be specified as it’s shared between deployment and benchmarking phases)

    • Embedding: BAAI/bge-base-en-v1.5

    • Reranking: BAAI/bge-reranker-base

    You can customize which models to use by setting the model_id field in the corresponding service section. Note that the LLM model must be specified in the configuration as it is used by both deployment and benchmarking processes.
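
    For instance, a hedged sketch of overriding the defaults might look like the following; the service key names (embedding, teirerank, llm) are illustrative and should be matched against the actual layout of your benchmark YAML:

    # Illustrative only -- check benchmark_chatqna.yaml for the exact service keys
    services:
      embedding:
        model_id: BAAI/bge-base-en-v1.5
      teirerank:
        model_id: BAAI/bge-reranker-base
      llm:
        model_id: meta-llama/Meta-Llama-3-8B-Instruct   # mandatory; also used by benchmarking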

    Important Notes:

    • For Gaudi deployments:

      • LLM service runs on Gaudi devices

      • If enabled, the reranking service (teirerank) also runs on Gaudi devices

    • Llama Model Access:

      • Downloading Llama models requires both:

        1. HuggingFace API token

        2. Special authorization from Meta

      • Please visit the meta-llama/Meta-Llama-3-8B-Instruct model page on Hugging Face to request access

      • Deployment will fail if the model download is unsuccessful due to missing authorization; a quick access check is shown below
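
      Before kicking off a deployment, you can confirm that your token has actually been granted access to the gated model. A quick, optional check (not part of the benchmark scripts; it assumes your token is exported as HUGGINGFACEHUB_API_TOKEN) is to pull a single small file:

      # Optional sanity check: fails fast if the token has not been granted access.
      # Assumes the Hugging Face token is exported as HUGGINGFACEHUB_API_TOKEN.
      huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --include "config.json" --token "$HUGGINGFACEHUB_API_TOKEN"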

    Node and Replica Configuration:

    node: [1, 2, 4, 8] # Number of nodes to deploy
    replicaCount: [1, 2, 4, 8] # Must align with node configuration
    

    The replicaCount values must align with the node configuration by index:

    • When deploying on 1 node → uses replicaCount[0] = 1

    • When deploying on 2 nodes → uses replicaCount[1] = 2

    • When deploying on 4 nodes → uses replicaCount[2] = 4

    • When deploying on 8 nodes → uses replicaCount[3] = 8

    Note: Model parameters that accept lists (e.g., max_batch_size, max_num_seqs) are deployment parameters that affect model service behavior, not the number of service instances. When such a parameter is given as a list, each value triggers a service upgrade followed by a new round of testing, while the number of service instances stays the same.
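
    For example (illustrative key names, to be checked against your actual YAML), a list-valued model parameter combined with the node settings above behaves like this:

    # With node: [1, 2, 4, 8] and replicaCount: [1, 2, 4, 8], deploying on 4 nodes
    # uses replicaCount[2] = 4 instances. A list-valued parameter such as:
    max_num_seqs: [128, 256]
    # keeps those 4 instances but redeploys the LLM service once per value,
    # running one full test round for 128 and another for 256.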

  3. Install required Python packages

    Run the following command to install all necessary dependencies:

    pip install -r requirements.txt
    

Data Preparation

Before running benchmarks, you need to:

  1. Prepare Test Data

    • Testing for general benchmark target:

      Download the file used for RAG data ingestion with the command below:

      wget https://raw.githubusercontent.com/opea-project/GenAIEval/main/evals/benchmark/data/upload_file.txt
      
    • Testing for pubmed benchmark target:

      For the chatqna_qlist_pubmed test case, prepare pubmed_${max_lines}.txt by following this README

    After the data is prepared, update the absolute path of this file in your benchmark YAML. For example, in ChatQnA/benchmark_chatqna.yaml, replace /home/sdp/upload_file.txt with the path to your own file.
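
    As an illustration (the field name below is hypothetical; search your benchmark YAML for the placeholder path), the change amounts to:

    # In ChatQnA/benchmark_chatqna.yaml, find the line containing /home/sdp/upload_file.txt
    # and replace it with the absolute path of your own copy, e.g.:
    dataset: "/path/to/your/upload_file.txt"   # field name is illustrative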

  2. Prepare Model Files (Recommended)

    # Pre-download the model weights into a local cache directory (/mnt/models)
    pip install -U "huggingface_hub[cli]"
    sudo mkdir -p /mnt/models
    sudo chmod 777 /mnt/models
    huggingface-cli download --cache-dir /mnt/models meta-llama/Meta-Llama-3-8B-Instruct
    

Running Deploy and Benchmark Tests

The overall process consists of two main steps: deployment and benchmarking. We provide deploy_and_benchmark.py as a unified entry point that combines both.

Running the Tests

The script deploy_and_benchmark.py serves as the main entry point. You can use any example’s configuration YAML file; the examples below use the ChatQnA configuration:

  1. For a specific number of nodes:

    # Default OOB (Out of Box) mode
    python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1
    
    # Or specify test mode explicitly
    python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode [oob|tune]
    
  2. For all node configurations:

    # Default OOB (Out of Box) mode
    python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml
    
    # Or specify test mode explicitly
    python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --test-mode [oob|tune]
    

    This will process all node configurations defined in your YAML file.

Test Modes

The script provides two test modes controlled by the --test-mode parameter:

  1. OOB (Out of Box) Mode - Default

    --test-mode oob  # or omit the parameter
    
    • Uses enabled configurations only:

      • Resources: Only uses resources when resources.enabled is True

      • Model parameters:

        • Uses batch parameters when batch_params.enabled is True

        • Uses token parameters when token_params.enabled is True

    • Suitable for basic functionality testing with selected optimizations

  2. Tune Mode

    --test-mode tune
    
    • Applies all configurations regardless of enabled status:

      • Resource-related parameters:

        • resources.cores_per_instance: CPU cores allocation

        • resources.memory_capacity: Memory allocation

        • resources.cards_per_instance: GPU/Accelerator cards allocation

      • Model parameters:

        • Batch parameters:

          • max_batch_size: Maximum batch size (TGI engine)

          • max_num_seqs: Maximum number of sequences (vLLM engine)

        • Token parameters:

          • max_input_length: Maximum input sequence length

          • max_total_tokens: Maximum total tokens per request

          • max_batch_total_tokens: Maximum total tokens in a batch

          • max_batch_prefill_tokens: Maximum tokens in prefill phase

Choose “oob” mode when you want to selectively enable optimizations, or “tune” mode when you want to apply all available optimizations regardless of their enabled status.
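
To make the distinction concrete, the sketch below shows how these knobs are typically laid out for a single service in the configuration YAML. It is illustrative only: the exact nesting and key names follow your example’s benchmark YAML and the underlying Helm chart, and the tgi/vllm grouping shown here is an assumption.

    # Illustrative sketch only -- align key names with your benchmark_chatqna.yaml
    llm:
      resources:
        enabled: False                 # oob mode: applied only when True; tune mode: always applied
        cores_per_instance: ""
        memory_capacity: ""
        cards_per_instance: 1
      model_params:
        vllm:
          batch_params:
            enabled: True
            max_num_seqs: [256]        # vLLM batch parameter
        tgi:
          batch_params:
            enabled: False
            max_batch_size: [8]        # TGI batch parameter
          token_params:
            enabled: False
            max_input_length: "2048"
            max_total_tokens: "4096"
            max_batch_total_tokens: "8192"
            max_batch_prefill_tokens: "4096"

With a configuration like this, oob mode would apply only the vLLM batch parameters (the only block with enabled: True), while tune mode would apply every block regardless of its enabled flag.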

Troubleshooting

Helm Chart Directory Issues

  • During execution, the script downloads and extracts the Helm chart to a directory named after your example

  • The directory name is derived from your input YAML file path

    • For example: if your input is ./ChatQnA/benchmark_chatqna.yaml, the extracted directory will be chatqna/

  • In some error cases, this directory might not be properly cleaned up

  • If you encounter deployment issues, check if there’s a leftover Helm chart directory:

    # Example: for ./ChatQnA/benchmark_chatqna.yaml
    ls -la chatqna/
    
    # Clean up if needed
    rm -rf chatqna/
    
  • After cleaning up the directory, try running the deployment again

Note: Always ensure there are no leftover Helm chart directories from previous failed runs before starting a new deployment.