# CodeGen Accuracy Benchmark

## Table of Contents

- [Purpose](#purpose)
- [Evaluation Framework](#evaluation-framework)
- [Prerequisites](#prerequisites)
- [Environment Setup](#environment-setup)
- [Running the Accuracy Benchmark](#running-the-accuracy-benchmark)
- [Understanding the Results](#understanding-the-results)

## Purpose

This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code from prompts.

## Evaluation Framework

We use the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework specifically designed for evaluating code generation models. It supports standard benchmarks such as [HumanEval](https://huggingface.co/datasets/openai_humaneval), [MBPP](https://huggingface.co/datasets/mbpp), and others.

## Prerequisites

- A running CodeGen service accessible via an HTTP endpoint. Refer to the main [CodeGen README](../../README.md) for deployment options.
- A Python 3.8+ environment.
- Git installed.

## Environment Setup

1. **Clone the Evaluation Repository:**

   ```shell
   git clone https://github.com/opea-project/GenAIEval
   cd GenAIEval
   ```

2. **Install Dependencies:**

   ```shell
   pip install -r requirements.txt
   pip install -e .
   ```

## Running the Accuracy Benchmark

1. **Set Environment Variables:**

   Replace `{your_ip}` with the IP address of your deployed CodeGen service and `{your_model_identifier}` with the identifier of the model being tested (e.g., `Qwen/CodeQwen1.5-7B-Chat`).

   ```shell
   export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
   export CODEGEN_MODEL="{your_model_identifier}"
   ```

   _Note: Port `7778` is the default for the CodeGen gateway; adjust it if you customized the deployment._

2. **Execute the Benchmark Script:**

   The script runs the evaluation tasks (HumanEval by default) against the specified endpoint.

   ```shell
   bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
   ```

   _Note: The framework currently runs the full task set by default. Using `limit` parameters may affect result comparability._

## Understanding the Results

The results are printed to the console and saved in `evaluation_results.json`. The key metric is `pass@k`, the percentage of problems solved correctly within `k` generated attempts (e.g., `pass@1` means solved on the first try).

Example output snippet:

```json
{
  "humaneval": {
    "pass@1": 0.7195121951219512
  },
  "config": {
    "model": "Qwen/CodeQwen1.5-7B-Chat",
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 2048,
    "precision": "fp32",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": true,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false,
    "codegen_url": "http://192.168.123.104:7778/v1/codegen"
  }
}
```

This indicates a `pass@1` score of approximately 72% on the HumanEval benchmark for the specified model served through the CodeGen endpoint.
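
If you want to pull the scores out of `evaluation_results.json` programmatically (for example, to gate a CI run), one minimal approach is sketched below. It assumes `jq` is installed and that the file follows the layout shown above, with task scores at the top level and run settings under `config`.

```shell
# Print the model, endpoint, and HumanEval pass@1 from the results file.
# Assumes the evaluation_results.json layout shown in the example above.
jq -r '"model:    \(.config.model)",
       "endpoint: \(.config.codegen_url)",
       "pass@1:   \(.humaneval."pass@1")"' evaluation_results.json
```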
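
If the benchmark fails before producing results (for example, with connection errors or empty generations), it is worth confirming that the service responds at `$CODEGEN_ENDPOINT`. The request below is a sketch based on the payload format used in the OPEA CodeGen examples; if your release expects a different schema, adapt it per the main [CodeGen README](../../README.md).

```shell
# Quick sanity check of the deployed CodeGen endpoint.
# The "messages" payload shape is an assumption; verify it against your deployment's API.
curl -sf "$CODEGEN_ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"messages": "Write a Python function that adds two numbers."}'
```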