TAG-Bench for evaluating SQL agents

Overview of TAG-Bench

TAG-Bench is a benchmark published in 2024 by researchers at Stanford University and the University of California, Berkeley, and advocated by Databricks, for evaluating GenAI systems on challenging questions over SQL databases. The questions in TAG-Bench require GenAI systems not only to translate natural language queries into SQL, but also to combine information from other sources and perform reasoning. There are 80 questions in total, 20 in each sub-category: match-based, comparison, ranking, and aggregation queries. The questions are about 5 databases selected from Alibaba’s BIRD Text2SQL benchmark: california_schools, debit_card_specializing, formula_1, codebase_community, and european_football_2. For more information, please read the paper.

Getting started

  1. Set up the environment

export WORKDIR=<your-work-directory>
mkdir $WORKDIR/hf_cache 
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HF_HOME=$HF_CACHE_DIR
export HF_TOKEN=<your-huggingface-api-token>
export HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN
export PYTHONPATH=$PYTHONPATH:$WORKDIR/GenAIEval/
  2. Download this repo in your work directory

cd $WORKDIR
git clone https://github.com/opea-project/GenAIEval.git
  3. Create a conda environment

conda create -n agent-eval-env python=3.10
conda activate agent-eval-env
pip install -r $WORKDIR/GenAIEval/evals/evaluation/agent_eval/docker/requirements.txt
  4. Download data

cd $WORKDIR
git clone https://github.com/TAG-Research/TAG-Bench.git
cd TAG-Bench/setup
chmod +x get_dbs.sh
./get_dbs.sh
  5. Preprocess data

cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/preprocess_data/
bash run_data_split.sh
  6. Generate a hints file for each database in TAG-Bench

python3 generate_hints.py

The hints are generated from the database description files that come with the TAG-Bench dataset; they are simply the column descriptions provided there. The SQL agent can use these hints to improve its performance.
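For reference, a minimal sketch of this idea is shown below, assuming BIRD-style database_description CSV files with original_column_name and column_description columns. The directory layout and column names here are assumptions; generate_hints.py remains the authoritative implementation.

# Hedged sketch: build hints from BIRD-style description files.
# Directory layout and column names are assumptions; see generate_hints.py for the real logic.
import glob
import os
import pandas as pd

def build_hints(db_root: str, db_name: str) -> str:
    hints = []
    desc_dir = os.path.join(db_root, db_name, "database_description")
    for csv_file in sorted(glob.glob(os.path.join(desc_dir, "*.csv"))):
        table = os.path.splitext(os.path.basename(csv_file))[0]
        df = pd.read_csv(csv_file, encoding_errors="ignore")
        for _, row in df.iterrows():
            column = str(row.get("original_column_name", "")).strip()
            description = str(row.get("column_description", "")).strip()
            if column and description and description.lower() != "nan":
                hints.append(f"{table}.{column}: {description}")
    return "\n".join(hints)

# Example call (the path is an assumption about where get_dbs.sh placed the databases):
# print(build_hints("TAG-Bench/dev_folder/dev_databases", "california_schools"))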

  7. Launch the LLM endpoint on Gaudi.

This LLM will be used by the agent and will also serve as the LLM judge for scoring the agent’s answers. By default, the meta-llama/Meta-Llama-3.1-70B-Instruct model is served using 4 Gaudi cards.

# First build vllm image for Gaudi
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/vllm-gaudi
bash build_image.sh

Then launch the vllm endpoint with the command below.

bash launch_vllm_gaudi.sh
  8. Validate that the vllm endpoint is working properly.

python3 test_vllm_endpoint.py
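Alternatively, since the endpoint is OpenAI-compatible, you can send a quick manual request with the openai client. A minimal sketch follows; the port (8085) and served model name are assumptions, so adjust them to match launch_vllm_gaudi.sh.

# Hedged sketch: manual check of the OpenAI-compatible vllm endpoint.
# Port and model name are assumptions; match them to launch_vllm_gaudi.sh.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8085/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)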

Launch your SQL agent

You can create and launch your own SQL agent. Here we show an example using OPEA sql_agent_llama; follow the steps below to launch it.

  1. Download OPEA GenAIComps repo

cd $WORKDIR
git clone https://github.com/opea-project/GenAIComps.git
  2. Build the docker image for the OPEA agent

cd $WORKDIR/GenAIComps
export agent_image="opea/agent:comps"
docker build --no-cache -t $agent_image --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy -f comps/agent/src/Dockerfile .
  3. Set up the environment for the agent's search_web tool.

export GOOGLE_CSE_ID=<your-GOOGLE_CSE_ID>
export GOOGLE_API_KEY=<your-GOOGLE_API_KEY>

For instructions on how to obtain your GOOGLE_CSE_ID and GOOGLE_API_KEY, refer to the instructions here.
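These two credentials are used to call Google's Custom Search JSON API. For illustration only, a search with them looks roughly like the sketch below; the agent's actual search_web tool in GenAIComps may wrap this differently.

# Illustrative sketch: query Google's Custom Search JSON API with the credentials above.
# The agent's real search_web tool may differ; this only shows what the two env vars are for.
import os
import requests

def search_web(query: str, num_results: int = 3) -> list[str]:
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": os.environ["GOOGLE_API_KEY"],
            "cx": os.environ["GOOGLE_CSE_ID"],
            "q": query,
            "num": num_results,
        },
        timeout=30,
    )
    response.raise_for_status()
    return [item["title"] for item in response.json().get("items", [])]

print(search_web("TAG-Bench benchmark"))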

  4. Launch the SQL agent

cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/opea_sql_agent_llama
bash launch_sql_agent.sh california_schools

The command above will launch a SQL agent that interacts with the california_schools database. We also provide a script to run the benchmark on all databases (see “Run the benchmark” below).
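Before running the full benchmark, you can send the agent a single question to confirm it responds. The sketch below is a hypothetical smoke test: the port (9095) and the request payload shape are assumptions about the OPEA agent's OpenAI-style endpoint, so check launch_sql_agent.sh and the GenAIComps agent documentation for the actual values.

# Hypothetical smoke test of the SQL agent endpoint.
# The port and payload shape are assumptions; verify them against launch_sql_agent.sh.
import requests

url = "http://localhost:9095/v1/chat/completions"  # assumed agent port and route
payload = {"messages": "How many schools are there in total?"}  # assumed request format
response = requests.post(url, json=payload, timeout=300)
response.raise_for_status()
print(response.json())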

Run the benchmark

  1. Generate answers

cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/TAG-Bench/run_benchmark
bash run_generate_answer.sh california_schools
  2. Grade the answers

bash run_grading.sh california_schools

Here we use the ragas answer_correctness metric to measure performance. By default, we use meta-llama/Meta-Llama-3.1-70B-Instruct as the LLM judge, served from the vllm endpoint launched in the previous section.
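For reference, a minimal sketch of this kind of scoring is shown below. The endpoint URL, judge model, embedding model, and the ragas 0.1.x-style API are assumptions; run_grading.sh remains the authoritative implementation.

# Hedged sketch: score an agent answer with ragas answer_correctness, using the vllm endpoint as judge.
# Endpoint URL, model names, and the ragas 0.1.x-style API are assumptions; see run_grading.sh.
from datasets import Dataset
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from ragas import evaluate
from ragas.metrics import answer_correctness

judge_llm = ChatOpenAI(
    base_url="http://localhost:8085/v1",  # assumed vllm endpoint
    api_key="EMPTY",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    temperature=0,
)
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")  # assumed embedding model

data = Dataset.from_dict({
    "question": ["How many schools are in Alameda County?"],   # example question
    "answer": ["There are 3 schools in Alameda County."],      # example agent answer
    "ground_truth": ["3"],                                      # example golden answer
})

scores = evaluate(data, metrics=[answer_correctness], llm=judge_llm, embeddings=embeddings)
print(scores)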

  3. Run the benchmark on all databases

If you want to run all 80 questions spanning the 5 databases, run the command below.

bash run_all_databases.sh

This script will iteratively generate and grade answers for the questions on each database.

Benchmark results

We tested OPEA sql_agent_llama on all 80 questions in TAG-Bench.

Human grading criteria:

  • Score 1: exact match with golden answer

  • Score 0.7: match with golden answer except for the ordering of the entities

  • Score 0.5: some information is missing, but the answer contains no information that is absent from the golden answer

  • Score 0: otherwise

| Database | Average human score | Average ragas answer_correctness |
| --- | --- | --- |
| california_schools | 0.264 | 0.415 |
| codebase_community | 0.262 | 0.404 |
| debit_card_specializing | 0.75 | 0.753 |
| formula_1 | 0.389 | 0.596 |
| european_football_2 | 0.25 | 0.666 |
| Overall Average (ours) | 0.31 (0.28 if strict exact match) | 0.511 |
| Text2SQL (TAG-Bench paper) | 0.17 | N/A |
| Human performance (TAG-Bench paper) | 0.55 | N/A |

We can see that our SQL agent achieved much higher accuracy than the Text2SQL baseline reported in the TAG-Bench paper, although it is still below human expert performance.