CRAG Benchmark for Agent QnA systems
The Comprehensive RAG (CRAG) benchmark was introduced by Meta in 2024 as a challenge at the KDD conference. It has questions across five domains and eight question types, and provides a practical setup for evaluating RAG systems. In particular, CRAG includes questions whose answers change on timescales ranging from seconds to years; it considers entity popularity and covers not only head, but also torso and tail facts; and it contains simple-fact questions as well as seven types of complex questions, such as comparison, aggregation and set questions, to test the reasoning and synthesis capabilities of RAG solutions. CRAG also provides mock APIs for querying mock knowledge graphs, so developers can benchmark the API-calling capabilities of agents as well. Moreover, the dataset includes golden answers, which makes auto-evaluation with LLMs more robust. CRAG is therefore a realistic and comprehensive benchmark for agents.
Getting started
Set up a work directory and download this repo into it.
export WORKDIR=<your-work-directory>
git clone
Build docker image
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/docker/
Set environment variables for downloading models from Hugging Face
mkdir $WORKDIR/hf_cache
export HF_CACHE_DIR=$WORKDIR/hf_cache
export HUGGINGFACEHUB_API_TOKEN=<your-hf-api-token>
Start docker container. This container will be used to preprocess the dataset and run the benchmark scripts.
CRAG dataset
Download the original data and process it with the commands below. You need to create an account on the Meta CRAG challenge website. After logging in, go to this link and download the file. Then make a datasets directory in your work directory using the command below.
mkdir $WORKDIR/datasets
Then put the crag_task_3_dev_v4.tar.bz2 file in the datasets directory, and decompress it by running the commands below.
cd $WORKDIR/datasets
tar -xf crag_task_3_dev_v4.tar.bz2
Preprocess the CRAG data
Data preprocessing directly determines the quality of the retrieval corpus and can therefore have a significant impact on the agent QnA system. Here we provide one way of preprocessing the data: we simply extract all the web search snippets as-is from the dataset, per domain. We also extract all the query-answer pairs, along with other metadata, per domain. You can run the command below to use our method. The data processing will take some time to finish.
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/preprocess_data
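To make the preprocessing idea concrete, here is a minimal Python sketch of a per-domain split. It is not the script shipped in this repo. The field names (domain, query, answer, question_type, static_or_dynamic, search_results, page_snippet) are assumptions about the CRAG JSON schema, and split_by_domain is a hypothetical helper name.

```python
from collections import defaultdict


def split_by_domain(records):
    """Group web search snippets and query-answer pairs by domain.

    Assumption: each record is a dict with "domain", "query", "answer" and a
    "search_results" list whose items carry a "page_snippet" string.
    """
    docs = defaultdict(list)      # domain -> retrieval corpus entries
    qa_pairs = defaultdict(list)  # domain -> query-answer pairs with metadata
    for rec in records:
        domain = rec["domain"]
        for result in rec.get("search_results", []):
            snippet = result.get("page_snippet", "")
            if snippet:  # skip empty snippets, keep the rest as-is
                docs[domain].append({"query": rec["query"], "doc": snippet})
        qa_pairs[domain].append({
            "query": rec["query"],
            "answer": rec["answer"],
            "question_type": rec.get("question_type"),
            "static_or_dynamic": rec.get("static_or_dynamic"),
        })
    return docs, qa_pairs
```

Each per-domain list could then be written out as one JSONL file, which matches the crag_docs_music.jsonl naming used later in this guide.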
Note: This is an example of data processing. You can develop and optimize your own data processing for this benchmark.
Sample queries for benchmark
The CRAG dataset has more than 4000 queries, and running all of them can be very expensive and time-consuming. You can sample a subset for the benchmark. Here we provide a script to sample up to 5 queries per question_type per dynamism in each domain. For example, we were able to get 92 queries from the music domain using the script.
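The sampling rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the repo's script: it expects the queries of a single domain as dicts with question_type and static_or_dynamic string fields (assumed field names), and sample_queries is a hypothetical helper.

```python
import random
from collections import defaultdict


def sample_queries(queries, max_per_group=5, seed=0):
    """Sample up to max_per_group queries per (question_type, dynamism) pair.

    Assumption: every query dict has string-valued "question_type" and
    "static_or_dynamic" fields.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for q in queries:
        groups[(q["question_type"], q["static_or_dynamic"])].append(q)

    sampled = []
    for key in sorted(groups):  # deterministic group order
        bucket = groups[key]
        k = min(max_per_group, len(bucket))
        sampled.extend(rng.sample(bucket, k))
    return sampled
```

Groups with fewer than 5 queries contribute everything they have, which is why the total for a domain can fall well below 5 times the number of groups.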
Launch agent QnA system
Here we showcase a RAG agent from the GenAIExamples repo. Please refer to the README in the AgentQnA example for more details. Please note: this is an example. You can build your own agent systems using OPEA components, then expose them as an endpoint for this benchmark. To launch the agent in our AgentQnA example, open another terminal, then build the images and launch the agent system there.
Build images
export WORKDIR=<your-work-directory>
git clone
cd GenAIExamples/AgentQnA/tests/
Start retrieval tool
Ingest data into vector database and validate retrieval tool
# As an example, we will use the script in the AgentQnA example.
# You can write your own script to ingest data.
# Here we ingest the docs of the music domain.
# We will use the crag-eval docker container to run the script.
# This is a client script:
# it sends data-indexing requests to the dataprep server that is part of the retrieval tool.
# So you need to switch back to the terminal where the crag-eval container is running.
cd $WORKDIR/GenAIExamples/AgentQnA/retrieval_tool/
python3 --host_ip $host_ip --filedir ${WORKDIR}/datasets/crag_docs/ --filename crag_docs_music.jsonl
Launch and validate agent endpoint
# Go to the terminal where you launched the AgentQnA example
cd $WORKDIR/GenAIExamples/AgentQnA/tests/
Run CRAG benchmark
Once your agent system is up and running, the next step is to generate answers with the agent. Change the variables in the script below and run it. By default, it runs a sampled set of queries from the music domain.
# Come back to the interactive crag-eval docker container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark
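Conceptually, answer generation is a loop that sends each sampled query to the agent and records the response next to the golden answer. The Python sketch below shows that loop under assumptions: ask_agent is a hypothetical callable that you would implement to call your own agent endpoint, and the output field names are illustrative rather than the format used by the repo's script.

```python
import json


def generate_answers(queries, ask_agent):
    """Collect one agent answer per query.

    Assumptions: each query dict has "query" and optionally "answer" (the
    golden answer); ask_agent(text) returns the agent's answer as a string.
    """
    rows = []
    for q in queries:
        try:
            answer = ask_agent(q["query"])
            error = None
        except Exception as err:  # keep going if a single request fails
            answer = None
            error = str(err)
        rows.append({
            "query": q["query"],
            "ref_answer": q.get("answer"),
            "answer": answer,
            "error": error,
        })
    return rows


def to_jsonl(rows):
    """Serialize rows as JSON Lines, one record per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
```

Recording failures instead of aborting means a long run is not lost to one timeout, and the failed queries can be retried afterwards.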
Use LLM-as-judge to grade the answers
Launch the LLM endpoint with HF TGI: in another terminal, run the command below. By default,
is used as the LLM judge.
cd llm_judge
Validate that the LLM endpoint is working properly.
export host_ip=$(hostname -I | awk '{print $1}')
curl ${host_ip}:8085/generate_stream \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
Then go back to the interactive crag-eval docker container and run the command below.
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/llm_judge/
Grade the answer correctness using LLM judge. We use
metrics from ragas.
# Inside the crag-eval container
cd $WORKDIR/GenAIEval/evals/evaluation/agent_eval/crag_eval/run_benchmark/
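For intuition about what LLM-as-judge grading involves, the Python sketch below builds a grading prompt from the question, the golden answer and the agent answer, wraps it in the same request shape as the curl example above, and parses a one-word verdict. It is a simplified illustration and not the ragas metric used by the benchmark scripts; the prompt wording, the CORRECT/INCORRECT protocol and all function names are assumptions.

```python
def build_judge_prompt(question, golden_answer, agent_answer):
    """Compose a grading prompt that compares an answer to the golden answer."""
    return (
        "You are grading an answer against a reference.\n"
        f"Question: {question}\n"
        f"Reference answer: {golden_answer}\n"
        f"Candidate answer: {agent_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def build_request(prompt, max_new_tokens=5):
    """Request body in the same shape as the curl payload shown earlier."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def parse_verdict(judge_output):
    """Map the judge's text to 1 (correct), 0 (incorrect) or None (unparseable)."""
    text = judge_output.strip().upper()
    if text.startswith("INCORRECT"):
        return 0
    if text.startswith("CORRECT"):
        return 1
    return None


def accuracy(verdicts):
    """Fraction of correct answers among those the judge could grade."""
    graded = [v for v in verdicts if v is not None]
    return sum(graded) / len(graded) if graded else 0.0
```

Unparseable judge outputs are excluded from the score rather than counted as wrong, so it is worth reporting how many there were alongside the accuracy.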