# Deep Research Agent Benchmarks

## Deploy the Deep Research Agent

Follow the doc here to set up the deep research agent service.
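
Once the service is running, a quick request can confirm the endpoint is reachable. Note that the JSON body below is an assumption for illustration; the deployment doc defines the actual request schema.

```bash
# Smoke test -- the request body here is an assumption, see the deployment doc
# for the actual request schema.
curl -X POST http://localhost:8022/v1/deep_research_agent \
  -H "Content-Type: application/json" \
  -d '{"question": "What year was the transistor invented?"}'
```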

## Evaluation

```bash
python eval.py --datasets together-search-bench --limit 1
```

The default values for the arguments are:

| Argument | Default value | Description |
|---|---|---|
| `--datasets` | `together-search-bench` | Benchmark dataset(s); supports `smolagents:simpleqa`, `hotpotqa`, `simpleqa`, and `together-search-bench` |
| `--service-url` | `http://localhost:8022/v1/deep_research_agent` | The deep research agent endpoint |
| `--llm-endpoint` | `http://localhost:8000/v1/` | The LLM endpoint (e.g., a vLLM server) used for LLM-as-judge scoring |
| `--model` | `openai/meta-llama/Llama-3.3-70B-Instruct` | The model ID served by vLLM; the `openai/` prefix is LiteLLM's provider-routing format |
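
For example, the following invocation spells out every argument explicitly using the defaults from the table above (`--limit 30` assumes the flag caps the number of evaluated samples, matching the 30-sample accuracy run below):

```bash
python eval.py \
  --datasets together-search-bench \
  --limit 30 \
  --service-url http://localhost:8022/v1/deep_research_agent \
  --llm-endpoint http://localhost:8000/v1/ \
  --model openai/meta-llama/Llama-3.3-70B-Instruct
```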

## Accuracy

We randomly select 30 samples from the togethercomputer/together-search-bench dataset and compare the results of the base model against the deep research agent. The results show that the deep research agent improves generation quality and accuracy.

| Model | Accuracy |
|---|---|
| meta-llama/Llama-3.3-70B-Instruct | 0.433333 |
| Deep research agent with meta-llama/Llama-3.3-70B-Instruct | 0.8 |
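
For reference, below is a minimal sketch of a 30-sample LLM-as-judge evaluation loop of the kind `eval.py` performs. This is not `eval.py`'s actual code: the dataset split and field names (`test`, `question`, `answer`) and the agent's request/response schema are assumptions; only the endpoints, model ID, and dataset name come from this document.

```python
# Minimal sketch of the 30-sample LLM-as-judge loop (NOT eval.py's actual code).
# Assumptions: dataset split/field names and the agent's JSON request/response schema.
import random

import requests
from datasets import load_dataset
from litellm import completion

SERVICE_URL = "http://localhost:8022/v1/deep_research_agent"
JUDGE_MODEL = "openai/meta-llama/Llama-3.3-70B-Instruct"  # "openai/" prefix = LiteLLM format
JUDGE_API_BASE = "http://localhost:8000/v1/"              # the vLLM endpoint

ds = load_dataset("togethercomputer/together-search-bench", split="test")  # split name assumed
samples = random.sample(list(ds), 30)

correct = 0
for row in samples:
    # Query the deep research agent (the JSON schema here is an assumption).
    resp = requests.post(SERVICE_URL, json={"question": row["question"]}, timeout=600)
    prediction = resp.json().get("answer", "")

    # LLM-as-judge: ask the served model whether the prediction matches the reference.
    verdict = completion(
        model=JUDGE_MODEL,
        api_base=JUDGE_API_BASE,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {row['question']}\n"
                f"Reference answer: {row['answer']}\n"
                f"Predicted answer: {prediction}\n"
                "Reply with exactly one word: CORRECT or INCORRECT."
            ),
        }],
    )
    if verdict.choices[0].message.content.strip().upper().startswith("CORRECT"):
        correct += 1

print(f"accuracy: {correct / len(samples):.4f}")
```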