# Deep Research Agent Benchmarks

## Deploy the Deep Research Agent
Follow the documentation here to set up the deep research agent service.
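Once the service is up, you can sanity-check it with a direct request. The payload below is only a sketch; the exact request schema is defined in the deployment doc, and the `question` field used here is an assumption:

```bash
# Smoke test against the default endpoint (request body is illustrative;
# check the deployment doc for the exact schema).
curl -X POST http://localhost:8022/v1/deep_research_agent \
  -H "Content-Type: application/json" \
  -d '{"question": "Who won the 2022 FIFA World Cup?"}'
```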
## Evaluation
```bash
python eval.py --datasets together-search-bench --limit 1
```
The default values for the arguments are:

| Argument | Default value | Description |
|---|---|---|
| `--datasets` | `together-search-bench` | Benchmark datasets; supports `smolagents:simpleqa`, `hotpotqa`, `simpleqa`, and `together-search-bench` |
| `--service-url` | `http://localhost:8022/v1/deep_research_agent` | The deep research agent endpoint |
| `--llm-endpoint` | `http://localhost:8000/v1/` | The LLM endpoint (e.g., a vLLM server) used for LLM-as-judge |
| `--model` | `openai/meta-llama/Llama-3.3-70B-Instruct` | The model ID served by vLLM; the `openai/` prefix is LiteLLM's provider format |
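For example, a fully spelled-out invocation of the 30-sample run reported below, using the default endpoints from the table (the `--limit` value mirrors the sample count used in the Accuracy section):

```bash
python eval.py \
  --datasets together-search-bench \
  --service-url http://localhost:8022/v1/deep_research_agent \
  --llm-endpoint http://localhost:8000/v1/ \
  --model openai/meta-llama/Llama-3.3-70B-Instruct \
  --limit 30
```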
## Accuracy
We randomly selected 30 samples from the togethercomputer/together-search-bench dataset and compared the results of the base model and the deep research agent. The results show that the deep research agent improves generation quality and accuracy.
| Model | Accuracy |
|---|---|
| meta-llama/Llama-3.3-70B-Instruct (base model) | 0.433333 |
| Deep research agent with meta-llama/Llama-3.3-70B-Instruct | 0.8 |
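For reference, below is a minimal sketch of how an LLM-as-judge accuracy check can work against the endpoints above. This is an illustration, not the actual `eval.py` implementation; the prompt wording and the YES/NO parsing are assumptions:

```python
import litellm

def judge(question: str, reference: str, prediction: str) -> bool:
    """Ask the judge model whether a prediction matches the reference answer."""
    response = litellm.completion(
        # The "openai/" prefix tells LiteLLM to use the OpenAI-compatible API
        # exposed by the vLLM server configured via --llm-endpoint.
        model="openai/meta-llama/Llama-3.3-70B-Instruct",
        api_base="http://localhost:8000/v1/",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Predicted answer: {prediction}\n"
                "Does the prediction match the reference? Answer YES or NO."
            ),
        }],
        temperature=0.0,
    )
    return "YES" in response.choices[0].message.content.upper()

# Accuracy is then the fraction of judged matches over the sampled questions:
# accuracy = sum(judge(q, ref, pred) for q, ref, pred in samples) / len(samples)
```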