# Deep Research Agent Benchmarks

## Deploy the Deep Research Agent

Follow the doc [here](https://github.com/opea-project/GenAIExamples/tree/main/DeepResearchAgent) to set up the deep research agent service.

## Evaluation

```
python eval.py --datasets together-search-bench --limit 1
```

The default values for the arguments are:

| Argument       | Default value                                 | Description                                                                                                |
| -------------- | --------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
| --datasets     | together-search-bench                         | Benchmark dataset(s); supports `smolagents:simpleqa`, `hotpotqa`, `simpleqa`, and `together-search-bench` |
| --service-url  | http://localhost:8022/v1/deep_research_agent  | The deep research agent endpoint                                                                           |
| --llm-endpoint | http://localhost:8000/v1/                     | The LLM endpoint (e.g., a vLLM server) used for LLM-as-a-judge scoring                                     |
| --model        | openai/meta-llama/Llama-3.3-70B-Instruct      | The model id served by vLLM; the `openai/` prefix follows LiteLLM's provider-routing convention            |

## Accuracy

We randomly selected 30 samples from the [togethercomputer/together-search-bench](https://huggingface.co/datasets/togethercomputer/together-search-bench) dataset and compared the results of the base model against the deep research agent. The deep research agent substantially improves answer accuracy over the base model (0.433 → 0.800) on this sample.

| Model                                                                                                                   | Accuracy |
| ----------------------------------------------------------------------------------------------------------------------- | -------- |
| [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)                           | 0.433    |
| Deep research agent with [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | 0.800    |
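To reproduce the deep research agent row above, the short command from the Evaluation section can be spelled out with every argument from the defaults table. One caveat: `--limit 30` is an assumption that the flag caps the number of evaluated samples, mirroring the 30-sample run reported here.

```
# Fully-specified invocation; all values match the defaults table above.
# --limit 30 assumes the flag caps the number of evaluated samples,
# matching the 30-sample run reported in the accuracy table.
python eval.py \
  --datasets together-search-bench \
  --service-url http://localhost:8022/v1/deep_research_agent \
  --llm-endpoint http://localhost:8000/v1/ \
  --model openai/meta-llama/Llama-3.3-70B-Instruct \
  --limit 30
```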
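Before running the full benchmark, it can help to confirm the agent service is reachable at the `--service-url` above. The sketch below is only a sanity check under assumptions: the `question` field in the payload is a guess at the request schema, so consult the linked DeepResearchAgent doc for the actual contract.

```
# Sanity-check the deployed agent service.
# NOTE: the "question" payload field is an assumption; check the
# DeepResearchAgent doc linked above for the actual request schema.
curl -X POST http://localhost:8022/v1/deep_research_agent \
  -H "Content-Type: application/json" \
  -d '{"question": "In what year was the transformer architecture introduced?"}'
```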
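The judge endpoint can be checked the same way. Note that the vLLM server serves the bare model id; the `openai/` prefix in `--model` exists only so LiteLLM routes the request through its OpenAI-compatible client. A minimal check, assuming a standard OpenAI-compatible vLLM deployment at the default `--llm-endpoint`:

```
# Query the vLLM judge endpoint directly. vLLM exposes the bare model id,
# without the openai/ prefix used on the LiteLLM client side.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.3-70B-Instruct", "messages": [{"role": "user", "content": "Reply with OK."}]}'
```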