How to benchmark PubMed datasets by sending queries randomly

This README outlines how to prepare PubMed datasets for benchmarking ChatQnA and how to create a query list based on those datasets. It also explains how to randomly send queries from the list to the ChatQnA pipeline in order to obtain performance data that better reflects real user scenarios.

1. Prepare the PubMed datasets

  • To simulate a practical user scenario, we chose real-world data from PubMed. The original PubMed data can be found here: Hugging Face - MedRAG PubMed.

  • To observe and compare the performance of the ChatQnA pipeline with different sizes of ingested datasets, we created four files: pubmed_10.txt, pubmed_100.txt, pubmed_1000.txt, and pubmed_10000.txt. These contain 10, 100, 1,000, and 10,000 records of data extracted from pubmed23n0001.jsonl.

1.1 Get the PubMed data

wget https://huggingface.co/datasets/MedRAG/pubmed/resolve/main/chunk/pubmed23n0001.jsonl

1.2 Use the script to extract data

A prepared script, extract_lines.sh, extracts a range of lines from the original PubMed file into a dataset or query-list file.

Usage:


$ cd dataset
$ ./extract_lines.sh input_file output_file begin_id end_id
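For intuition, the behavior of extract_lines.sh can be sketched roughly as follows. This is an illustrative sketch, not the shipped script: it assumes records in pubmed23n0001.jsonl are stored in id order, so record pubmed23n0001_N sits on line N+1 of the file.

```shell
# Illustrative sketch of what extract_lines.sh does (assumption: records
# appear in id order, one JSON object per line, ids ending in _0, _1, ...).
extract_lines() {
  input_file=$1
  output_file=$2
  begin_num=${3##*_}   # e.g. "pubmed23n0001_0"    -> 0
  end_num=${4##*_}     # e.g. "pubmed23n0001_9999" -> 9999
  # Record N is assumed to sit on line N+1, so copy that inclusive line range.
  sed -n "$((begin_num + 1)),$((end_num + 1))p" "$input_file" > "$output_file"
}
```

The real script in the dataset directory should be used for the actual extraction; the sketch only shows why begin_id and end_id map to a contiguous line range.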

1.3 Prepare the four dataset files

The commands below generate the four PubMed dataset files, which will be ingested by dataprep before benchmarking:

./extract_lines.sh pubmed23n0001.jsonl pubmed_10.txt pubmed23n0001_0 pubmed23n0001_9
./extract_lines.sh pubmed23n0001.jsonl pubmed_100.txt pubmed23n0001_0 pubmed23n0001_99
./extract_lines.sh pubmed23n0001.jsonl pubmed_1000.txt pubmed23n0001_0 pubmed23n0001_999
./extract_lines.sh pubmed23n0001.jsonl pubmed_10000.txt pubmed23n0001_0 pubmed23n0001_9999
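Before ingesting, it is worth sanity-checking the generated files. A minimal check (not part of the provided tooling) is to confirm each file's line count matches the record count encoded in its name:

```shell
# Illustrative sanity check: each dataset file should contain exactly as
# many lines (records) as its name suggests.
check_datasets() {
  for n in 10 100 1000 10000; do
    lines=$(wc -l < "pubmed_${n}.txt")
    if [ "$lines" -eq "$n" ]; then
      echo "pubmed_${n}.txt: OK ($lines lines)"
    else
      echo "pubmed_${n}.txt: MISMATCH (expected $n, got $lines)"
    fi
  done
}
```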

1.4 Prepare the query list

The random queries are based on 10% of the ingested dataset, so we only need to prepare a maximum of 1,000 records for the random query list:

cp pubmed_1000.txt pubmed_q1000.txt

2. How to use the PubMed query list

NOTE:
Unlike chatqnafixed.py, which sends a fixed prompt each time, chatqna_qlist_pubmed.py is designed to benchmark the ChatQnA pipeline using the PubMed query list.
On each request, it randomly selects a query from the query list file and sends it to the ChatQnA pipeline.

  • First, make sure the correct bench-target is set in run.yaml:

    bench-target: "chatqna_qlist_pubmed"
    
  • Ensure that the environment variables are set correctly:

    • DATASET: The specific name of the query list file. Default: "pubmed_q1000.txt"

    • MAX_LINES: The maximum number of lines from the query list that will be used for random queries. Default: 1000

    • MAX_TOKENS: The parameter sent to the ChatQnA pipeline to specify the maximum number of tokens the language model can generate. Default: 128

    • PROMPT: A user-defined prompt that will be sent to the ChatQnA pipeline.

  • Then run the benchmark script, for example:

    ./stresscli.py load-test --profile run.yaml
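Putting the pieces together, a full invocation might look like the following. The values shown are simply the defaults listed above, exported explicitly for clarity; adjust them to your setup.

```shell
# Illustrative: export the knobs described above, then launch the benchmark.
export DATASET="pubmed_q1000.txt"   # query list file
export MAX_LINES=1000               # lines of the list eligible for random picks
export MAX_TOKENS=128               # max tokens the LLM may generate per reply
./stresscli.py load-test --profile run.yaml
```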

For more information, please refer to the stresscli documentation.