# FaqGen Accuracy
## Dataset
We evaluate FAQ generation accuracy on the SQuAD v2 QA dataset (`squad_v2`). FAQs are generated from the "context" column of the validation split, which contains 1204 unique records.

First, download the dataset and place it under `./data`.

Extract the unique "context" entries, which will be saved to `data/sqv2_context.json`:

```bash
python get_context.py
```
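Conceptually, this step looks something like the sketch below, assuming the Hugging Face `datasets` library and a plain JSON list as the output format; the repo's `get_context.py` is authoritative.

```python
import json

from datasets import load_dataset

# Load the validation split of SQuAD v2.
dataset = load_dataset("squad_v2", split="validation")

# Deduplicate the "context" column while preserving order.
unique_contexts = list(dict.fromkeys(dataset["context"]))
print(f"{len(unique_contexts)} unique contexts")  # expected: 1204

with open("data/sqv2_context.json", "w") as f:
    json.dump(unique_contexts, f, indent=2)
```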
## Generate FAQs
### Launch FaqGen microservice

Please refer to the FaqGen microservice to set up a microservice endpoint.

```bash
export FAQ_ENDPOINT="http://${your_ip}:9000/v1/faqgen"
```
### Generate FAQs with microservice

Use the microservice endpoint to generate FAQs for the dataset:

```bash
python generate_FAQ.py
```
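The generation loop is conceptually one POST request per context, along the lines of the sketch below. The request schema (`query`) and the intermediate file name are assumptions; `generate_FAQ.py` in this repo is authoritative.

```python
import json
import os

import requests

# The FaqGen endpoint exported above.
faq_endpoint = os.environ["FAQ_ENDPOINT"]

with open("data/sqv2_context.json") as f:
    contexts = json.load(f)

raw_outputs = []
for context in contexts:
    # Ask the microservice to generate FAQs for one context passage.
    # The "query" field is an assumption; check the FaqGen microservice docs.
    resp = requests.post(faq_endpoint, json={"query": context}, timeout=300)
    resp.raise_for_status()
    raw_outputs.append(resp.text)

# Hypothetical intermediate file holding the raw microservice output.
with open("data/sqv2_raw_output.json", "w") as f:
    json.dump(raw_outputs, f, indent=2)
```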
Post-process the output to extract the generated FAQs, which will be saved to `data/sqv2_faq.json`:

```bash
python post_process_FAQ.py
```
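The post-processing is typically just cleanup of the raw responses, as in the sketch below; the intermediate file name and response format are assumptions carried over from the previous sketch, and `post_process_FAQ.py` is authoritative.

```python
import json

# Hypothetical intermediate file from the generation step above.
with open("data/sqv2_raw_output.json") as f:
    raw_outputs = json.load(f)

# Strip whitespace and serialization artifacts so each entry is a clean
# block of FAQ text.
faqs = [str(output).strip() for output in raw_outputs]

with open("data/sqv2_faq.json", "w") as f:
    json.dump(faqs, f, indent=2)
```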
## Evaluate with Ragas
### Launch TGI service

We use `mistralai/Mixtral-8x7B-Instruct-v0.1` as the LLM judge to evaluate the generated FAQs. First, launch an LLM endpoint on Gaudi:

```bash
export HUGGING_FACE_HUB_TOKEN="your_huggingface_token"
bash launch_tgi.sh
```
Get the endpoint:

```bash
export LLM_ENDPOINT="http://${ip_address}:8082"
```
Verify the service:
curl http://${ip_address}:8082/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":128}}' \
-H 'Content-Type: application/json'
### Evaluate

Evaluate the performance with the LLM judge:

```bash
python evaluate.py
```
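A minimal sketch of such a Ragas run is shown below, assuming a ragas version that exposes these metric instances and LangChain's TGI wrapper. The prompt string, embedding model, and dataset layout are illustrative, and `reference_free_rubrics_score` is omitted because its availability depends on the ragas version; `evaluate.py` in this repo is authoritative.

```python
import json
import os

from datasets import Dataset
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFaceEndpoint
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_utilization, faithfulness

# Wrap the TGI endpoint launched above as the judge LLM.
judge_llm = HuggingFaceEndpoint(
    endpoint_url=os.environ["LLM_ENDPOINT"],
    max_new_tokens=512,
)
# answer_relevancy also needs an embedding model; this choice is an assumption.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")

with open("data/sqv2_context.json") as f:
    contexts = json.load(f)
with open("data/sqv2_faq.json") as f:
    faqs = json.load(f)

# Ragas expects question/answer/contexts columns; for FAQ generation the
# "question" is the generation instruction and the "answer" is the FAQ text.
data = Dataset.from_dict(
    {
        "question": ["Generate FAQs for the given context."] * len(contexts),
        "answer": faqs,
        "contexts": [[c] for c in contexts],
    }
)

result = evaluate(
    data,
    metrics=[answer_relevancy, faithfulness, context_utilization],
    llm=judge_llm,
    embeddings=embeddings,
)
print(result)
```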
## Performance Result

Here is the tested result for your reference:

| answer_relevancy | faithfulness | context_utilization | reference_free_rubrics_score |
| --- | --- | --- | --- |
| 0.7191 | 0.9681 | 0.8964 | 4.4125 |