Evaluating GenAI
GenAIEval provides evaluation, benchmarking, and scorecard tools targeting performance (throughput and latency), accuracy on popular evaluation harnesses, safety, and hallucination detection.
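For a flavor of what harness-based accuracy evaluation looks like, the sketch below calls the upstream lm-evaluation-harness (`lm-eval`) package directly; the model checkpoint and task are placeholders, and this is an illustration of the harness, not GenAIEval's own API:

```python
# Minimal sketch of harness-based accuracy evaluation, assuming the
# upstream `lm-eval` (EleutherAI lm-evaluation-harness) package is
# installed. Model and task are placeholders, not GenAIEval defaults.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # placeholder checkpoint
    tasks=["hellaswag"],           # placeholder accuracy task
    batch_size=1,
)
print(results["results"])          # per-task accuracy metrics
```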
We’re building this documentation from content in the GenAIEval GitHub repository.
- GenAIEval
- Legal Information
- Kubernetes Platform Optimization with Resource Management
- GenAIEval Dockerfiles
- OPEA Benchmark Tool
- Auto-Tuning for ChatQnA: Optimizing Resource Allocation in Kubernetes
  - Usage
- Auto-Tuning for ChatQnA: Optimizing Accuracy by Tuning Model Related Parameters
- Set up Prometheus and Grafana to visualize microservice metrics
- StressCli
- How to benchmark PubMed datasets by sending queries randomly
- Locust scripts for OPEA ChatQnA
- HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly
- Benchmarks for agentic applications
- TAG-Bench for evaluating SQL agents
- CRAG Benchmark for Agent QnA systems
- AutoRAG to evaluate the RAG system performance
  - 🚀 QuickStart
- Model Card Generator
  - 🚀 QuickStart
- Evaluation Methodology
- RAG Pilot - A RAG Pipeline Tuning Tool
- Toxicity Detection Accuracy
- Metric Card for BLEU
- RAGAAF (RAG assessment - Annotation Free)
- OPEA adaptation of ragas (LLM-as-a-judge evaluation of Retrieval Augmented Generation)
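The last item above refers to OPEA's adaptation of ragas. As an illustrative sketch only (not the OPEA adaptation itself), the upstream `ragas` package scores RAG outputs like this; the sample data is made up, and ragas needs a judge LLM configured (an OpenAI API key by default):

```python
# Illustrative sketch using the upstream `ragas` package (v0.1-style
# API), not the OPEA adaptation itself. The data below is made up.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

samples = Dataset.from_dict({
    "question": ["What does GenAIEval benchmark?"],
    "answer": ["Throughput, latency, accuracy, safety, and hallucination."],
    "contexts": [[
        "GenAIEval provides evaluation, benchmarking, and scorecard "
        "tools targeting performance, accuracy, safety, and hallucination."
    ]],
})

scores = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores, e.g. faithfulness, answer_relevancy
```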