"It looks good" is not a metric. Here is how to build an evaluation pipeline that catches quality regressions before your users do.
Most teams evaluate LLM outputs by reading a few examples and deciding they "look right." This works until it doesn't. A prompt change that improves one query category silently degrades another. A model update that the provider deploys overnight changes your output quality. Without systematic evaluation, you are flying blind. This guide covers the complete evaluation framework: what to measure, how to build benchmarks, when to use automated vs. human evaluation, and how to run evals as part of your CI/CD pipeline.
Three specific failure modes:
Automated metrics provide a numerical score that you can track over time. They are not perfect, but they catch regressions instantly.
| Metric | What It Measures | Best For |
|---|---|---|
| Exact Match | Output matches reference exactly | Classification, extraction |
| ROUGE / BLEU | N-gram overlap with reference | Summarization, translation |
| BERTScore | Semantic similarity to reference | Open-ended generation |
| Faithfulness (RAGAS) | Output grounded in context | RAG applications |
| Answer Relevancy | Output addresses the question | QA systems |
| Context Precision | Retrieved context is relevant | Retrieval pipeline |
Use a separate LLM (often a stronger model like GPT-4o) to evaluate outputs on dimensions like helpfulness, accuracy, tone, and completeness. Provide the judge with a rubric and reference answers.
# LLM-as-Judge evaluation prompt
judge_prompt = """
Evaluate the following AI response on a scale of 1-5 for each criteria:
- Accuracy: Are all facts correct and supported by the context?
- Completeness: Does the answer fully address the question?
- Clarity: Is the response well-organized and easy to follow?
- Helpfulness: Would this response satisfy the user?
Question: {question}
Context: {context}
Response: {response}
Reference Answer: {reference}
"""
The LLM judge agrees with human evaluators 80-90% of the time. It's not a replacement for human eval but an excellent screening mechanism.
For subjective quality dimensions (tone, brand voice, persuasiveness), human evaluation remains the gold standard. Build a review interface where evaluators rate outputs on your rubric. Use at least 3 evaluators per sample and measure inter-annotator agreement.
Track real-world performance signals:
Cover your top use cases (60%), edge cases (20%), and adversarial inputs (20%). Each query gets a reference answer and expected behavior (should answer, should refuse, should ask for clarification).
Every time a user reports a bad response, add it to your eval set. This builds a regression test suite of real-world failures. Within 3 months, you'll have a highly representative benchmark.
Tag each query by type (factual, analytical, creative, adversarial) and difficulty (simple, complex, ambiguous). This lets you identify which categories regress when you make changes.
Run your automated eval suite on every prompt change, model update, or retrieval modification. Set minimum score thresholds per metric. If a change drops faithfulness below 0.85 or answer relevancy below 0.80, block the deployment.
This is the same discipline as unit testing for traditional software, applied to non-deterministic AI systems. For the broader agent architecture that this fits into, see our production agent blueprint.
RAG evaluation needs both retrieval metrics and generation metrics. Use the RAGAS framework to measure:
For RAG-specific optimization techniques, see our guides on fixing RAG failures and optimizing RAG at scale.
Start with 100. Aim for 500+ within 6 months. The more diverse your eval set, the more confident you can be in your quality metrics. Focus on breadth of query types over volume.
Yes, with caveats. LLMs tend to rate their own model family's outputs higher (self-preference bias). Mitigate this by using a different model family as the judge (e.g., Claude judging GPT outputs) or by calibrating the judge against human evaluations.
RAGAS (RAG evaluation), DeepEval (general LLM eval), Langfuse (production monitoring), and custom scripts for domain-specific metrics. For cost management during eval runs, see our guide on reducing OpenAI costs.
We build evaluation pipelines that catch quality regressions before your users do. From benchmark design to CI/CD integration.
Set Up Your Eval Pipeline