How to Evaluate LLMs: Moving Beyond "It Looks Good to Me"

Most teams evaluate LLM outputs by reading a few examples and deciding they "look right." This works until it doesn't. A prompt change that improves one query category silently degrades another. A model update that the provider deploys overnight changes your output quality. Without systematic evaluation, you are flying blind. This guide covers the complete evaluation framework: what to measure, how to build benchmarks, when to use automated vs. human evaluation, and how to run evals as part of your CI/CD pipeline.

Why "Vibes-Based" Evaluation Fails

Three specific failure modes:

Selection bias: You test queries you expect to work. Edge cases, adversarial inputs, and long-tail queries go untested.
Recency bias: You optimize for the last bug report, potentially degrading other areas.
Inconsistency: Different team members have different quality standards. What "looks good" to one person is unacceptable to another.

The 4-Layer Evaluation Framework

Layer 1: Automated Metrics (Run on Every Change)

Automated metrics provide a numerical score that you can track over time. They are not perfect, but they catch regressions instantly.

Metric	What It Measures	Best For
Exact Match	Output matches reference exactly	Classification, extraction
ROUGE / BLEU	N-gram overlap with reference	Summarization, translation
BERTScore	Semantic similarity to reference	Open-ended generation
Faithfulness (RAGAS)	Output grounded in context	RAG applications
Answer Relevancy	Output addresses the question	QA systems
Context Precision	Retrieved context is relevant	Retrieval pipeline

Layer 2: LLM-as-Judge (Weekly)

Use a separate LLM (often a stronger model like GPT-4o) to evaluate outputs on dimensions like helpfulness, accuracy, tone, and completeness. Provide the judge with a rubric and reference answers.

# LLM-as-Judge evaluation prompt

judge_prompt = """

Evaluate the following AI response on a scale of 1-5 for each criteria:

- Accuracy: Are all facts correct and supported by the context?

- Completeness: Does the answer fully address the question?

- Clarity: Is the response well-organized and easy to follow?

- Helpfulness: Would this response satisfy the user?

Question: {question}

Context: {context}

Response: {response}

Reference Answer: {reference}

"""

The LLM judge agrees with human evaluators 80-90% of the time. It's not a replacement for human eval but an excellent screening mechanism.

Layer 3: Human Evaluation (Monthly)

For subjective quality dimensions (tone, brand voice, persuasiveness), human evaluation remains the gold standard. Build a review interface where evaluators rate outputs on your rubric. Use at least 3 evaluators per sample and measure inter-annotator agreement.

Layer 4: Production Monitoring (Continuous)

Track real-world performance signals:

User feedback: Thumbs up/down on responses. Track the feedback rate and negative feedback patterns.
Follow-up rate: If users immediately ask a follow-up, the first response likely didn't answer their question.
Task completion: For agents, did the user's goal get accomplished?
Escalation rate: How often do users request a human agent after interacting with the AI?

How to Build Your Evaluation Dataset

Start with 100 Queries

Cover your top use cases (60%), edge cases (20%), and adversarial inputs (20%). Each query gets a reference answer and expected behavior (should answer, should refuse, should ask for clarification).

Add Production Failures

Every time a user reports a bad response, add it to your eval set. This builds a regression test suite of real-world failures. Within 3 months, you'll have a highly representative benchmark.

Categorize Queries

Tag each query by type (factual, analytical, creative, adversarial) and difficulty (simple, complex, ambiguous). This lets you identify which categories regress when you make changes.

Integrating Evals into CI/CD

Run your automated eval suite on every prompt change, model update, or retrieval modification. Set minimum score thresholds per metric. If a change drops faithfulness below 0.85 or answer relevancy below 0.80, block the deployment.

This is the same discipline as unit testing for traditional software, applied to non-deterministic AI systems. For the broader agent architecture that this fits into, see our production agent blueprint.

Evaluating RAG Systems Specifically

RAG evaluation needs both retrieval metrics and generation metrics. Use the RAGAS framework to measure:

Context Precision: Are the retrieved chunks relevant to the question?
Context Recall: Did the retriever find all necessary information?
Faithfulness: Is the generated answer supported by the context?
Answer Relevancy: Does the answer address the question?

For RAG-specific optimization techniques, see our guides on fixing RAG failures and optimizing RAG at scale.

Frequently Asked Questions

How many eval examples do I need?

Start with 100. Aim for 500+ within 6 months. The more diverse your eval set, the more confident you can be in your quality metrics. Focus on breadth of query types over volume.

Can I use GPT-4 to evaluate GPT-4 outputs?

Yes, with caveats. LLMs tend to rate their own model family's outputs higher (self-preference bias). Mitigate this by using a different model family as the judge (e.g., Claude judging GPT outputs) or by calibrating the judge against human evaluations.

What tools should I use for evaluation?

RAGAS (RAG evaluation), DeepEval (general LLM eval), Langfuse (production monitoring), and custom scripts for domain-specific metrics. For cost management during eval runs, see our guide on reducing OpenAI costs.

Get Serious About AI Quality

We build evaluation pipelines that catch quality regressions before your users do. From benchmark design to CI/CD integration.

Set Up Your Eval Pipeline