How to Evaluate LLM Applications: Metrics, Testing and Production Quality Guide
LLM applications cannot be evaluated by asking "does the answer look good?" alone. Production quality requires systematic testing of prompt behaviour, retrieved context, factual accuracy, safety, latency, cost, tool calls, agent decisions, and user satisfaction — before launch and continuously in production.
Answer quality
Relevance, faithfulness, format
Retrieval quality
Precision, recall, context fit
Task success
Goal completion, agent planning
Safety
Policy compliance, toxicity, injection
Performance
Latency, cost, token usage
Production signals
User feedback, monitoring, drift
What Does It Mean to Evaluate an LLM Application?
Quick answer
LLM applications are evaluated by testing the complete system — not just whether an individual response looks good, but whether the entire pipeline reliably produces high-quality, accurate, safe, and cost-efficient outputs at scale. This includes: prompt behaviour, retrieved context quality, generated answer accuracy, factual grounding, safety compliance, latency, token cost, tool call correctness, agent decision quality, and production feedback from real users.
Unlike traditional software tests — which have deterministic pass/fail criteria — LLM evaluation is probabilistic, context-dependent, and requires both automated metrics and human judgement. Building a rigorous evaluation process is one of the most important practices in LLMOps and a prerequisite for safely deploying production AI systems.
Why LLM Evaluation is Different from Normal Software Testing
Standard software tests are deterministic: given the same input, the function always returns the same output. Pass means correct. Fail means broken. LLM applications break most of these assumptions.
Probabilistic outputs
The same input may produce slightly different outputs on every call. Temperature, sampling, and model non-determinism mean your evaluation must aggregate across multiple runs, not a single response.
No single correct answer
Unlike a function that returns 42, an LLM response to "Summarise this document" has no single ground truth. Quality must be evaluated against rubrics, not fixed expected strings.
Context-dependent correctness
The same question may have a correct answer in one knowledge base and a different correct answer in another. RAG systems must be evaluated against the specific documents they have access to.
Retrieval failures compound
In a RAG application, an answer that is wrong because retrieval returned the wrong chunks and an answer that is wrong because the model ignored good chunks look identical from the outside. Retrieval quality must be measured separately from generation quality.
Agent multi-step failures
AI agents can fail at any step of a multi-turn plan. The final answer might be correct despite wrong intermediate steps — or wrong because of a single bad tool call buried in step 3 of 7.
Production monitoring is part of evaluation
Offline evaluation on a test set is necessary but not sufficient. Production traffic exposes edge cases that test sets miss. Real evaluation requires logging, monitoring, and feedback capture in production.
Cost and latency are first-class concerns
A system that is technically correct but costs $0.50 per query or takes 8 seconds to respond is not production-ready. Token cost and latency must be measured as part of the evaluation process, not as an afterthought.
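Because outputs vary between calls, a single pass/fail check says little. The following is a minimal sketch of aggregating a score over repeated runs. It assumes a hypothetical `generate` callable (your application) and a hypothetical `score` callable (your metric, returning a value between 0 and 1); neither comes from a specific library.

```python
import statistics
from typing import Callable


def evaluate_across_runs(
    generate: Callable[[str], str],   # hypothetical: calls your LLM application
    score: Callable[[str], float],    # hypothetical: returns a score in [0, 1]
    prompt: str,
    n_runs: int = 5,
    pass_threshold: float = 0.7,      # illustrative threshold, tune per use case
) -> dict:
    """Run the same prompt n_runs times and summarise the score distribution."""
    scores = [score(generate(prompt)) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        # Sample standard deviation needs at least two data points.
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "pass_rate": sum(s >= pass_threshold for s in scores) / n_runs,
    }
```

Reporting the minimum and pass rate alongside the mean helps surface prompts that usually work but occasionally fail badly.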
LLM Evaluation Workflow
From business goal definition through offline testing to production monitoring and feedback.
Business goal
Define success criteria
Golden dataset
Real questions + rubrics
Offline eval
Metrics, automated tests
Human review
Edge cases, rubric check
RAG / agent checks
Retrieval, tool calls, plans
Production monitoring
Traces, cost, quality signals
Feedback loop
Improve prompts, retrieval, models
Define business goal
What does success look like? What are the failure modes?
Build golden dataset
Real user questions, expected answers or rubrics, edge cases
Run offline evaluation
Automated metrics: relevance, faithfulness, safety, cost
Human review
Verify edge cases, calibrate automated scores
RAG / agent-specific checks
Retrieval quality, tool selection, planning, multi-step traces
Production monitoring
Trace logging, cost per query, quality sampling, user feedback
Feedback loop
Improve prompts, retrieval, model selection based on production data
Core LLM Evaluation Metrics
| Metric | What it checks | Why it matters | Example failure |
|---|---|---|---|
| Answer relevance | Does the response actually address the user's question? | Irrelevant answers destroy trust immediately, even if factually accurate | User asks about refund policy; model explains shipping times |
| Factual accuracy | Is the stated information true and verifiable? | Factual errors create real-world harm in financial, medical, or legal contexts | Model states a product was released in 2022 when it was released in 2024 |
| Faithfulness / groundedness | Is the answer supported by the retrieved context? | Ungrounded answers look correct but cannot be trusted or audited | Model states a fact not present in any retrieved document |
| Hallucination rate | What % of responses contain claims not in the source? | Even low hallucination rates are problematic for high-stakes applications | Model invents a policy clause that does not exist in the company handbook |
| Instruction following | Does the model follow the system prompt and user constraints? | Ignored instructions indicate prompt quality or model alignment issues | Model responds in English despite instruction to respond in Hindi |
| Safety / policy compliance | Does the response stay within permitted boundaries? | Out-of-scope or harmful outputs expose legal, reputational, and user safety risk | Customer service bot gives medical advice it was not permitted to give |
| Toxicity / harmful output | Does the response contain harmful, offensive, or unsafe content? | Toxic outputs damage brand trust and can cause direct harm to users | Model uses offensive language when provoked with adversarial input |
| Format correctness | Is the output in the expected format (JSON, markdown, list)? | Wrong format breaks downstream parsing and user experience | Model returns prose where a structured JSON object was required |
| Latency | How long does a response take (time to first token and end-to-end)? | Slow responses hurt UX and may indicate pipeline optimisation issues | Multi-step RAG response takes 12 seconds on a mobile app |
| Cost per response | What is the total token cost of a single query-response cycle? | Unmanaged cost makes LLM features unsustainable at scale | Agent loop generates 15,000 tokens on a simple task due to verbose context |
| Token usage | Input + output token counts per request | Token count is the primary cost and latency driver | System prompt is 8,000 tokens and is re-sent on every turn unnecessarily |
| User satisfaction | User ratings, thumbs up/down, follow-up behaviour | The ultimate measure of whether the application works for real users | Users rephrase the same question three times — the first answer was not useful |
| Task success rate | % of tasks completed correctly end-to-end | For agents and multi-step workflows, partial correctness may not be acceptable | Agent successfully calls tools but returns the wrong final conclusion |
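Rate-style metrics in the table, such as hallucination rate and task success rate, reduce to simple fractions once each response has been labelled. A minimal sketch, assuming the labels already exist (from human review or a calibrated judge); the `EvalRecord` field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical labels produced by your own review or judge step.
    has_unsupported_claim: bool   # at least one claim not found in the source
    task_succeeded: bool          # end-to-end outcome was correct


def rate(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def summarise(records: list[EvalRecord]) -> dict[str, float]:
    """Compute rate metrics over a batch of labelled responses."""
    return {
        "hallucination_rate": rate([r.has_unsupported_claim for r in records]),
        "task_success_rate": rate([r.task_succeeded for r in records]),
    }
```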
LLM Evaluation Metric Groups
Quality metrics
- Answer relevance
- Factual accuracy
- Faithfulness / groundedness
- Instruction following
- Format correctness
- Task success rate
Retrieval metrics
- Retrieval precision
- Retrieval recall
- Context relevance
- Context faithfulness
- Source quality
- Chunking effectiveness
Safety metrics
- Hallucination rate
- Toxicity / harmful content
- Policy compliance
- PII leakage
- Prompt injection resistance
- Refusal accuracy
Performance metrics
- Time to first token (TTFT)
- End-to-end latency
- Token usage per request
- Cost per response
- Throughput (req/s)
- Retry and error rate
Agent metrics
- Goal completion rate
- Tool selection accuracy
- Tool call correctness
- Planning quality
- Step count efficiency
- Human handoff rate
Business metrics
- User satisfaction score
- Conversation resolution rate
- Escalation rate
- Return / re-query rate
- Feature adoption
- Cost per resolved query
Building a Golden Evaluation Dataset
A golden evaluation dataset is the foundation of reproducible LLM evaluation. It is a curated set of inputs — paired with expected answers, rubrics, or reference documents — that is run against the application on every significant change. Without it, evaluation is manual, slow, and inconsistent.
Use real user questions
Gather questions from actual users, support tickets, or representative user interviews — not questions you invented. Synthetic questions miss the edge cases real users ask.
Include edge cases deliberately
Add questions the model is likely to struggle with: ambiguous phrasing, questions outside the knowledge base, very long inputs, multilingual inputs, adversarial prompts, and questions that should trigger a refusal.
Pair each question with an expected answer or rubric
For factual questions, provide the correct answer. For open-ended questions, provide a scoring rubric ("response should mention X, Y, Z and cite a source"). For agent tasks, specify the expected tool sequence and final outcome.
Include source documents for RAG
For each RAG question, include the specific document chunks that should be retrieved. This enables retrieval precision and recall measurement, not just answer quality.
Include negative cases
Add questions the system should decline to answer (out of scope, unanswerable, policy violations), questions where the correct answer is "I don't know", and questions that should trigger clarification.
Version the dataset
Treat the evaluation dataset like code — commit it, version it, and review changes. When the knowledge base or use case evolves, update the dataset accordingly.
Re-run on every significant change
Run the full evaluation suite whenever prompt templates, model versions, retrieval configuration, embedding models, chunking strategy, or tool definitions change. Treat a drop in scores as a blocking issue, not a warning.
Practical starting point
For most teams, a golden dataset of 50–200 well-curated examples is far more valuable than 1,000 auto-generated synthetic examples. Start small, cover the most important use cases, and grow the dataset as production traffic reveals new failure patterns.
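One way to make the guidance above concrete is to give every golden example a fixed shape and store the set as JSON Lines in version control. This is a sketch only: the `GoldenExample` schema and its field names are illustrative assumptions, not a standard format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class GoldenExample:
    # Field names are illustrative, not a standard schema.
    id: str
    question: str
    expected_answer: Optional[str] = None                        # factual questions
    rubric: list[str] = field(default_factory=list)              # open-ended questions
    expected_chunk_ids: list[str] = field(default_factory=list)  # RAG source chunks
    should_refuse: bool = False                                  # negative cases
    tags: list[str] = field(default_factory=list)                # e.g. "edge-case"


def validate(example: GoldenExample) -> list[str]:
    """Return a list of problems; an empty list means the example is usable."""
    problems = []
    if not example.question.strip():
        problems.append("empty question")
    if not (example.expected_answer or example.rubric or example.should_refuse):
        problems.append("needs an expected answer, a rubric, or should_refuse=True")
    return problems


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialise one example per line so diffs in version control stay readable."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)
```

Keeping one example per line means a reviewer sees exactly which questions were added or changed in each commit.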
Human Evaluation vs Automated Evaluation
Neither human review nor automated metrics is sufficient on its own. A production evaluation setup combines both deliberately.
Human evaluation
- ✓ Catches nuanced quality issues automated metrics miss
- ✓ Essential for calibrating automated metric thresholds
- ✓ Required for high-stakes output review (medical, legal, financial)
- ✓ Can assess business relevance and tone appropriateness
- ✗ Does not scale to thousands of production responses
- ✗ Expensive and slow for regression testing before every deploy
- ✗ Inter-annotator agreement is often low without clear rubrics
Automated evaluation
- ✓ Scales to run on every deploy and production sample
- ✓ Consistent: the same rubric is applied every time
- ✓ Fast enough to block a deployment when scores drop
- ✓ Cost and latency tracking is fully automated
- ✗ LLM-as-judge can be inconsistent and gameable
- ✗ Misses nuanced failures that require context or expertise
- ✗ Metric-optimised outputs may still fail user needs
Using LLM-as-a-judge
LLM-as-a-judge uses a separate, often more capable model to evaluate responses against a structured rubric. It scales better than human review but must be used carefully:
- Calibrate the judge against human labels before relying on it
- Use structured rubrics, not vague "rate from 1 to 5" prompts
- Check for positional bias: judges often prefer the first option presented
- Do not use LLM-as-judge for safety evaluation; use dedicated safety classifiers
- Monitor judge model version changes, because a provider update may silently shift scores
Best practice: use automated metrics for regression testing and cost monitoring, human review for calibration and edge case auditing, and LLM-as-judge for scalable quality sampling between human review cycles.
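Two of the checks above, calibration against human labels and positional bias, can be measured with a few lines of code. The sketch below uses Cohen's kappa for chance-corrected agreement and a swap test for position consistency. `pairwise_judge` is a hypothetical callable wrapping your judge model and is assumed to return `"first"` or `"second"`.

```python
from collections import Counter
from typing import Callable


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def position_consistency(
    pairwise_judge: Callable[[str, str], str],  # hypothetical: "first" or "second"
    pairs: list[tuple[str, str]],
) -> float:
    """Fraction of pairs where the judge picks the same answer after a swap."""
    consistent = 0
    for a, b in pairs:
        winner_ab = a if pairwise_judge(a, b) == "first" else b
        winner_ba = b if pairwise_judge(b, a) == "first" else a
        consistent += winner_ab == winner_ba
    return consistent / len(pairs)
```

A low kappa suggests the rubric or judge prompt needs work before its scores are used for deployment decisions; a position consistency well below 1.0 suggests the judge is sensitive to presentation order.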
How to Evaluate RAG Applications
Evaluating a RAG (Retrieval-Augmented Generation) application requires measuring the retrieval pipeline and the generation step separately. A convincingly-written answer built on poor retrieval will fail on real-world query distributions even if it looks good in a demo.
Retrieval precision
Of the documents retrieved, what fraction are actually relevant to the query? Low precision means the context window is polluted with irrelevant chunks.
Retrieval recall
Of the documents that should have been retrieved, what fraction were actually retrieved? Low recall means important information is missing from the context.
Context relevance
Is the retrieved chunk specifically useful for answering this query? A chunk can be topically related but not contain the specific fact needed.
Context faithfulness
Does the generated answer accurately reflect what is in the retrieved context? This measures whether the model fabricates information not present in the chunks.
Answer groundedness
Is every factual claim in the answer traceable to a specific retrieved source? Ungrounded claims indicate hallucination or over-generalization.
Citation and source quality
Does the model correctly cite the sources it used? Are the cited sources actually authoritative and current?
Chunking quality
Are documents chunked in a way that preserves semantic coherence? Poor chunking splits related information across boundaries, degrading retrieval quality.
"No answer" behaviour
When the knowledge base does not contain the answer, does the system correctly say so — or does it hallucinate a plausible-sounding answer?
Stale document handling
When the knowledge base contains outdated information, is this detected and disclosed? Stale context is a common source of factual errors in production RAG.
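Retrieval precision and recall, as defined above, can be computed per query once the golden dataset lists the chunk IDs that should be retrieved. A minimal sketch, assuming chunks are identified by string IDs (the ID scheme is an assumption):

```python
def retrieval_precision_recall(
    retrieved_ids: list[str],
    relevant_ids: set[str],
) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    Both default to 0.0 when their denominator is empty (a convention chosen here).
    """
    retrieved = set(retrieved_ids)
    hits = len(retrieved & relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Averaging these values over the golden dataset gives a retrieval score that can be tracked independently of answer quality.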
How to Evaluate AI Agents
AI agents execute multi-step plans, call external tools, manage memory across turns, and may take irreversible actions. Evaluating agents requires tracing the entire execution path — not just the final output.
Goal completion rate
Does the agent achieve the stated objective? Define completion criteria precisely — partial completions should be scored separately from full failures.
Tool selection accuracy
Does the agent choose the right tool for each reasoning step? Wrong tool selection can cascade through the entire multi-step plan.
Tool call correctness
Does the agent pass valid parameters to the tools it calls? Incorrect parameters cause tool failures that derail the overall task.
Planning quality
Is the step sequence logically sound and efficient? Poor planning produces correct answers through longer, more expensive paths, or fails to reach the goal at all.
Step-by-step trace review
Review full execution traces, not just the final answer. A correct final output may hide dangerous intermediate steps or lucky recoveries from tool failures.
Loop and retry behaviour
Does the agent detect and break out of looping conditions? Does it retry tool failures intelligently rather than indefinitely?
Permission and safety boundaries
Does the agent stay within its permitted action scope? This is especially critical for agents with write access, external communications, or financial actions.
Human handoff quality
When the agent encounters uncertainty or high-stakes decisions, does it escalate to human review at the right point — rather than proceeding autonomously?
Cost and latency per task
Multi-step agent tasks multiply LLM call costs. Track cost and latency per completed task, not per individual step.
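Several of the agent checks above can be scored directly from an execution trace. The sketch below compares an actual tool-call sequence against an expected one, position by position. The `ToolCall` shape is a hypothetical simplification of whatever your tracing layer records, and strict positional matching is only one possible scoring choice.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


def evaluate_trace(
    actual: list[ToolCall],
    expected: list[ToolCall],
    max_steps: int = 10,   # illustrative step budget
) -> dict:
    """Score one agent trace against the expected tool sequence."""
    paired = list(zip(actual, expected))
    tool_matches = sum(a.tool == e.tool for a, e in paired)
    arg_matches = sum(a.tool == e.tool and a.args == e.args for a, e in paired)
    denom = max(len(expected), 1)
    # Consecutive identical calls are a cheap signal that the agent may be looping.
    repeats = sum(1 for prev, cur in zip(actual, actual[1:]) if prev == cur)
    return {
        "tool_selection_accuracy": tool_matches / denom,
        "tool_call_correctness": arg_matches / denom,
        "extra_steps": max(len(actual) - len(expected), 0),
        "repeated_calls": repeats,
        "within_step_budget": len(actual) <= max_steps,
    }
```

Positional matching penalises an agent that reaches the goal by a different but valid route, so these scores are best read alongside goal completion rather than instead of it.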
How to Evaluate Prompts
Prompts are production artifacts — a small wording change can significantly shift output quality, safety, or format compliance. Prompt evaluation should be treated as rigorously as code review and regression testing.
Instruction following
Does the LLM consistently follow the instructions in the system prompt across varied inputs?
Role and persona clarity
Does the model maintain the intended role, tone, and scope throughout the conversation?
Output format adherence
Does the response match the required format (JSON, markdown, numbered list, specific word count)?
Handling missing information
When the user query is ambiguous or incomplete, does the model ask for clarification rather than guessing?
Prompt injection resistance
Does the model resist user attempts to override system prompt instructions or change its behaviour?
Regression after prompt changes
Does any change to the system prompt affect responses to existing test cases? Run the full golden dataset after any prompt edit.
Version prompts like code
Store every version of each system prompt in version control. Tag the prompt version alongside each evaluation result so you can correlate prompt changes with quality score changes. For more on writing and testing effective prompts, see the LLM Prompt Engineering Guide.
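Output format adherence is one of the easier prompt checks to automate. A minimal sketch for the JSON case, assuming the prompt requires a top-level JSON object with a known set of keys (the key names passed in are up to you):

```python
import json


def check_json_output(raw: str, required_keys: set[str]) -> list[str]:
    """Return format problems; an empty list means the output passed."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return ["top-level value is not a JSON object"]
    missing = sorted(required_keys - set(parsed))
    return [f"missing key: {k}" for k in missing]
```

Running a check like this over the golden dataset after each prompt edit gives a format pass rate that can be compared between prompt versions.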
Production Monitoring and LLMOps
Offline evaluation against a test dataset is a necessary starting point, but it is not sufficient. Production traffic exposes failure modes that no static test set anticipates. This is why production monitoring is a core component of LLMOps — not an optional extra.
Trace logging
Log every LLM call with its full prompt, retrieved context, tool calls, and response. Traces are the primary debugging tool when production quality drops.
Prompt and model version tracking
Tag every production request with the active prompt version and model version. This allows quality metrics to be segmented by version — essential for detecting regressions.
Latency monitoring
Track time to first token, end-to-end response time, and retrieval latency separately. Monitor p95 and p99 — averages hide the long tail that hurts user experience most.
Token and cost monitoring
Track input tokens, output tokens, and cost per request. Segment by route, user type, and time period. Alert on anomalous cost spikes that may indicate a loop or context inflation bug.
User feedback capture
Instrument thumbs up/down, explicit ratings, or follow-up rephrasing behaviour as proxy signals for answer quality. Aggregate and review daily.
Quality drift detection
Sample a percentage of production responses (e.g. 5–10% daily) and run automated evaluation metrics on them. Track scores over time to detect gradual quality degradation.
Failed conversation review
Flag and review conversations where the user abandoned, negatively rated, or re-queried without resolution. These are your most valuable evaluation signals.
Rollback plan
Maintain the previous stable prompt version and model configuration. If production quality drops after a change, you need a documented, tested rollback path — not an emergency improvisation.
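The latency and cost signals above reduce to small calculations over logged requests. The sketch below computes a nearest-rank percentile and a per-request cost. The per-1,000-token prices are parameters you supply; no real provider pricing is assumed.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Multiply before dividing to limit floating-point rounding in the rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,    # placeholder price, supplied by the caller
    output_price_per_1k: float,   # placeholder price, supplied by the caller
) -> float:
    """Cost of one request given token counts and per-1,000-token prices."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
```

Computing p95 and p99 from raw per-request latencies, rather than from pre-aggregated averages, keeps the long tail visible.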
Common LLM Evaluation Mistakes
✗ Testing only 10–20 examples
A handful of examples catches obvious failures but misses distributional issues. A golden dataset needs at minimum 50 diverse, real-world examples — and ideally 200+.
✗ Judging by demo quality only
A curated demo is the best-case scenario. Evaluate on a representative random sample of the input distribution, including the unpolished and adversarial cases.
✗ Ignoring retrieval quality
In RAG systems, answer quality is bounded by retrieval quality. A fluent answer built on wrong chunks will be confidently wrong. Measure retrieval separately.
✗ Ignoring refusal and "I don't know" behaviour
Correctly refusing to answer unanswerable or out-of-scope questions is a quality signal, not a failure. Evaluate whether refusals are triggered appropriately.
✗ Ignoring cost and latency
A system that works correctly but costs $2 per query or responds in 15 seconds is not ready for production. These are first-class constraints, not post-launch concerns.
✗ No regression testing
Every change to prompt, model, or retrieval configuration can regress quality in ways that are not immediately obvious. Automated regression tests are non-negotiable.
✗ No production feedback loop
Offline evaluation tells you what you prepared for. Only production feedback tells you what users actually encounter. Capture feedback from day one of launch.
✗ Treating LLM-as-judge as ground truth
LLM judges are useful but inconsistent, biased toward longer answers, and gameable. Calibrate against human labels before relying on them for deployment decisions.
✗ Conflating model quality with system quality
A powerful LLM can still produce bad outputs if the prompt is poor, retrieval is weak, or the knowledge base is outdated. Evaluate the full system, not the model in isolation.
Popular Tools and Frameworks for LLM Evaluation
The LLM evaluation tooling space is evolving quickly. Below are the major categories with representative tools — not endorsements. The right tool depends on your stack, team, and evaluation maturity.
Evaluation frameworks
RAGAS, DeepEval, promptfoo
Purpose-built frameworks for LLM application testing. RAGAS focuses on RAG-specific metrics (faithfulness, context precision, answer relevance). DeepEval provides a structured test suite for LLM applications. promptfoo supports prompt regression testing via CI integration.
Tracing and observability
LangSmith, Arize AI, Helicone, OpenTelemetry
Capture, visualise, and analyse LLM execution traces in production. LangSmith integrates tightly with LangChain/LangGraph. OpenTelemetry-based tracing enables vendor-neutral observability across any stack.
Experiment tracking
MLflow, Weights & Biases (Weave)
Track prompt experiments, evaluation runs, and model comparisons. MLflow is widely used across both traditional ML and LLM evaluation workflows. W&B Weave adds LLM-specific experiment tracking on top of the established W&B platform.
Custom evaluation rubrics
Spreadsheets, Notion, Airtable
For early-stage teams, a structured spreadsheet with clearly defined rubrics for each quality dimension is often the fastest way to start. Do not wait for tooling to begin evaluating.
CI integration for eval
GitHub Actions, promptfoo CI, DeepEval CI
Run evaluation suites as part of CI/CD pipelines so quality regressions are caught before deployment, not after. Treat a score drop as a build failure.
Production analytics
Custom dashboards, PostHog, Datadog
Track cost per query, latency percentiles, and user satisfaction signals from production traffic. These business-level signals complement technical evaluation metrics.
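Whatever tooling is chosen, treating a score drop as a build failure comes down to comparing the current run against a stored baseline. The following is a tool-agnostic sketch, not the API of any framework named above; the metric names and the tolerance value are assumptions.

```python
def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,   # illustrative allowance for run-to-run noise
) -> list[str]:
    """List metrics whose score fell more than `tolerance` below the baseline."""
    failures = []
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing from current run")
        elif new_score < base_score - tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {new_score:.3f}")
    return failures
```

A CI step could call this after the evaluation suite finishes and exit with a non-zero status when the returned list is non-empty, which fails the build in most CI systems.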
LLM Application Evaluation Checklist
Before launching an LLM application to production, verify each of the following:
Definition
- Business goal and success criteria are clearly documented
- Failure modes and acceptable risk thresholds are defined
- Out-of-scope queries and required refusal behaviour are specified
Test dataset
- Golden test dataset exists with real user questions
- Each question has an expected answer or scoring rubric
- Edge cases, negative cases, and adversarial inputs are included
- RAG questions include expected source documents
Quality evaluation
- Answer relevance and faithfulness measured on the full test set
- Hallucination rate measured and below acceptable threshold
- Safety and policy compliance tested with adversarial inputs
- Format correctness tested for structured output requirements
RAG and retrieval
- Retrieval precision and recall measured on the test set
- "No answer found" behaviour tested and confirmed correct
- Stale and conflicting document handling verified
Agents and tools
- Tool selection accuracy tested across representative task types
- Tool call correctness verified with edge-case parameters
- Loop prevention and max iteration limits tested
- Permission and safety boundaries verified with boundary-crossing inputs
Performance
- End-to-end latency measured and within acceptable limits
- Cost per response measured and within budget
- Peak load testing completed if applicable
Production readiness
- Trace logging enabled and verified in staging
- Prompt and model version tracking enabled
- User feedback capture (thumbs up/down) implemented
- Automated quality sampling job configured for production
- Rollback plan documented and tested
- Alerting configured for cost spikes, latency regressions, and error rate increases
Next Steps: Learning LLM Evaluation and AI Engineering
LLM evaluation sits at the intersection of prompt engineering, RAG, agents, and LLMOps. The following learning path covers each foundation systematically.
Prompt engineering and structured outputs
Before evaluating outputs, learn to reliably control them. Structured output patterns reduce format failures and make evaluation simpler.
RAG fundamentals and production RAG
Understanding how retrieval works is a prerequisite for evaluating RAG quality. Learn the full RAG pipeline from ingestion through retrieval.
LLMOps and production operations
Evaluation in production requires monitoring, logging, feedback loops, and versioning — the full scope of LLMOps practice.
AI agents and multi-step evaluation
Agent evaluation requires tracing multi-step executions and understanding frameworks like LangGraph and tool integration patterns.
Production deployment and AI engineering
Bring evaluation into CI/CD pipelines, integrate observability, and build the full AI engineering skill set for production AI systems.
Frequently Asked Questions — LLM Application Evaluation
What is LLM evaluation?
LLM evaluation is the process of systematically measuring the quality, accuracy, safety, and reliability of a large language model application. It goes beyond checking "does the output look good?" to include factual accuracy, retrieval quality (for RAG), planning correctness (for agents), token cost, latency, safety compliance, and user satisfaction. Evaluation runs both offline (against a test dataset before deployment) and online (in production using monitoring, logging, and user feedback).
How do you evaluate an LLM application?
Evaluate an LLM application by: (1) defining a clear success criterion for the use case, (2) building a golden test dataset with real user questions and expected answers or rubrics, (3) running the test set offline and measuring quality metrics (groundedness, relevance, faithfulness, task success), (4) testing edge cases and negative cases, (5) deploying with tracing and monitoring enabled, (6) capturing user feedback signals, and (7) running evaluation on every significant change to prompt, model, retrieval pipeline, or tool configuration.
What metrics are used to evaluate LLM applications?
Core LLM evaluation metrics include: answer relevance (does the response address the question?), factual accuracy (is the answer true?), groundedness/faithfulness (is the answer supported by retrieved context?), hallucination rate, instruction following, safety/policy compliance, format correctness, latency, token cost, task success rate, and user satisfaction. For RAG systems, retrieval precision, retrieval recall, context relevance, and source quality are also measured.
How is RAG evaluation different from normal LLM evaluation?
RAG evaluation must assess the full pipeline — not just the generated answer, but the quality of the retrieval step that feeds it. A correct-sounding answer built on poor retrieval is fragile and will fail on unseen queries. RAG evaluation adds: retrieval precision and recall (are the right documents being retrieved?), context relevance (is the retrieved chunk actually useful for answering the question?), context faithfulness (does the generated answer accurately reflect the retrieved context?), chunking quality, and embedding model quality. Frameworks like RAGAS provide standardised metrics for these.
What is LLM-as-a-judge?
LLM-as-a-judge is a technique where a separate LLM (often a more capable model) is used to evaluate the output of the application LLM. Instead of comparing outputs to a fixed ground truth, the judge model scores the response on a rubric — for example, rating groundedness, relevance, and helpfulness on a scale. It scales better than human review but must be calibrated carefully: judge models can be inconsistent, can be gamed by sycophantic outputs, and should never be treated as a perfect substitute for human evaluation.
How do you test AI agents?
AI agent evaluation focuses on the multi-step reasoning and action process, not just the final output. Test: goal completion rate (does the agent achieve the stated objective?), tool selection accuracy (does it pick the right tool for each step?), tool call correctness (does it pass valid parameters?), planning quality (is the step sequence logical?), loop prevention (does it avoid infinite action loops?), safety and permission boundaries (does it stay within scope?), cost and latency per task, and graceful handoff to human when needed.
How often should LLM applications be evaluated?
Run the full offline evaluation suite before every significant change: prompt updates, model version changes, knowledge base updates, embedding model changes, tool definition updates, or retrieval configuration changes. In production, run continuous sampling-based evaluation (e.g., evaluate 5–10% of live responses daily) and user feedback aggregation. Monitor cost and latency continuously. Conduct a full manual review of failed or low-confidence conversations at least weekly.
Can LLM evaluation be fully automated?
Not reliably for high-stakes applications. Automated evaluation (LLM-as-judge, RAGAS metrics, regression test suites) scales regression testing and catches clear failures, but misses nuanced quality issues, novel failure modes, and context-dependent correctness. Human review remains essential for calibrating automated metrics, reviewing edge cases, auditing safety-critical outputs, and assessing business-level success. A mature evaluation setup combines automated metrics for speed and human review for depth.