When a traditional API returns a wrong value, you write a unit test. When an AI agent makes a bad decision, you stare at 47 intermediate reasoning steps and wonder where reality diverged from expectation. This is the evaluation problem — and in 2026 it is the single biggest bottleneck between teams that ship reliable agentic systems and teams that ship demo-only toys.
Why Traditional Testing Fails for Agents
Unit tests assume deterministic outputs. Give a function the same input, get the same output, compare against an expected value. Agent behavior breaks this contract in three ways.
First, non-determinism. Even at temperature=0, model outputs vary across API versions, system load, and context window truncation. A test that passes Tuesday may fail Thursday after a silent model update.
Second, emergent failure modes. An agent that correctly handles each individual tool call can still fail catastrophically through poor sequencing, context pollution between steps, or hallucinated tool arguments that look plausible. These failures only appear at the integration level.
Third, the evaluation is itself an LLM call. You cannot write a regex to check whether an agent's research summary was accurate. You need another model — an "evaluator" — to judge output quality. This introduces its own reliability concerns.
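In practice, evaluator reliability comes down to two mechanical details: a rubric-style prompt and defensive parsing of the judge's reply. Here is a minimal sketch; the prompt wording and JSON reply shape are illustrative assumptions, not any framework's API, and the actual judge-model call is omitted:

```python
import json

def build_judge_prompt(question: str, answer: str) -> str:
    """Build a rubric-based prompt for an evaluator ("judge") model.
    The rubric wording here is an illustration, not a standard."""
    return (
        "You are grading an AI agent's answer for factual accuracy.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        'Reply with JSON only: {"score": <0 to 1>, "reason": "<one sentence>"}'
    )

def parse_judge_reply(reply: str) -> float:
    """Extract the numeric score; fail closed (score 0.0) on malformed
    output, since judge models occasionally ignore format instructions."""
    try:
        return float(json.loads(reply)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0

# The judge call itself (e.g. via your model provider's SDK) is omitted;
# in practice you send build_judge_prompt(...) and parse the reply here.
print(parse_judge_reply('{"score": 0.8, "reason": "Mostly grounded."}'))
```

Failing closed on unparseable judge output is a deliberate choice: a silent default-pass hides exactly the regressions the evaluator exists to catch.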
The Emerging Evaluation Stack
The open-source evaluation ecosystem has matured quickly. RAGAS (Retrieval-Augmented Generation Assessment) gives you five core RAG metrics: faithfulness, answer relevancy, context precision, context recall, and context entity recall. Most are LLM-graded rather than exact-match, though context recall does require a ground-truth reference answer. For RAG-heavy agents, RAGAS has become the de facto starting point. It runs as a Python library and integrates with LangSmith traces out of the box.
DeepEval, built by Confident AI, extends this with 14+ evaluation metrics including G-Eval (LLM-graded open-ended assessment), Hallucination Score, Tool Correctness, and Agent Goal Accuracy. DeepEval's unit test syntax looks familiar to pytest users:
```python
from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Book a flight to Berlin for next Tuesday",
    actual_output=agent.run("Book a flight to Berlin for next Tuesday"),
    tools_called=[ToolCall(name="search_flights", input_parameters={...})],
    expected_tools=[ToolCall(name="search_flights")],
)

metric = ToolCorrectnessMetric(threshold=0.8)
evaluate([test_case], [metric])
```

LangSmith from LangChain has evolved from a tracing tool into a full evaluation platform. Its Dataset + Evaluator workflow lets teams collect production traces, annotate golden outputs, and run regression tests against new model versions. Anthropic's own teams have started publishing evaluation patterns using LangSmith as the harness.
Braintrust (Series A, $36M) and Patronus AI (Series A, $17M) are the two venture-backed pure-play eval platforms gaining the most enterprise traction. Braintrust's scoring interface lets non-engineers annotate outputs, which matters for teams that need domain experts — lawyers, doctors, financial analysts — to validate agent outputs.
Key Metrics to Track
The metrics that matter depend on your agent's purpose, but a solid baseline for most production agents includes:
- Task Completion Rate (TCR): Does the agent reach the end goal? Binary, easy to compute, should be your north star.
- Tool Call Accuracy: Did the agent call the right tools with the right arguments? Use exact-match for structured APIs and semantic similarity for open-ended inputs.
- Step Efficiency: How many tool calls did it take versus the optimal path? An agent that completes a task in 12 steps when 4 would suffice is burning tokens and latency.
- Faithfulness Score: Are the agent's stated reasons grounded in the actual tool outputs it received? Unfaithful reasoning is a hallucination signal.
- Trajectory Accuracy: Did the agent follow the expected reasoning path? Critical for compliance-sensitive applications in finance and healthcare.
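The first three metrics above need no LLM judge at all; they fall out of trace data directly. A minimal sketch, where the trace shape (`completed`, `tool_calls` fields) is a hypothetical schema standing in for whatever your tracing layer records:

```python
def task_completion_rate(traces):
    """TCR: fraction of traces where the agent reached its end goal."""
    return sum(t["completed"] for t in traces) / len(traces)

def tool_call_accuracy(actual, expected):
    """Exact-match accuracy over ordered tool calls; use semantic
    similarity instead for open-ended arguments."""
    hits = sum(a == e for a, e in zip(actual, expected))
    return hits / max(len(expected), 1)

def step_efficiency(trace, optimal_steps):
    """Optimal path length over actual steps taken (1.0 means optimal)."""
    return optimal_steps / max(len(trace["tool_calls"]), optimal_steps)

# Hypothetical traces with the minimal fields these metrics need
traces = [
    {"completed": True,  "tool_calls": ["search_flights", "book_flight"]},
    {"completed": True,  "tool_calls": ["search_flights", "search_flights",
                                        "search_flights", "book_flight"]},
    {"completed": False, "tool_calls": ["search_hotels"]},
]

print(task_completion_rate(traces))                 # 2 of 3 traces succeed
print(step_efficiency(traces[1], optimal_steps=2))  # 2 optimal vs 4 actual
```

Faithfulness and trajectory accuracy, by contrast, generally require an LLM judge or an annotated reference path, which is why they sit further up the cost curve.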
Building an Eval Dataset That Actually Works
The most common mistake teams make is building evaluation datasets from synthetic data. Synthetic evals catch synthetic failures. Real failures come from real user inputs — the ambiguous, the contradictory, the adversarial.
A practical approach: spend the first two weeks after launch in "shadow mode" — log every agent trace without acting on it, then manually review 200 traces with a domain expert. Categorize failures into a taxonomy (wrong tool choice, bad argument, reasoning loop, hallucinated constraint, premature stopping). Build your dataset around these real failure categories, not hypothetical ones.
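The tallying step of that review is trivial but worth automating, since the counts drive how you weight the dataset. A sketch with hypothetical labels standing in for a real review session:

```python
from collections import Counter

# Hypothetical labels from a manual review; None marks a successful trace
reviewed = [
    {"trace_id": 101, "failure": "wrong_tool_choice"},
    {"trace_id": 102, "failure": None},
    {"trace_id": 103, "failure": "reasoning_loop"},
    {"trace_id": 104, "failure": "wrong_tool_choice"},
    {"trace_id": 105, "failure": "hallucinated_constraint"},
]

taxonomy = Counter(t["failure"] for t in reviewed if t["failure"] is not None)

# The most common real categories deserve the most test cases
for category, count in taxonomy.most_common():
    print(f"{category}: {count}")
```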
Aim for at least 500 test cases before calling your eval suite meaningful. Weight edge cases and failure modes more heavily than happy paths — happy paths are almost always fine; it's the edges that kill you in production.
The Eval Engineer Role
In 2025, "evaluation" was a part-time responsibility of whoever built the agent. In 2026, fast-moving companies are hiring dedicated AI Evaluation Engineers — a role that blends ML engineering, QA, and statistical thinking.
Companies like Cohere, Scale AI, Weights & Biases, and Anthropic have all posted dedicated eval roles in Q1 2026, with compensation ranging from $160,000 to $240,000 at the senior level. The role typically owns: the eval framework choice and configuration, the golden dataset curation process, regression test automation in CI/CD, and the metrics dashboards that tell product teams whether a new model version is safe to ship.
The skills that make someone effective in this role are unusual: you need enough ML depth to understand why models fail, enough software engineering to build reliable test infrastructure, and enough product sense to know which failure modes matter to users. Former QA engineers who have leveled up on ML, and former ML researchers who care about production reliability, are both strong fits.
If this role interests you, browse jobs on AgenticCareers.co — we currently track dozens of open eval engineering roles across startups and enterprise AI teams.
Continuous Evaluation in Production
Evaluation cannot be a one-time pre-launch activity. Model providers update weights without notice. Your tool APIs change. User inputs drift. A monthly offline eval run will miss production regressions that emerge within days.
The pattern gaining adoption at mature teams: a canary layer that routes 5% of production traffic to a new model version, a real-time scorer that evaluates outputs using a cheap judge model (GPT-4o-mini works well for this), and an alert threshold that pauses the rollout if TCR drops by more than 3 percentage points versus baseline.
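The alerting rule at the end of that pipeline is simple enough to sketch directly; the 3-point threshold is the one quoted above, and you would tune it to your traffic volume so noise does not trigger false pauses:

```python
def should_pause_rollout(baseline_tcr: float, canary_tcr: float,
                         max_drop_pts: float = 3.0) -> bool:
    """Pause the canary if Task Completion Rate drops by more than
    max_drop_pts percentage points versus the baseline."""
    drop_pts = (baseline_tcr - canary_tcr) * 100
    return drop_pts > max_drop_pts

print(should_pause_rollout(0.92, 0.88))  # 4-point drop: pause
print(should_pause_rollout(0.92, 0.90))  # 2-point drop: keep rolling out
```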
Instrumenting this properly requires roughly two engineer-weeks of initial setup and ongoing maintenance overhead of about 20% of one engineer's time. The teams that invest in it ship new model versions in days rather than weeks, because they trust the safety net.
What to Build Next Quarter
If you are leading an agent team and have no formal eval infrastructure today, here is a prioritized roadmap: (1) integrate DeepEval or RAGAS into your existing pytest suite this week — it takes under a day; (2) stand up LangSmith or Braintrust for trace capture before your next production deployment; (3) conduct a manual review session with a domain expert to build your initial golden dataset; (4) automate regression testing in CI/CD so every PR runs your eval suite.
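Step (4) can start as nothing more than a pytest file over your golden dataset. A minimal sketch, where `fake_agent` and the case schema are placeholders for your own harness, not any framework's API:

```python
# test_agent_regression.py -- collected by pytest on every PR

GOLDEN_CASES = [
    {"input": "Book a flight to Berlin for next Tuesday",
     "expected_tool": "search_flights"},
    {"input": "Cancel my hotel reservation",
     "expected_tool": "cancel_booking"},
]

def fake_agent(prompt: str) -> dict:
    # Placeholder: a real agent would plan and call tools here
    tool = "search_flights" if "flight" in prompt else "cancel_booking"
    return {"tools_called": [tool]}

def test_expected_tools_are_called():
    for case in GOLDEN_CASES:
        result = fake_agent(case["input"])
        assert case["expected_tool"] in result["tools_called"]
```

Once this runs in CI, swapping the placeholder for real DeepEval or RAGAS metrics is an incremental change rather than a new project.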
The teams winning in the agentic era are not necessarily the ones with the best models. They are the ones who know, with confidence, when their agents are working and when they are not.