
Eval-Driven Development

Definition

Eval-driven development treats evaluations as first-class development artifacts, systematically measuring agent behavior against defined criteria before, during, and after changes — analogous to test-driven development but for non-deterministic AI systems. Instead of manually checking "does this seem right?", eval-driven teams build evaluation datasets that encode expected behavior and run them automatically whenever prompts, tools, or models change. This practice is foundational because without systematic evaluation, prompt changes that improve one use case can silently degrade others, creating a whack-a-mole pattern that prevents meaningful improvement.

The key insight is that you can't improve what you don't measure, and with LLM-based systems, subjective "vibes-based" assessment consistently underperforms even simple automated evals. This concept connects to eval frameworks for the tooling infrastructure, agent benchmarks for standardized evaluation suites, quality metrics for defining what to measure, and prompt iteration for the improvement workflow that evals enable.
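The loop described above — an evaluation dataset encoding expected behavior, run automatically on every change — can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; names like `EvalCase`, `run_evals`, and `toy_agent` are made up for the example, and the substring check stands in for whatever grading criterion a real suite would use.

```python
# Minimal eval-harness sketch. All names here are illustrative assumptions,
# not taken from a specific eval framework.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str    # input handed to the agent
    expected: str  # substring the response must contain to pass


def run_evals(agent_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case against the agent and return the pass rate."""
    passed = 0
    for case in cases:
        response = agent_fn(case.prompt)
        ok = case.expected.lower() in response.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.prompt!r}")
    return passed / len(cases)


# Deterministic stand-in for a real LLM call, so the sketch runs as-is.
def toy_agent(prompt: str) -> str:
    if "France" in prompt:
        return "The capital of France is Paris."
    return "I don't know."


cases = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("What is the capital of Peru?", "Lima"),
]

score = run_evals(toy_agent, cases)
print(f"pass rate: {score:.0%}")
```

In practice a harness like this is wired into CI so the dataset re-runs whenever a prompt, tool definition, or model version changes, turning "does this seem right?" into a tracked pass rate.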