Evaluation and Observability

Quality Metrics

definition

Quality metrics define what "good agent behavior" looks like in quantitative terms, providing the measurement foundation for eval-driven development and production monitoring. Key metric categories include task completion rate (did the agent finish the task?), correctness (did it produce the right result?), efficiency (how many tokens and tool calls did it use?), latency (how long did it take?), and safety (did it avoid harmful actions?). The main challenge is that many important quality dimensions, such as code quality, architectural appropriateness, and user satisfaction, are difficult to measure automatically and may require LLM-as-judge evaluation or human review. Choosing the right metrics is critical because they determine what you optimize for: measure only completion rate and you will build agents that "complete" tasks with low-quality output; measure only correctness and you will ignore cost and speed. This concept connects to eval-driven development (the workflow these metrics support), eval frameworks (the tooling that measures them), observability platforms (tracking metrics in production), and cost tracking (the economic dimension of agent performance).
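As a rough illustration of how the five categories above might be reported side by side, here is a minimal Python sketch. The `AgentRun` record, its field names, and the `summarize` function are hypothetical and not taken from any particular eval framework. The sketch assumes each run has already been logged with these fields, and that `correct` is left as `None` when no automatic check exists (for example, when the run still awaits LLM-as-judge or human review).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentRun:
    """One logged agent run. Hypothetical schema, for illustration only."""
    completed: bool            # task completion: did the agent finish?
    correct: Optional[bool]    # correctness; None if not yet graded
    tokens: int                # efficiency: tokens consumed
    tool_calls: int            # efficiency: tool invocations
    latency_s: float           # latency: wall-clock seconds
    safety_violations: int     # safety: count of flagged actions


def summarize(runs: list[AgentRun]) -> dict[str, Optional[float]]:
    """Aggregate per-run records into one value per metric category."""
    if not runs:
        raise ValueError("need at least one run to summarize")

    n = len(runs)
    # Correctness is computed only over runs that have a verdict.
    graded = [r for r in runs if r.correct is not None]

    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "correctness_rate": (
            sum(bool(r.correct) for r in graded) / len(graded)
            if graded else None
        ),
        "mean_tokens": sum(r.tokens for r in runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "safety_pass_rate": sum(r.safety_violations == 0 for r in runs) / n,
    }
```

Reporting all of these values together, rather than a single headline number, makes it easier to notice when a high completion rate coexists with a low correctness rate or rising token cost.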

related concepts

Deterministic vs Probabilistic Evals