Quality Metrics
You cannot improve what you cannot measure, and most agent quality is not obviously measurable. Task completion looks binary, but it collapses the moment you ask whether the task was completed correctly, efficiently, and safely at the same time.

The metrics teams actually track in production are task completion rate, correctness, token and tool-call efficiency, latency, and error rate. The metrics they aspire to track, such as code quality or architectural appropriateness, resist automated measurement and require either a large language model acting as a judge or periodic human review.

Defining metrics before building is critical because they determine what you optimize for. Measure only completion rate and you get agents that technically finish tasks while producing low-quality output; ignore cost and latency and you get agents that are correct but too slow or expensive to run.
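The gap between "completed" and "completed correctly" can be made concrete with a small sketch. This is an illustration, not a standard API: the `AgentRun` record and `summarize` function are hypothetical names, and the fields simply mirror the production metrics listed above.

```python
from dataclasses import dataclass

# Hypothetical per-run record; field names are illustrative.
@dataclass
class AgentRun:
    completed: bool    # agent reported the task as finished
    correct: bool      # output passed verification (tests, judge, or human review)
    tokens: int        # total tokens consumed (prompt + completion)
    tool_calls: int    # number of tool invocations
    latency_s: float   # wall-clock time for the run
    errored: bool      # run ended in an unhandled error

def summarize(runs: list[AgentRun]) -> dict[str, float]:
    """Aggregate the production metrics over a batch of runs."""
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "correctness_rate": sum(r.correct for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "avg_tool_calls": sum(r.tool_calls for r in runs) / n,
        "avg_latency_s": sum(r.latency_s for r in runs) / n,
        "error_rate": sum(r.errored for r in runs) / n,
    }

runs = [
    AgentRun(True, True, 12_000, 8, 41.0, False),
    AgentRun(True, False, 30_000, 25, 112.0, False),  # "finished", but wrong
    AgentRun(False, False, 5_000, 3, 15.0, True),
]
summary = summarize(runs)
```

On this batch, completion rate is 2/3 while correctness is only 1/3, which is exactly the divergence that optimizing on completion rate alone would hide.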