Evaluation and Observability

Regression Testing

Regression testing for agent systems verifies that changes to prompts, tools, models, or configurations don't break previously working behavior. It targets the "fixed one thing, broke three others" pattern that is endemic to non-deterministic systems. What makes this harder than traditional software regression testing is that agent outputs are probabilistic: a test that passed yesterday can fail today on identical input with no code change, because sampling temperature and nondeterminism in model inference introduce variance that binary pass/fail assertions cannot absorb. Agent regression suites therefore need a different approach: snapshot testing against golden examples, statistical quality measurement across a test suite, or canary deployments that monitor live traffic for degradation, rather than asserting a single expected output.
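As a rough illustration of statistical quality measurement, the sketch below runs every golden case several times, reports a pass rate across the suite, and flags a regression only when the candidate falls below the baseline by more than a tolerance. All names here (GoldenCase, make_stub_agent, pass_rate, has_regressed) are hypothetical, the agent is a seeded stub standing in for a real model call, and the substring check and the 0.05 tolerance are assumptions chosen for brevity, not recommended values.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical golden answers; a real suite would load these from a dataset.
ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is 2 + 2?": "4",
}


@dataclass
class GoldenCase:
    prompt: str
    must_contain: str  # substring the output must include to count as a pass


def make_stub_agent(reliability: float, seed: int = 0) -> Callable[[str], str]:
    """Stand-in for a real agent: answers correctly with probability `reliability`."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        if rng.random() < reliability:
            return f"The answer is {ANSWERS.get(prompt, 'unknown')}."
        return "I am not sure."

    return agent


def pass_rate(agent: Callable[[str], str], cases: List[GoldenCase],
              runs_per_case: int = 5) -> float:
    """Run every case several times and return the fraction of passing runs."""
    passed = 0
    total = 0
    for case in cases:
        for _ in range(runs_per_case):
            output = agent(case.prompt)
            if case.must_contain.lower() in output.lower():
                passed += 1
            total += 1
    return passed / total


def has_regressed(baseline_rate: float, candidate_rate: float,
                  tolerance: float = 0.05) -> bool:
    """Flag a regression only when the drop exceeds the tolerated variance."""
    return candidate_rate < baseline_rate - tolerance


if __name__ == "__main__":
    cases = [GoldenCase(prompt, answer) for prompt, answer in ANSWERS.items()]
    baseline = pass_rate(make_stub_agent(0.95, seed=1), cases, runs_per_case=50)
    candidate = pass_rate(make_stub_agent(0.70, seed=2), cases, runs_per_case=50)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f} "
          f"regressed={has_regressed(baseline, candidate)}")
```

The point of comparing rates rather than single outputs is that one unlucky sample no longer fails the build; the tolerance and the number of runs per case trade off sensitivity against cost and would need tuning for a real suite.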

subtopics

Golden Datasets

Snapshot Testing

connected to

Test-Driven Agentic Development

resources

Promptfoo: Regression Testing (promptfoo.dev): How to detect and prevent LLM regression with automated guardrails

Braintrust: Experiments (braintrust.dev): Running regression experiments to compare model and prompt configurations

Anthropic: Testing Best Practices (docs.anthropic.com): Building regression test suites for Claude-based applications

LangSmith: Testing (docs.smith.langchain.com): Regression testing workflows within the LangSmith platform
