Deterministic vs Probabilistic Evals
Build deterministic evaluations first, because they are the cheapest and fastest to run. They apply fixed, rule-based criteria with binary pass/fail outcomes: does the output match the expected regex? Does the code compile? Do all required fields appear in the JSON? That gives you an immediate feedback loop that costs fractions of a cent per run. Probabilistic evaluations use statistical methods or a language model acting as a judge to score quality on a spectrum. They capture nuanced dimensions such as helpfulness or reasoning quality that resist reduction to a rule, but their scores vary from run to run, so you need larger sample sizes before you can trust them. The most effective evaluation suites layer both: deterministic checks serve as fast guardrails for format validation and regression testing, while probabilistic assessment is reserved for the quality dimensions that determine whether users are actually satisfied.
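The layering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `ORD-` identifier pattern, the required field names, and the `judge` callable (a stand-in for whatever model-based scorer you use, returning a number per output) are all hypothetical. The confidence interval uses a normal approximation, which is only a rough guide at small sample sizes.

```python
import json
import math
import re
import statistics

# Hypothetical rules, chosen only for illustration.
ORDER_ID = re.compile(r"^ORD-\d{6}$")
REQUIRED_FIELDS = {"order_id", "status"}


def deterministic_check(output: str) -> bool:
    """Binary pass/fail: a JSON object with the required fields and a well-formed id."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not REQUIRED_FIELDS.issubset(data):
        return False
    return bool(ORDER_ID.match(str(data["order_id"])))


def mean_with_ci(scores, z=1.96):
    """Mean score plus an approximate 95% half-width (normal approximation).

    Requires at least two scores; the half-width shrinks roughly with
    1 / sqrt(n), which is why noisy judge scores need larger samples.
    """
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


def evaluate(outputs, judge):
    """Run the cheap deterministic gate first; only passing outputs reach the judge.

    `judge` is an assumed callable mapping an output string to a numeric score.
    """
    passed = [o for o in outputs if deterministic_check(o)]
    scores = [judge(o) for o in passed]
    return {"pass_rate": len(passed) / len(outputs), "scores": scores}
```

Gating the judge behind the deterministic check keeps the expensive, noisy scorer away from outputs that already failed on format, and reporting the half-width alongside the mean makes it visible when a sample is too small to support a conclusion.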