
Eval gates: unit tests for production AI

Eval gates are how production AI systems avoid silent regression. Every PR or deploy runs a hand-curated golden dataset through the live model, and the build blocks if the aggregate score drops below a configured threshold. The pattern is the closest thing AI engineering has to a unit-test discipline, and it is load-bearing.

What lives in the dataset

Each scenario is a hand-curated input plus the expected outcome: the answer your domain expert would defend. The set spans straightforward cases, hard cases the policy clearly answers, and edge cases where escalation is the correct action. Scenario coverage is the load-bearing variable; scoring sophistication is secondary.
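
As a rough illustration of the shape of a golden case, here is a minimal sketch. The field names, the example case, and the policy citation are assumptions for the sake of the example, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One hand-curated golden case. Field names are illustrative, not a fixed schema."""
    scenario_id: str
    workflow: str                   # which workflow this case exercises, e.g. "prior-auth"
    input_text: str                 # the case as the live system would receive it
    expected_outcome: str           # the answer the domain expert would defend: "approve", "deny", "escalate"
    expected_citations: list[str]   # policy sections a grounded answer should cite
    case_type: str                  # "straightforward", "hard", or "edge"

GOLDEN_SET = [
    Scenario(
        scenario_id="prior-auth-017",
        workflow="prior-auth",
        input_text="MRI request, lower back pain, three weeks of documented conservative therapy.",
        expected_outcome="escalate",
        expected_citations=["imaging-policy 4.2"],
        case_type="edge",
    ),
    # ...50+ more hand-curated scenarios per workflow
]
```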

Eval gate FAQ

What is an eval gate?

A CI-blocking test that runs a golden dataset through the live model on every deploy. The test scores outcome match, citation grounding, and confidence on approvals. If the average score drops below the configured threshold, the build exits non-zero and the deploy is blocked.
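
A minimal sketch of the gate itself, reusing the Scenario records sketched above. The run_model() stub, the 0.90 threshold, and the three 0-1 checks are assumptions to make the shape concrete, not a fixed recipe.

```python
import sys
from statistics import mean

# GOLDEN_SET and Scenario are the records sketched above; assume they live in the same module.
THRESHOLD = 0.90  # configured per workflow; illustrative value


def run_model(input_text: str):
    """Stand-in for the live-model client: must return an object with
    .outcome, .citations, and .confidence. Wire in your real client here."""
    raise NotImplementedError


def score_case(scenario, response) -> float:
    """Average three 0-1 checks: outcome match, citation grounding, confidence on approvals."""
    outcome_ok = 1.0 if response.outcome == scenario.expected_outcome else 0.0
    grounded = 1.0 if all(c in response.citations for c in scenario.expected_citations) else 0.0
    # confidence only gates approvals: a hesitant "approve" is scored as a failure
    confident = 1.0 if response.outcome != "approve" or response.confidence >= 0.8 else 0.0
    return mean([outcome_ok, grounded, confident])


def main() -> int:
    scores = [score_case(s, run_model(s.input_text)) for s in GOLDEN_SET]
    aggregate = mean(scores)
    print(f"eval gate: aggregate={aggregate:.3f} threshold={THRESHOLD:.2f} cases={len(scores)}")
    return 0 if aggregate >= THRESHOLD else 1  # non-zero exit blocks the build


if __name__ == "__main__":
    sys.exit(main())
```

The non-zero exit code is the whole integration surface: CI treats it as a failed step, so the deploy never proceeds.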

How big should the golden dataset be?

A minimum of 50 hand-curated scenarios per workflow, ideally 100+ for high-stakes verticals. The set is built with the customer's domain expert (e.g. a medical director or head of credit) and updated whenever policies or guidelines change.
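
As a sketch, the same CI job can enforce that floor before any model calls are made. The 50-per-workflow figure is the minimum from the answer above, and the workflow field is the tag from the Scenario sketch earlier.

```python
from collections import Counter

MIN_SCENARIOS_PER_WORKFLOW = 50  # floor from the guidance above; raise for high-stakes verticals


def check_dataset_size(scenarios) -> None:
    """Fail fast if any workflow's golden set falls below the agreed minimum."""
    counts = Counter(s.workflow for s in scenarios)
    thin = {w: n for w, n in counts.items() if n < MIN_SCENARIOS_PER_WORKFLOW}
    if thin:
        raise SystemExit(f"golden set too thin: {thin} (need >= {MIN_SCENARIOS_PER_WORKFLOW} per workflow)")
```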

What does it catch?

Prompt regressions, model-version drift, policy-corpus updates that break retrieval, and the long tail of changes that quietly degrade outcomes. Without a gate, these slip into production and surface as customer complaints weeks later.

How is this different from traditional unit tests?

Unit tests exercise deterministic code; evals exercise a probabilistic system. The gate is not 'pass/fail per case' but 'aggregate score above threshold'. Individual cases can flip, but the system as a whole must not regress.
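
To make the contrast concrete, here is a pytest-style sketch that assumes the GOLDEN_SET, run_model(), score_case(), and THRESHOLD from the sketches above. The assertion is on the mean score, never on a single case.

```python
from statistics import mean


def test_eval_gate_holds():
    # Deliberately not asserting per case; individual scores may flip between runs.
    scores = [score_case(s, run_model(s.input_text)) for s in GOLDEN_SET]
    aggregate = mean(scores)
    assert aggregate >= THRESHOLD, (
        f"aggregate eval score {aggregate:.3f} regressed below threshold {THRESHOLD:.2f}"
    )
```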

Want to see this in your environment?

30-minute discovery call. Draft SOW within 5 business days.

Talk to us about a pilot