Idea In 30 Seconds
An eval harness is a way to run the same scenario set against an agent, score the results with the same rules, and compare a candidate against a baseline.
Problem
Without an eval harness, teams often test agents manually:
- run several chat requests;
- review a few sample answers;
- conclude that a change looks safe.
This does not provide a stable signal: a change can look fine on random examples, yet break critical production scenarios.
The most common outcomes:
- impossible to compare `candidate` and `baseline` fairly;
- difficult to reproduce regressions;
- CI has no clear rule for when to block release.
Core Concept / Model
An eval harness is not one test. It is a validation pipeline: a fixed dataset, controlled run conditions, scoring, comparison with a baseline, and reporting.
| Component | What it does |
|---|---|
| Dataset | Stores stable scenarios and expected behavior |
| Runner | Runs the agent on each scenario in the same conditions and collects run outputs |
| Evaluators | Apply deterministic checks, LLM-as-a-judge scoring, and quality metrics |
| Baseline comparator | Compares candidate against baseline |
| Report + CI gate | Builds summary and decides pass/fail for release |
The more stable these components are, the lower the chance that a diff between `candidate` and `baseline` is caused by run conditions rather than a real behavior change.
How It Works
In practice, the eval harness runs as part of the release pipeline. Every change goes through the same scenario set.
How one eval harness run works
- Dataset - the fixed set of cases is loaded.
- Runner - the agent runs each case in identical conditions.
- Evaluators - deterministic checks and, when needed, LLM-as-a-judge scoring are applied.
- Baseline comparison - `candidate` is compared with `baseline` on the same cases.
- Report - a case-level report and an overall summary are saved.
- Gate - CI passes or blocks the release based on thresholds.
An eval harness does not replace unit tests. Unit tests validate local components, while the harness validates full-system behavior on end-to-end scenarios.
Implementation
In practice, an eval harness relies on several simple rules. The examples below are schematic and not tied to any specific framework.
1. Test case structure
```python
case = {
    "id": "price_btc_basic",
    "input": "What is the price of BTC?",
    "expected_tool": "crypto_price_api",
    "checks": ["tool_selection", "valid_output_schema"],
}
```
Clear cases make regression analysis easier and reduce ambiguity during run review.
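One way to keep case IDs stable is to validate them at load time. Below is a minimal sketch, assuming the dataset lives in a versioned JSON file; `load_dataset` and the file layout are illustrative, not from the article:

```python
import json

def load_dataset(path):
    """Load a versioned case file and enforce unique, stable case IDs."""
    # Hypothetical layout: {"version": "2024-06-01", "cases": [...]}
    with open(path) as f:
        data = json.load(f)
    ids = [c["id"] for c in data["cases"]]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case IDs break run-to-run comparison")
    return data["cases"]
```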
2. Runner for case execution
```python
def run_case(agent, case):
    result = agent.run(case["input"])
    return {
        "case_id": case["id"],
        "selected_tool": result.selected_tool,
        "output": result.output,
        "stop_reason": result.stop_reason,
    }
```
The new version and the baseline must run in identical conditions: the same timeouts, tool mocks, limits, and runtime environment settings.
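One lightweight way to enforce this is to pin the conditions in a single immutable config object and share that one instance across both runs. A sketch under that assumption; `RunConfig` and its fields are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """One pinned set of run conditions, shared by candidate and baseline."""
    timeout_s: float = 30.0
    max_agent_steps: int = 10
    use_tool_mocks: bool = True

CONFIG = RunConfig()  # frozen=True turns accidental per-run drift into an error
```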
3. Scoring and baseline comparison
```python
def evaluate_case(run_result, case):
    checks = {
        "tool_selection": run_result["selected_tool"] == case["expected_tool"],
        "valid_output_schema": isinstance(run_result["output"], dict),
    }
    return {"passed": all(checks.values()), "checks": checks}
```
```python
candidate = run_eval_suite(agent=candidate_agent, dataset=dataset)
baseline = load_baseline_report("reports/baseline.json")
diff = compare(candidate, baseline)
```
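The article leaves `run_eval_suite` and `compare` undefined. A minimal sketch consistent with the snippets above might look like this; the regressed/improved split is one possible diff shape, not a prescribed format:

```python
def run_eval_suite(agent, dataset):
    """Run and score every case; results are keyed by case ID."""
    report = {}
    for case in dataset:
        run_result = run_case(agent, case)
        report[case["id"]] = evaluate_case(run_result, case)
    return report

def compare(candidate, baseline):
    """Per-case diff: which cases regressed or improved vs the baseline."""
    shared = candidate.keys() & baseline.keys()
    return {
        "regressed": sorted(i for i in shared
                            if baseline[i]["passed"] and not candidate[i]["passed"]),
        "improved": sorted(i for i in shared
                           if candidate[i]["passed"] and not baseline[i]["passed"]),
    }
```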
For open tasks, deterministic checks are usually extended with LLM-as-a-judge as a separate scoring layer.
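A sketch of what that separate layer could look like; `judge_case`, the `judge_llm` client, and its `complete` method are placeholders for whatever judge model and SDK you use:

```python
def judge_case(run_result, case, judge_llm):
    """LLM-as-a-judge verdict, reported separately from deterministic checks."""
    prompt = (
        f"Question: {case['input']}\n"
        f"Answer: {run_result['output']}\n"
        "Reply with PASS or FAIL and one short reason."
    )
    verdict = judge_llm.complete(prompt)  # placeholder client call
    return {"judge_passed": verdict.strip().upper().startswith("PASS"),
            "judge_raw": verdict}
```

Keeping the judge verdict in its own report section makes it possible to see whether a failure came from a deterministic check or from subjective scoring.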
The baseline should also be versioned and tied to the exact model, prompt, and runtime config.
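In practice that means saving metadata next to the baseline scores. The fields below are illustrative (the hash value is a placeholder); `write_json` is the same helper used in the gate snippet further down:

```python
baseline_meta = {
    "model": "gpt-4o-2024-08-06",   # pinned snapshot, not an alias
    "prompt_sha256": "9f2c...",     # hash of the exact system prompt
    "dataset_version": "2024-06-01",
    "run_config": {"timeout_s": 30.0, "use_tool_mocks": True},
}
write_json("reports/baseline-meta.json", baseline_meta)
```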
4. Report and CI gate
```python
summary = build_summary(candidate, diff)
write_json("reports/eval-summary.json", summary)  # persist before gating

if summary["task_success_rate"] < 0.92:
    fail("gate_failed:task_success_rate")
if summary["hallucination_rate"] > 0.03:
    fail("gate_failed:hallucination_rate")
```

Note that the summary is written before the gate checks, so the report survives even when the release is blocked.
A good eval harness always stores artifacts: case-level outputs, failure reasons, the diff against the baseline, and the final summary.
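A small helper can make this the default rather than an afterthought. A sketch; `persist_artifacts` and the file names are illustrative, and `write_json` is the same helper as above:

```python
def persist_artifacts(candidate, diff, summary, out_dir="reports"):
    """Store everything needed to debug a failed gate without re-running."""
    write_json(f"{out_dir}/eval-cases.json", candidate)  # per-case checks
    write_json(f"{out_dir}/eval-diff.json", diff)        # regressed/improved IDs
    write_json(f"{out_dir}/eval-summary.json", summary)  # top-level metrics
```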
5. Release gate in overall strategy
Release-blocking criteria and CI gate thresholds are described in Testing Strategy, so they are not duplicated in every article.
Typical Mistakes
Unstable dataset
Scenarios keep changing "on the fly", so results from different runs cannot be compared fairly.
Typical cause: the dataset is not versioned and does not have fixed case IDs.
Unpinned model version
LLM providers sometimes update models without changing the generic name.
If the version is not pinned to a dated snapshot (for example, gpt-4o-2024-08-06), results can change between runs.
Typical cause: a model alias (gpt-4o, sonnet) is used without version pinning.
Production systems usually pin a concrete model version or snapshot.
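A cheap guardrail is to fail fast whenever the harness is pointed at a bare alias. A sketch; `assert_pinned` and the alias list are illustrative:

```python
BARE_ALIASES = {"gpt-4o", "sonnet"}  # names that drift when the provider updates

def assert_pinned(model_name):
    """Refuse to run the harness against an unpinned model alias."""
    if model_name in BARE_ALIASES:
        raise ValueError(
            f"model {model_name!r} is an alias; pin a dated snapshot "
            "such as gpt-4o-2024-08-06"
        )
```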
Manual runs instead of automation
The harness is executed only when "there is time", not on every meaningful change.
Typical cause: no CI integration and no clear pass/fail gate.
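The simplest fix is an entrypoint that CI can run on every change, exiting nonzero when the gate fails. A sketch that ties together the snippets and helpers above (`load_dataset`, `persist_artifacts`, and the `evals/cases.json` path are the illustrative pieces introduced earlier, not fixed names):

```python
import sys

def main():
    dataset = load_dataset("evals/cases.json")
    candidate = run_eval_suite(agent=candidate_agent, dataset=dataset)
    baseline = load_baseline_report("reports/baseline.json")
    diff = compare(candidate, baseline)
    summary = build_summary(candidate, diff)
    persist_artifacts(candidate, diff, summary)
    return 1 if diff["regressed"] else 0  # nonzero exit blocks the pipeline

if __name__ == "__main__":
    sys.exit(main())
```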
No comparison with baseline
The team looks only at absolute `candidate` metrics and misses subtle regressions.
Typical cause: the report does not include a diff between `candidate` and `baseline`.
Mixed deterministic and non-deterministic checks
Deterministic checks and LLM-as-a-judge are merged into one "total score", so it is hard to understand what failed.
Typical cause: no separate scoring sections for different check types.
Missing run artifacts
Only the final success percentage is stored, without traces and case-level checks.
Typical cause: the harness does not persist detailed outputs into report files.
Unstable eval runs
The same case passes and fails randomly, so teams stop trusting reports.
Typical cause: unstable external environment, missing mocks, floating timeouts, or inconsistent run conditions.
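One way to surface flakiness before trust erodes is to re-run suspect cases and flag inconsistent verdicts. A sketch built on the earlier snippets; `detect_flaky` and the repeat count are illustrative:

```python
def detect_flaky(agent, case, repeats=3):
    """Run one case several times; mixed verdicts mark it as flaky."""
    verdicts = {
        evaluate_case(run_case(agent, case), case)["passed"]
        for _ in range(repeats)
    }
    return len(verdicts) > 1  # saw both True and False
```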
Summary
- An eval harness makes agent testing repeatable and comparable.
- Release decisions should rely on the `candidate` vs `baseline` diff, not on manual examples.
- Case-level artifacts matter as much as top-level metrics.
- Without a CI gate, the eval harness becomes a "report for report's sake".
FAQ
Q: Is an eval harness just a test suite?
A: No. It is a managed process: dataset, runner, evaluators, baseline comparison, and CI gate.
Q: Can we skip LLM-as-a-judge?
A: Yes, if tasks are well covered by deterministic checks. For open tasks, LLM-as-a-judge is usually added as a separate scoring layer.
Q: How often should eval harness run?
A: At minimum on every change that can affect agent behavior: model, prompts, tools, or runtime rules.
Q: What is most important in the first harness version?
A: Stable dataset, saved baseline, clear pass/fail thresholds, and run artifacts.
What Next
For the full picture, start with Testing Strategy. Then cover critical logic with Unit Testing, build stable Golden Datasets, and add Regression Testing for cross-version changes.
When the first real incidents appear, add Replay and Debugging, and include those cases in your eval harness dataset.