## Problem (what breaks first)
Your agent "works" in dev.
Then you change something boring:
- a tool schema key
- retry/backoff defaults
- the stop condition
- the model version
And suddenly production looks like:
- 3× more tool calls
- 2× cost overnight
- runs that never hit `finish` and just end with "stop: budget"
If you can't reproduce a run deterministically, you don't have a bug; you have archaeology.
## Why this fails in production
Agents fail differently than normal code because they're driven by:
- a probabilistic planner (the model)
- a runtime loop (your orchestration)
- side effects (tools)
- external instability (429/5xx, partial responses, timeouts)
Most teams "test" the prompt. That's not enough. You need to test the loop contract:

- inputs → actions → tool calls → trace → `stop_reason`

If stop reasons and traces aren't stable, nothing else is.
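One way to make that contract concrete is to pin the result shape in a type plus a tiny assertion helper. This is a sketch, not a library API: `StopReason` matches the stop-reason taxonomy used in the loop code later on this page, and `assert_contract` is an illustrative name.

```python
from typing import Any, Dict, List, Literal, TypedDict

# One assumed stop-reason taxonomy; pinning it means drift fails a test.
StopReason = Literal["finish", "invalid_action", "max_tool_calls", "max_steps"]

class RunResult(TypedDict):
    output: str
    trace: List[Dict[str, Any]]
    stop_reason: StopReason

def assert_contract(result: RunResult) -> None:
    # Every run must end with a known stop reason and carry a trace list.
    assert result["stop_reason"] in ("finish", "invalid_action", "max_tool_calls", "max_steps")
    assert isinstance(result["trace"], list)
    assert isinstance(result["output"], str)
```

Run `assert_contract` on every result in every test; it turns "trace shape drifted" from a production surprise into a red CI job.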
## Diagram: what a unit-testable agent looks like

## Real code: a unit-testable loop (Python + JS)
The trick is boring: dependency injection. Your loop accepts the two things it can't control:

- `llm.next_action(...)`
- `tools.call(...)`

Everything else should be deterministic and asserted.
```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol

@dataclass(frozen=True)
class Budget:
    max_steps: int = 10
    max_tool_calls: int = 10

class LLM(Protocol):
    def next_action(self, state: Dict[str, Any]) -> Dict[str, Any]: ...

class Tools(Protocol):
    def call(self, name: str, args: Dict[str, Any]) -> Dict[str, Any]: ...

def run_agent(task: str, *, llm: LLM, tools: Tools, budget: Budget) -> Dict[str, Any]:
    trace: List[Dict[str, Any]] = []
    tool_calls = 0
    state: Dict[str, Any] = {"task": task, "notes": []}
    for step in range(budget.max_steps):
        action = llm.next_action(state)
        trace.append({"step": step, "action": action})
        if action.get("type") == "finish":
            return {"output": action.get("answer", ""), "trace": trace, "stop_reason": "finish"}
        if action.get("type") != "tool":
            return {"output": "", "trace": trace, "stop_reason": "invalid_action"}
        tool_calls += 1
        if tool_calls > budget.max_tool_calls:
            return {"output": "", "trace": trace, "stop_reason": "max_tool_calls"}
        obs = tools.call(action["tool"], action.get("args", {}))
        trace.append({"step": step, "observation": obs, "tool": action["tool"]})
        state["notes"].append(obs)
    return {"output": "", "trace": trace, "stop_reason": "max_steps"}
```
```python
# --- unit test (pytest style) ---
class FakeLLM:
    def __init__(self):
        self.n = 0

    def next_action(self, state):
        self.n += 1
        if self.n == 1:
            return {"type": "tool", "tool": "http.get", "args": {"url": "https://example.com"}}
        return {"type": "finish", "answer": "ok"}

class FakeTools:
    def __init__(self):
        self.calls = []

    def call(self, name, args):
        self.calls.append((name, args))
        return {"ok": True, "status": 200, "body": "hello"}

def test_unit_loop_contract():
    out = run_agent(
        "fetch once and finish",
        llm=FakeLLM(),
        tools=FakeTools(),
        budget=Budget(max_steps=5, max_tool_calls=3),
    )
    assert out["stop_reason"] == "finish"
    assert len(out["trace"]) >= 2
```

The same loop in JavaScript:

```js
export function runAgent(task, { llm, tools, budget }) {
  const trace = [];
  let toolCalls = 0;
  const state = { task, notes: [] };
  for (let step = 0; step < budget.maxSteps; step++) {
    const action = llm.nextAction(state);
    trace.push({ step, action });
    if (action?.type === "finish") {
      return { output: action.answer ?? "", trace, stop_reason: "finish" };
    }
    if (action?.type !== "tool") {
      return { output: "", trace, stop_reason: "invalid_action" };
    }
    toolCalls += 1;
    if (toolCalls > budget.maxToolCalls) {
      return { output: "", trace, stop_reason: "max_tool_calls" };
    }
    const obs = tools.call(action.tool, action.args || {});
    trace.push({ step, tool: action.tool, observation: obs });
    state.notes.push(obs);
  }
  return { output: "", trace, stop_reason: "max_steps" };
}
```
```js
// --- unit test (jest style) ---
test("unit loop contract", () => {
  const llm = {
    n: 0,
    nextAction() {
      this.n += 1;
      if (this.n === 1) return { type: "tool", tool: "http.get", args: { url: "https://example.com" } };
      return { type: "finish", answer: "ok" };
    },
  };
  const tools = { calls: [], call(name, args) { this.calls.push([name, args]); return { ok: true, status: 200 }; } };
  const out = runAgent("fetch once and finish", {
    llm,
    tools,
    budget: { maxSteps: 5, maxToolCalls: 3 },
  });
  expect(out.stop_reason).toBe("finish");
  expect(tools.calls.length).toBe(1);
});
```

## Real failure (the one that hurts)
We once changed a retry default in a shared tool wrapper. Nothing "crashed".
But tool calls doubled on a busy route:
- average tool calls/run: 8 → 16
- cost impact: +~$900/day (tokens + tool credits)
- on-call time: ~3 hours to prove it wasn't "the model being weird"
The fix wasn't a better prompt. The fix was a unit test that asserted:
- max tool calls per run stays within a bound
- the `stop_reason` taxonomy doesn't drift
- the tool gateway doesn't retry twice (agent retries + tool retries = storm)
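The retry-storm assertion can be a few lines. This is a sketch: `RetryingGateway` and `AlwaysFailingTools` are hypothetical names, and real code would catch a narrower error type than `Exception`.

```python
# Sketch: assert the tool gateway retries at most once, so agent-level
# retries cannot multiply into a storm.
class RetryingGateway:
    def __init__(self, inner, max_retries=1):
        self.inner = inner
        self.max_retries = max_retries
        self.attempts = 0  # observable in tests

    def call(self, name, args):
        last_err = None
        for _ in range(1 + self.max_retries):
            self.attempts += 1
            try:
                return self.inner.call(name, args)
            except Exception as e:  # real code: catch a specific ToolError
                last_err = e
        raise last_err

class AlwaysFailingTools:
    def call(self, name, args):
        raise RuntimeError("503")

def test_gateway_does_not_storm():
    gw = RetryingGateway(AlwaysFailingTools(), max_retries=1)
    try:
        gw.call("http.get", {"url": "https://example.com"})
    except RuntimeError:
        pass
    # 1 initial attempt + 1 retry, never more.
    assert gw.attempts == 2
```

If this test had existed, the retry-default change would have failed CI instead of doubling production traffic.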
## Trade-offs
- Unit tests won't prove the model is "smart". They prove your loop is safe.
- Deterministic stubs can hide real-world tool flakiness (that's what replay tests are for).
- You'll write more boring code. You'll also page less.
## When NOT to use this
Don't unit test "prompt quality" as if it's a deterministic function. If the goal is style/tone, use sampling + evals.
Do unit test:
- budgets
- tool allowlists
- stop reasons
- action schema validation
- idempotency behavior
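The allowlist and schema items, for instance, can be tested with no model in the loop. `ALLOWED_TOOLS` and `validate_action` below are illustrative names, assuming the `{"type": ..., "tool": ..., "args": ...}` action shape used in the loop above:

```python
from typing import Optional

ALLOWED_TOOLS = {"http.get", "search"}  # assumed allowlist for this sketch

def validate_action(action: dict) -> Optional[str]:
    """Return an error string, or None if the action is valid."""
    if action.get("type") == "finish":
        return None if isinstance(action.get("answer"), str) else "finish_missing_answer"
    if action.get("type") == "tool":
        if action.get("tool") not in ALLOWED_TOOLS:
            return "tool_not_allowlisted"
        if not isinstance(action.get("args"), dict):
            return "args_not_object"
        return None
    return "unknown_action_type"

def test_allowlist_and_schema():
    assert validate_action({"type": "tool", "tool": "http.get", "args": {}}) is None
    assert validate_action({"type": "tool", "tool": "shell.exec", "args": {}}) == "tool_not_allowlisted"
    assert validate_action({"type": "finish"}) == "finish_missing_answer"
    assert validate_action({"type": "plan"}) == "unknown_action_type"
```

Returning error strings instead of raising keeps the validator easy to assert on and gives you a ready-made rejection taxonomy for traces.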
## Copy-paste checklist
- [ ] Inject `llm` and `tools` as interfaces (no globals).
- [ ] Assert on `stop_reason` for every test.
- [ ] Assert tool calls: count + sequence + args hash (not raw args if sensitive).
- [ ] Test "bad paths": invalid action, tool error, budget stop.
- [ ] One golden test per production incident (yes, really).
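The "args hash" item can be as small as canonical JSON plus SHA-256. A sketch (`args_hash` is an illustrative helper name):

```python
import hashlib
import json

def args_hash(args: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so dict ordering
    # never changes the hash; truncate for readable test fixtures.
    blob = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

def test_args_hash_is_stable():
    a = args_hash({"url": "https://example.com", "timeout": 5})
    b = args_hash({"timeout": 5, "url": "https://example.com"})
    assert a == b  # key order doesn't matter
    assert a != args_hash({"url": "https://example.org", "timeout": 5})
```

Tests then assert on `[(name, args_hash(args)), ...]` sequences, so fixtures stay stable and secrets never land in the repo.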
## Safe default config snippet (YAML)

```yaml
agent_tests:
  budgets:
    max_steps: 25
    max_tool_calls: 12
  invariants:
    stop_reason_required: true
    action_schema_strict: true
    tool_allowlist_required: true
  golden_tasks:
    - id: "fetch_once"
      task: "Fetch https://example.com and summarize in 3 bullets."
      expect_stop_reason: "finish"
      max_tool_calls: 2
  replay:
    enabled: true
    mode: "record_then_replay"
    store: ".agent-replays/"
```
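A minimal sketch of the `record_then_replay` mode: wrap real tools once to record fixtures, then replay them deterministically in tests. `ReplayTools` and the JSON file layout are illustrative; a real implementation would also match on args and fail loudly on missing fixtures.

```python
import json
from pathlib import Path

class ReplayTools:
    """Record mode when wrapping real tools; replay mode when record_from is None."""

    def __init__(self, store: Path, record_from=None):
        self.store = store
        self.record_from = record_from
        self.fixtures = json.loads(store.read_text()) if store.exists() else []
        self.i = 0  # replay cursor

    def call(self, name, args):
        if self.record_from is not None:
            # Record: pass through to the real tool and persist the observation.
            obs = self.record_from.call(name, args)
            self.fixtures.append({"name": name, "args": args, "obs": obs})
            self.store.write_text(json.dumps(self.fixtures, indent=2))
            return obs
        # Replay: return recorded observations in order, flagging drift.
        fx = self.fixtures[self.i]
        self.i += 1
        assert fx["name"] == name, f"replay drift: expected {fx['name']}, got {name}"
        return fx["obs"]
```

Because `ReplayTools` satisfies the same `Tools` interface as the real gateway, it drops straight into `run_agent` with no loop changes.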
## FAQ
Q: Isn't this just mocking the model?
A: Yes, on purpose. Unit tests are for your loop contract: budgets, tool gateway behavior, stop reasons, and trace shape.
Q: What should I assert on?
A: Stop reason, tool call count, tool allowlist decisions, and trace shape. Don't assert on the exact prose output.
Q: How do I test tool flakiness?
A: Record/replay fixtures (or sandboxed integration tests). Unit tests should stay deterministic.
Q: Do I need evals if I have unit tests?
A: Yes. Unit tests stop incidents. Evals catch quality drift. They're different failure classes.
## Related pages
- Foundations: Tool calling · What makes an agent production-ready
- Failures: Tool spam · Budget explosion
- Governance: Budget controls
- Production stack: AI Agent Production Stack