Normal path: execute → tool → observe.
Nothing changed.
Except:
- somebody updated a prompt “slightly”
- a tool started returning a new field
- the model got a version bump
- your retrieval index updated
The agent still “works”.
But it’s slower. It calls different tools. It makes different decisions. It misses edge cases. Nobody notices until a user does — and users are not gentle QA.
This is silent drift: production behavior changes without an obvious failure.
Quick take
- Drift is inevitable (model/tool/prompt changes); unmeasured drift is the failure.
- Catch drift with golden tasks + replay + canary and alert on behavior deltas.
- Watch operational signals (tool calls, tokens, latency, stop reasons) before correctness complaints.
Why this fails in production
1) Model output is not stable
Even without version changes, model output has variance. With version changes, it’s guaranteed to shift.
If you don’t measure the shift, you don’t notice it.
2) Tools drift too
Tool outputs change:
- schema evolves
- error payloads change
- ordering changes
- defaults change
If your agent is sensitive to those changes, it will drift.
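One cheap guard against this, sketched below (field names are illustrative, not from a real tool): record the set of top-level fields each tool returned at baseline, and flag any run where the shape changes.

```python
def schema_fingerprint(payload: dict) -> frozenset:
    """Top-level field names, used as a cheap shape signature."""
    return frozenset(payload.keys())


def check_tool_shape(baseline: dict, current: dict) -> list:
    """Return human-readable diffs between two tool payload shapes."""
    b, c = schema_fingerprint(baseline), schema_fingerprint(current)
    diffs = []
    for missing in sorted(b - c):
        diffs.append(f"field removed: {missing}")
    for added in sorted(c - b):
        diffs.append(f"field added: {added}")
    return diffs


# Example: a tool quietly started returning a new field.
print(check_tool_shape(
    {"url": "a", "status": 200},
    {"url": "a", "status": 200, "snippet": "..."},
))
# -> ['field added: snippet']
```

This only catches shape drift, not value drift (ordering, defaults), but it is nearly free to run on every trace.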
3) Prompts are code (but usually not treated like it)
Prompt edits are often shipped without:
- tests
- rollbacks
- canaries
- metrics
That’s how you get “we changed one sentence and now it calls http.get 10x more”.
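A minimal way to start treating prompts like code, sketched here: pin a content hash per release and fail the deploy when the live prompt no longer matches. The prompt text and pinned value are illustrative.

```python
import hashlib


def prompt_hash(text: str) -> str:
    """Stable fingerprint of a prompt's exact content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


# Pinned at release time (illustrative; regenerate on intentional edits).
PINNED = prompt_hash("You are a support agent. Answer briefly.")


def assert_prompt_unchanged(deployed_text: str) -> None:
    h = prompt_hash(deployed_text)
    if h != PINNED:
        raise RuntimeError(f"prompt drift: expected {PINNED}, got {h}")
```

The point is not the hash itself; it is that a "slight" prompt edit now produces a visible diff in review instead of a silent behavior change.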
4) Drift shows up as cost and latency before it shows up as correctness
The early warnings are usually operational:
- tokens/request creep up
- tool calls/run creep up
- p95 latency creeps up
- stop reasons shift
If you only watch “success rate”, you’ll miss it.
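These signals are cheap to compute from run logs. A sketch, assuming each run record carries `tokens`, `tool_calls`, and `latency_ms` (adapt to your own log schema):

```python
import statistics


def p95(values: list) -> float:
    """95th percentile via sorted index (fine for monitoring, not billing)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]


def window_stats(runs: list) -> dict:
    """Operational drift signals for one time window of runs."""
    return {
        "tokens_per_run": statistics.mean(r["tokens"] for r in runs),
        "tool_calls_per_run": statistics.mean(r["tool_calls"] for r in runs),
        "latency_p95_ms": p95([r["latency_ms"] for r in runs]),
    }


def pct_increase(before: float, after: float) -> float:
    return 100.0 * (after - before) / before if before else 0.0
```

Compare `window_stats(last_week)` against `window_stats(this_week)` and alert when any `pct_increase` crosses a threshold, before anyone complains about correctness.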
5) The fix is a feedback loop: golden tasks + replay + canary
You need a production-shaped eval loop:
- golden tasks that represent your real workload
- replay of real traces (with redaction)
- canary rollout for model/prompt/tool changes
- alerting on behavior deltas
Implementation example (real code)
This is a minimal “golden tasks” harness:
- runs tasks against current and candidate versions
- compares stop reasons and tool-call counts
- fails if deltas exceed thresholds
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenTask:
    id: str
    input: str


def run_agent(version: str, task: GoldenTask) -> dict:
    # Pseudo: run your agent with pinned model/prompt/tools config.
    return agent_run(version=version, input=task.input)  # (pseudo)


def score(run: dict) -> dict:
    return {
        "stop_reason": run.get("stop_reason"),
        "tool_calls": int(run.get("tool_calls", 0)),
        "tokens": int(run.get("tokens_total", 0)),
    }


def drift_check(
    tasks: list[GoldenTask],
    *,
    baseline: str,
    candidate: str,
    run_agent_fn,
    max_tool_calls_delta: int = 3,
) -> None:
    for t in tasks:
        b = score(run_agent_fn(baseline, t))
        c = score(run_agent_fn(candidate, t))
        if c["stop_reason"] != b["stop_reason"]:
            raise RuntimeError(
                f"[{t.id}] stop_reason drift: {b['stop_reason']} -> {c['stop_reason']}"
            )
        if c["tool_calls"] > b["tool_calls"] + max_tool_calls_delta:
            raise RuntimeError(
                f"[{t.id}] tool_calls drift: {b['tool_calls']} -> {c['tool_calls']}"
            )
```

The same harness in JavaScript:

```javascript
export function score(run) {
  return {
    stopReason: run.stop_reason,
    toolCalls: Number(run.tool_calls || 0),
    tokens: Number(run.tokens_total || 0),
  };
}

export function driftCheck(tasks, { baseline, candidate, runAgent, maxToolCallsDelta = 3 }) {
  for (const t of tasks) {
    const b = score(runAgent(baseline, t));
    const c = score(runAgent(candidate, t));
    if (c.stopReason !== b.stopReason) {
      throw new Error("[" + t.id + "] stop_reason drift: " + b.stopReason + " -> " + c.stopReason);
    }
    if (c.toolCalls > b.toolCalls + maxToolCallsDelta) {
      throw new Error("[" + t.id + "] tool_calls drift: " + b.toolCalls + " -> " + c.toolCalls);
    }
  }
}
```

This is intentionally crude. It still catches the most common drift:
- stop reasons changing (new timeouts, new loops)
- tool-call inflation (cost drift)
Then you add task-specific correctness checks. But start with operational drift — it’s easier to measure and it’s usually the first sign.
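Task-specific checks can piggyback on the same harness. A sketch, under assumptions not in the harness above: each task gets an optional predicate (`expect`), and the run dict exposes the final answer under a `final_text` key.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class CheckedTask:
    id: str
    input: str
    # Optional correctness predicate over the run's final answer text.
    expect: Optional[Callable[[str], bool]] = None


def correctness_check(task: CheckedTask, run: dict) -> None:
    if task.expect is None:
        return  # operational-only task; drift checks still apply
    answer = run.get("final_text", "")
    if not task.expect(answer):
        raise RuntimeError(f"[{task.id}] correctness drift: predicate failed")


# Example: a refund-policy task must still mention the 30-day window.
task = CheckedTask(
    "refund-1",
    "What is the refund window?",
    expect=lambda a: "30 days" in a,
)
correctness_check(task, {"final_text": "Refunds are accepted within 30 days."})
```

Predicates stay deliberately loose (substring, regex, numeric tolerance); exact-match assertions on model text are too brittle to survive normal variance.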
Example failure case (incident-style, numbers are illustrative)
We upgraded a model version for a support agent without a canary or golden tasks.
The new model was “better at being thorough”.
It also called search.read more often.
Impact over 24 hours (example numbers):
- tool calls/run: 2.8 → 9.6
- p95 latency: 2.7s → 7.4s
- spend: +$460 vs baseline
- correctness didn’t obviously drop, so nobody noticed until the bill did
Fix:
- golden tasks with drift thresholds (tool calls, stop reasons)
- canary rollout (1% traffic) with auto-rollback on spikes
- replay of anonymized real traces weekly
- metrics dashboards: tokens, tool calls, stop reasons, latency
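For the weekly replay step, traces need redaction before reuse. A minimal sketch covering two common PII shapes (emails, long digit runs); real redaction needs far more patterns than this, so treat it as a starting point only.

```python
import re

# Illustrative patterns; extend per your data (names, addresses, API keys, ...).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
DIGITS = re.compile(r"\b\d{7,}\b")  # phone/account-number-like runs


def redact(text: str) -> str:
    """Replace obvious PII with typed placeholders before storing a trace."""
    text = EMAIL.sub("<email>", text)
    return DIGITS.sub("<number>", text)


print(redact("Contact jane.doe@example.com about order 123456789"))
# -> Contact <email> about order <number>
```

Typed placeholders (`<email>`, `<number>`) keep replayed traces realistic enough for the agent while removing the values themselves.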
Drift isn’t exciting. It’s just how production breaks when nobody is watching.
Trade-offs
- Golden task suites take time to maintain.
- Canary adds rollout complexity (worth it).
- Some drift is “good” (better answers). You still need to measure it to decide.
When NOT to use
- If your agent is purely informational and low-stakes, you can be looser (still watch spend).
- If you don’t have a stable task distribution yet, start with small smoke tests and build golden tasks over time.
- If you can’t replay traces safely (PII), use synthetic tasks and strict budgets.
Copy-paste checklist
- [ ] Golden tasks representing real workload
- [ ] Replay set from real traces (redacted)
- [ ] Canary rollout with rollback triggers
- [ ] Drift thresholds: tool calls, tokens, latency, stop reasons
- [ ] Model/prompt/tool versions pinned per release
- [ ] Weekly “what changed” review
Safe default config snippet (YAML)

```yaml
releases:
  canary_percent: 1
  rollback_on:
    tool_calls_per_run_increase_pct: 50
    tokens_per_request_increase_pct: 50
    latency_p95_increase_pct: 50
eval:
  golden_tasks_required: true
  drift_thresholds:
    tool_calls_delta: 3
    stop_reason_changes: 0
```
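A sketch of how the `rollback_on` block might be evaluated during a canary, assuming baseline and canary metrics are keyed by the trigger names minus the `_increase_pct` suffix (the metric dicts below are illustrative):

```python
ROLLBACK_ON = {
    "tool_calls_per_run_increase_pct": 50,
    "tokens_per_request_increase_pct": 50,
    "latency_p95_increase_pct": 50,
}


def should_rollback(baseline: dict, canary: dict, triggers: dict = ROLLBACK_ON) -> list:
    """Return the list of tripped triggers (empty means the canary is healthy)."""
    tripped = []
    for key, max_pct in triggers.items():
        metric = key.removesuffix("_increase_pct")
        before, after = baseline[metric], canary[metric]
        increase = 100.0 * (after - before) / before if before else 0.0
        if increase > max_pct:
            tripped.append(f"{metric}: +{increase:.0f}% > {max_pct}%")
    return tripped


# Example numbers from the incident above:
print(should_rollback(
    {"tool_calls_per_run": 2.8, "tokens_per_request": 1000, "latency_p95": 2.7},
    {"tool_calls_per_run": 9.6, "tokens_per_request": 1100, "latency_p95": 7.4},
))
# -> tool calls and p95 latency trip; tokens stay under the 50% threshold
```

Wire the non-empty return value to an automatic rollback plus a page, and the incident above becomes a five-minute blip instead of a 24-hour bill.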
Related pages
- Foundations: Why agents fail in production · How LLM limits affect agents
- Failure: Token overuse incidents · Budget explosion
- Governance: Tool permissions (allowlists)
- Production stack: Production agent stack