Problem
From the outside, everything looks stable: monitoring is quiet, there are no obvious incidents, and success rate is almost unchanged.
But run metrics show a shift: a week ago this agent completed the task in 2-3 steps; after a small prompt edit and a model-version update, it now needs 7-9.
The system did not crash.
It just slowly drifted sideways.
Analogy: imagine a store scale that drifts off by another 1-2 grams each day. On day one this is almost invisible; after a month, the error affects every transaction at the register. Agent drift works the same way: a small deviation in each run becomes a large loss at scale.
Why this happens
LLM agents are stochastic systems. A small change in model, prompt, or input data can change step ordering. Even a minor difference in the reasoning loop accumulates into drift over time.
In production, drift usually moves silently:
- model, prompt, tool output, or retrieval data changes;
- the agent formally keeps completing tasks;
- but takes different steps and spends more resources;
- without baseline comparison, it looks like "everything is fine".
The problem is not one specific change. The problem is missing release control that catches baseline deviation early.
Which failures happen most often
To keep it practical, production teams usually separate four drift types.
Model drift
After an LLM version change, the agent starts ranking steps differently: it "double-checks" more often, finishes runs later, or chooses another tool.
Typical cause: model version was updated without baseline comparison on a golden task set.
Prompt drift
A small edit in the system prompt changes agent priorities: it becomes "too cautious" or "too active".
Typical cause: prompt was changed as plain text, not as production code with tests and canary.
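One way to treat the prompt as production code is a snapshot test in CI: pin a fingerprint of the reviewed prompt and fail the build on any unreviewed edit. A minimal sketch; the prompt text and workflow below are hypothetical:

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Stable fingerprint of the system prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def check_prompt_unchanged(prompt: str, pinned_fingerprint: str) -> bool:
    """CI guard: fail the build if the prompt drifted from the reviewed version."""
    return prompt_fingerprint(prompt) == pinned_fingerprint


# Pin the fingerprint at review time, then verify on every build.
reviewed = "You are a support agent. Answer concisely."
pinned = prompt_fingerprint(reviewed)

# Even a one-sentence edit changes the fingerprint and triggers review.
edited = reviewed + " Always double-check before answering."
assert check_prompt_unchanged(reviewed, pinned)
assert not check_prompt_unchanged(edited, pinned)
```

The point is not the hash itself but the workflow: the fingerprint only changes through a deliberate, reviewed update.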
Tool contract drift
A tool returns a new field, a different error format, or an empty array instead of null.
The agent interprets it differently and changes its decision loop.
In production this can easily become tool failure or tool spam.
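A minimal guard against this is to validate tool output against a pinned shape before the agent sees it, so a missing field or an unexpected null surfaces as an explicit error instead of a silent behavior change. A sketch with a hypothetical `search` tool contract:

```python
# Pinned contract for a hypothetical `search` tool: required keys and types.
SEARCH_CONTRACT = {"results": list, "total": int}


def validate_tool_output(output: dict, contract: dict) -> list[str]:
    """Return contract violations instead of letting the agent reinterpret them."""
    violations: list[str] = []
    for key, expected_type in contract.items():
        if key not in output:
            violations.append(f"missing_field:{key}")
        elif output[key] is None:
            # An explicit null where a value is expected is itself contract drift.
            violations.append(f"null_field:{key}")
        elif not isinstance(output[key], expected_type):
            violations.append(f"wrong_type:{key}")
    return violations


assert validate_tool_output({"results": [], "total": 0}, SEARCH_CONTRACT) == []
assert validate_tool_output({"results": None, "total": 0}, SEARCH_CONTRACT) == ["null_field:results"]
```

Note that an empty array passes while null fails: the two cases the agent would otherwise interpret differently are now distinguished at the boundary.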
Retrieval and context drift
Knowledge index changes: new docs were added, ranking changed, more irrelevant chunks entered the context window. The agent still works formally, but picks wrong facts more often.
By symptoms this often looks close to context poisoning.
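One cheap retrieval-drift signal is to replay golden queries against the old and new index and compare the retrieved chunk sets. A minimal sketch using Jaccard overlap; the chunk ids are hypothetical:

```python
def retrieval_overlap(baseline_ids: set[str], candidate_ids: set[str]) -> float:
    """Jaccard overlap of retrieved chunk ids for the same golden query."""
    if not baseline_ids and not candidate_ids:
        return 1.0
    return len(baseline_ids & candidate_ids) / len(baseline_ids | candidate_ids)


# Same golden query against the old and new index.
before = {"doc1#3", "doc2#1", "doc7#0"}
after = {"doc1#3", "doc9#2", "doc4#5"}

overlap = retrieval_overlap(before, after)
assert overlap == 1 / 5  # only 1 shared chunk out of 5 distinct: likely retrieval drift
```

A low overlap is not proof of regression (the new ranking may be better), but it flags exactly the queries worth inspecting by hand.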
How to detect these problems
Drift is best seen not in a single metric, but in deviation from baseline.
| Metric | Drift signal | What to do |
|---|---|---|
| `tool_calls_per_task` | slow but steady growth | compare candidate with baseline, add deviation thresholds |
| `tokens_per_task` | higher usage without quality gain | review the prompt and caps on tool output |
| `latency_p95` | degradation after release | canary + automatic rollback on threshold |
| `stop_reason_distribution` | more `timeout` or `max_steps_reached` | check for new loops and policy changes in the prompt |
| `task_success_rate` | almost unchanged while other metrics worsen | do not trust success rate alone; inspect the full run profile |
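The deviations in the table above reduce to two small helpers: a relative delta for scalar metrics and a distance between `stop_reason` distributions. A minimal sketch with hypothetical numbers:

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of a scalar run metric against baseline."""
    return (candidate - baseline) / baseline


def stop_reason_shift(baseline: dict[str, float], candidate: dict[str, float]) -> float:
    """Total variation distance between two stop_reason distributions (0..1)."""
    reasons = set(baseline) | set(candidate)
    return 0.5 * sum(abs(baseline.get(r, 0.0) - candidate.get(r, 0.0)) for r in reasons)


# tokens_per_task grew from 1000 to 1300: a 30% delta.
assert round(pct_delta(1000, 1300), 2) == 0.30

# Stop reasons shifted toward limits: a clear drift signal.
shift = stop_reason_shift(
    {"done": 0.9, "timeout": 0.1},
    {"done": 0.7, "timeout": 0.2, "max_steps_reached": 0.1},
)
assert round(shift, 2) == 0.20
```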
How to distinguish drift from a useful change
Not every behavior change is bad. Sometimes the new version is truly better. The key question is whether quality improved without disproportionate cost.
Normal if:
- quality increased while `tokens_per_task` and `latency_p95` stayed close;
- new behavior is stable on golden tasks;
- canary does not show growth in `timeout` and `max_steps_reached`.
Dangerous if:
- success rate looks similar, but cost and latency grow;
- stop reasons shift toward limits;
- the agent uses tools more often without accuracy gain.
How to stop these failures
In practice it looks like this:
- you make a change (candidate);
- in CI, the drift gate runs tests and compares candidate with baseline over quality/tokens/tool calls/latency/stop reasons;
- if thresholds are violated, release is blocked or rolled back;
- if thresholds are fine, change goes to canary, then full rollout.
baseline
↓
candidate evaluation
↓
threshold gate
↓
canary
↓
production
Minimal runtime-CI barrier against drift:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Allowed deviation of a candidate release from the pinned baseline."""
    max_tool_calls_delta: int = 2
    max_tokens_delta_pct: float = 0.30
    max_latency_delta_pct: float = 0.30
    allow_stop_reason_change: bool = False


def violates_thresholds(baseline: dict, candidate: dict, t: Thresholds) -> list[str]:
    """Return the list of violated thresholds; an empty list means the gate passes."""
    errors: list[str] = []
    if candidate["tool_calls"] > baseline["tool_calls"] + t.max_tool_calls_delta:
        errors.append("tool_calls_delta_exceeded")
    if candidate["tokens"] > baseline["tokens"] * (1 + t.max_tokens_delta_pct):
        errors.append("tokens_delta_exceeded")
    if candidate["latency_ms"] > baseline["latency_ms"] * (1 + t.max_latency_delta_pct):
        errors.append("latency_delta_exceeded")
    if (not t.allow_stop_reason_change) and candidate["stop_reason"] != baseline["stop_reason"]:
        errors.append("stop_reason_changed")
    return errors
```
This barrier does not do magic. It simply prevents shipping a slower and more expensive regression disguised as a "successful" release.
Where this is implemented in architecture
Drift control is usually split across two layers.
Agent Runtime captures drift signals during execution: `stop_reason_distribution`, `steps_per_task`, `tokens_per_task`. Without these metrics, a threshold gate has nothing to compare.
Tool Execution Layer is a source of part of the drift: a changed tool output format, a new retry policy, or a different error contract silently changes agent behavior. This is where tool contracts should be versioned.
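Versioning a tool contract can be as small as pinning a version string and the required output fields, and rejecting anything that does not match at the execution layer. A sketch; the tool name, version, and fields are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolContract:
    """Pinned, versioned description of what a tool is allowed to return."""
    name: str
    version: str
    required_fields: frozenset[str]


def check_contract(contract: ToolContract, reported_version: str, output: dict) -> list[str]:
    """Reject silent contract changes instead of letting the agent absorb them."""
    errors: list[str] = []
    if reported_version != contract.version:
        errors.append(f"{contract.name}:version_mismatch")
    missing = contract.required_fields - set(output)
    errors.extend(f"{contract.name}:missing:{field}" for field in sorted(missing))
    return errors


search_v1 = ToolContract("search", "1.2.0", frozenset({"results", "total"}))
assert check_contract(search_v1, "1.2.0", {"results": [], "total": 0}) == []
assert check_contract(search_v1, "1.3.0", {"results": []}) == [
    "search:version_mismatch",
    "search:missing:total",
]
```

With this in place, a tool upgrade becomes an explicit contract bump that forces a baseline comparison, rather than a silent change in agent behavior.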
Self-check
Quick pre-release check. This is a short sanity check, not a formal audit.
- A golden task set exists and baseline metrics are pinned.
- Model version updates are compared with baseline before rollout.
- Prompt changes go through review, tests, and canary, not plain-text edits.
- Tool contracts are versioned and tool output is validated.
- The CI drift gate compares tokens, tool calls, latency, and stop reasons.
- Canary with automatic rollback thresholds is configured.
- Cost per successful run is tracked per release.
FAQ
Q: Does drift always mean the model got worse?
A: No. Drift means behavior changed. It becomes bad when the change is unmeasured and uncontrolled.
Q: Can I detect drift only by success rate?
A: No. Success rate usually lags. tool_calls, tokens, latency, and stop reasons move earlier.
Q: Is canary needed for small prompt edits?
A: For high-traffic systems, yes. Even one sentence in a prompt can change the agent's action choice.
Q: What if drift exists but quality is slightly better?
A: Calculate unit economics: cost per successful run in baseline and candidate. If quality is better and new run cost stays within budget, ship and pin the new baseline.
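That calculation fits in one line: divide total cost by the number of successful runs. A sketch with hypothetical numbers:

```python
def cost_per_successful_run(total_cost: float, runs: int, success_rate: float) -> float:
    """Unit economics: what one successful run actually costs."""
    successes = runs * success_rate
    if successes == 0:
        raise ValueError("no successful runs")
    return total_cost / successes


# Candidate is slightly more accurate but noticeably more expensive.
baseline = cost_per_successful_run(total_cost=100.0, runs=1000, success_rate=0.92)
candidate = cost_per_successful_run(total_cost=150.0, runs=1000, success_rate=0.94)

assert round(baseline, 4) == 0.1087
assert round(candidate, 4) == 0.1596
# Quality is up ~2 points, but each success costs ~47% more: decide against the budget.
```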
Agent drift almost never looks like a crash. It is a slow regression visible only in baseline comparison. That is why production agents need not only better models, but strict release control.
Related pages
To handle drift more fully in production, see:
- Why AI agents fail - general map of common production failures.
- Budget explosion - how behavioral drift silently inflates costs.
- Tool spam - how to control growth of unnecessary tool calls.
- Context poisoning - how context issues hide as "strange" agent decisions.
- Agent Runtime - where to place release gates, limits, and stop reasons.
- Tool Execution Layer - where validation, retries, and call control should live.