The Problem
The request looks simple: verify payment for several orders and return a short summary.
Traces show something else: in 14 minutes, one run made 63 steps, 41 tool calls, and burned about $11.80. For this class of task, that is usually around $0.20-0.30.
There is no obvious crash: some calls return 200 and the agent is formally "working", but the run queue grows and cost_per_run exceeds budget limits within the first minutes.
The system does not fail hard.
It just slowly inflates the bill and the run queue until spend drifts well past budget boundaries.
Analogy: imagine a taxi meter that is never reset between rides. The car keeps moving, passengers change, but the amount only accumulates. Budget explosion in agents looks the same: work appears to continue, while costs grow faster than value.
Why This Happens
Budget explosion usually comes not from one expensive call but from the lack of strict control over accumulated runtime cost.
In production it is typically this mix:
- context and history grow turn by turn, so each new model call gets more expensive;
- one agent step can trigger tool fan-out, and the cost multiplies;
- retries live in several layers and convert a short failure into a long cost wave;
- there is no single budget gate for steps, tokens, tool calls, time, and USD;
- without stop reasons and cost metrics, incidents are noticed only after the invoice.
In traces this appears as simultaneous growth of prompt_tokens, tool_calls, and retry_attempts, where each step costs more than the previous one.
Without a runtime-level budget gate, every new step only deepens the incident.
Most Common Failure Patterns
In production, four recurring budget explosion patterns appear most often.
Cumulative context growth (Context cost creep)
Prompt size grows without priorities: history, retrieval, and tool output are added almost without limits.
Typical cause: no max_prompt_tokens, no source caps, and no summarization tier.
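A minimal cap on context sources might look like the sketch below. The `count_tokens` helper is a hypothetical stand-in for a real tokenizer, and a production version would summarize dropped turns into a compact note instead of discarding them outright.

```python
def count_tokens(text: str) -> int:
    # Hypothetical stand-in: replace with your model's real tokenizer.
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], max_prompt_tokens: int) -> list[dict]:
    """Keep the system message plus the newest turns that fit the token cap."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_prompt_tokens - sum(count_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for msg in reversed(rest):  # walk newest-first so recent turns survive
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return system + list(reversed(kept))
```

With a cap like this, prompt cost stops tracking conversation length and starts tracking the limit you chose.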
Inflated tool fan-out (Tool fan-out spike)
One step triggers too many external calls, often in parallel. Even without errors, this sharply increases run cost.
Typical cause: no per-tool caps and no bounded fan-out.
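One way to bound this, sketched here with illustrative cap values: reject oversized fan-out outright and run the remaining calls under a concurrency limit. The `tool` callable and both caps are assumptions, not a fixed API.

```python
import asyncio

MAX_CALLS_PER_STEP = 5   # hard cap on fan-out from a single agent step
MAX_CONCURRENCY = 2      # bound on parallel external calls

async def bounded_fan_out(calls: list, tool):
    # Reject oversized fan-out instead of silently paying for it.
    if len(calls) > MAX_CALLS_PER_STEP:
        raise RuntimeError(f"fan-out {len(calls)} exceeds cap {MAX_CALLS_PER_STEP}")
    sem = asyncio.Semaphore(MAX_CONCURRENCY)

    async def run_one(args):
        async with sem:  # at most MAX_CONCURRENCY calls in flight
            return await tool(args)

    return await asyncio.gather(*(run_one(a) for a in calls))
```

Rejecting at the gate is deliberate: a planner that wants 40 parallel calls should produce a stop reason, not a bill.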
Cross-layer retry amplification
Retries are performed by runtime, tool gateway, and SDK at the same time. A short dependency degradation becomes a long wave of repeated spending.
Typical cause: retry policy is not centralized in one place.
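Centralizing retries can be sketched as one choke point with a shared retry budget and capped backoff. The error type to treat as transient and the delay values are assumptions; adjust them for your dependencies.

```python
import random
import time

class RetryBudget:
    """One retry budget shared by every tool call in a single run."""
    def __init__(self, max_retries: int = 6):
        self.remaining = max_retries

def call_with_retry(fn, budget: RetryBudget, base_delay: float = 0.5):
    """Retry transient failures through one choke point, never past the budget."""
    attempt = 0
    while True:
        try:
            return fn()
        except TimeoutError:  # retry only errors known to be transient
            if budget.remaining <= 0:
                raise RuntimeError("budget:retries")
            budget.remaining -= 1
            attempt += 1
            # Exponential backoff with full jitter, capped so waits stay bounded.
            time.sleep(random.random() * min(base_delay * 2 ** attempt, 5.0))
```

Because the budget is shared across tools, a short dependency degradation consumes a fixed number of retries for the whole run instead of multiplying across layers.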
Queue snowball (Queue cost snowball)
Long expensive runs occupy workers, backlog grows, and new runs also get more expensive because of wait and timeout.
Typical cause: no strict max_seconds, max_steps, and no stop reason for budget overflow.
How To Detect These Issues
Budget explosion is visible in the combination of cost, runtime, and queue metrics.
| Metric | Budget explosion signal | What to do |
|---|---|---|
| cost_per_run | sharp growth of single-run cost | enable max_usd and a budget gate before every step |
| tool_cost_share | tool spend share becomes disproportionately high | limit fan-out and add per-tool caps |
| retry_attempts_per_run | many repeats of the same calls | centralize retries in the tool gateway and add a retry budget |
| prompt_tokens_per_run | steady token growth without quality improvement | caps on context sources plus summarization |
| queue_backlog | queue grows together with long, expensive runs | limit max_seconds and terminate runaway runs in a controlled way |
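The baseline comparison behind this table fits in a few lines. Metric names follow the table; the 3x ratio is an illustrative threshold, not a recommendation, and real deployments usually tune it per request class.

```python
def budget_alerts(run: dict, baseline: dict, ratio: float = 3.0) -> list[str]:
    """Flag run metrics that exceed the per-class baseline by a given ratio."""
    alerts = []
    for metric in ("cost_per_run", "prompt_tokens_per_run", "retry_attempts_per_run"):
        if run[metric] > baseline[metric] * ratio:
            alerts.append(f"budget_anomaly:{metric}")
    return alerts
```

The point is the comparison against a baseline, not the exact numbers: budget explosion is defined relative to what this request class normally costs.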
How To Distinguish Budget Explosion From A Truly Expensive Task
Not every expensive task is an incident. The key question is whether the extra cost buys a predictable quality increase.
Normal if:
- cost grows together with accuracy or coverage for a complex task;
- there is a controlled spending profile for this request class;
- cost_per_success remains within target unit economics.
Dangerous if:
- cost grows faster than success rate;
- the same retries and tool signatures repeat without new signal;
- budget "explodes" without changes in task complexity or SLA.
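One hedged way to encode the "cost grows faster than success rate" test: compare cost growth against success growth, both relative to a per-class baseline. The tolerance value is illustrative.

```python
def is_budget_explosion(cost_now: float, cost_base: float,
                        success_now: float, success_base: float,
                        tolerance: float = 1.5) -> bool:
    """Cost may grow, but no faster than success does (within a tolerance)."""
    cost_growth = cost_now / cost_base
    success_growth = max(success_now, 1e-9) / max(success_base, 1e-9)
    return cost_growth > success_growth * tolerance
```

A 10x cost jump with a flat success rate trips this check; a 2x cost jump that nearly doubles success does not.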
How To Stop This
In practice, this is the pattern:
- define execution budgets: max_steps, max_seconds, max_prompt_tokens, max_tool_calls, max_usd;
- check the budget gate at every step, not only at run end;
- centralize retries in one tool gateway and reject non-retryable errors;
- on limit breach, return a stop reason, partial/fallback response, and an alert.
Minimal guard for budget control:
```python
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class BudgetLimits:
    max_steps: int = 30
    max_seconds: int = 120
    max_prompt_tokens: int = 12000
    max_tool_calls: int = 20
    max_retries: int = 6
    max_usd: float = 2.0


@dataclass
class BudgetUsage:
    steps: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    tool_calls: int = 0
    retries: int = 0
    model_usd: float = 0.0
    tool_usd: float = 0.0


def estimate_model_usd(prompt_tokens: int, completion_tokens: int) -> float:
    # Placeholder pricing: replace with your real model pricing.
    return (prompt_tokens / 1000) * 0.003 + (completion_tokens / 1000) * 0.015


class BudgetGuard:
    def __init__(self, limits: BudgetLimits = BudgetLimits()):
        self.limits = limits
        self.usage = BudgetUsage()
        self.started_at = time.time()

    def total_usd(self) -> float:
        return self.usage.model_usd + self.usage.tool_usd

    def on_step(self) -> None:
        self.usage.steps += 1

    def on_model_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.usage.prompt_tokens += prompt_tokens
        self.usage.completion_tokens += completion_tokens
        self.usage.model_usd = estimate_model_usd(
            self.usage.prompt_tokens,
            self.usage.completion_tokens,
        )

    def on_tool_call(self, tool_cost_usd: float = 0.0) -> None:
        self.usage.tool_calls += 1
        self.usage.tool_usd += tool_cost_usd

    def on_retry(self) -> None:
        self.usage.retries += 1

    def check(self) -> str | None:
        elapsed_s = time.time() - self.started_at
        if self.usage.steps > self.limits.max_steps:
            return "budget:max_steps"
        if elapsed_s > self.limits.max_seconds:
            return "budget:timeout"
        if self.usage.prompt_tokens > self.limits.max_prompt_tokens:
            return "budget:prompt_tokens"
        if self.usage.tool_calls > self.limits.max_tool_calls:
            return "budget:tool_calls"
        if self.usage.retries > self.limits.max_retries:
            return "budget:retries"
        if self.total_usd() > self.limits.max_usd:
            return "budget:usd"
        return None
```
This is a basic guard.
In production, it is usually extended with per-tool limits, backoff + jitter,
and separate budgets for model and tool parts.
check() is called after each step before planning the next action.
on_model_call(...) and on_tool_call(...) update usage right after the actual call,
so the stop reason reflects real run cost.
Where This Lives In Architecture
In production, budget explosion control is almost always split across three system layers.
Agent Runtime holds execution budgets, stop reasons, and controlled run termination. This is where budget becomes a rule, not a wish.
Tool Execution Layer controls fan-out, retries, timeouts, and external call cost. If retries are spread across layers, spending almost always multiplies.
Memory Layer controls what goes into prompt and what stays in long-term memory. Without this layer, token cost grows steadily even without harder tasks.
FAQ
Q: Do I need exact cost calculation to use a budget guard?
A: No. At start, a conservative estimate is enough. The goal is not accounting, but early stop of runaway runs.
Q: Which limit should I start with?
A: Start with conservative max_usd and max_seconds, then raise only where quality gains are proven.
Q: What if budget is exhausted for an important request?
A: Return an explicit stop reason, show partial result, and offer controlled escalation (higher tier or manual review).
Q: Where should retries live to avoid cost inflation?
A: In one choke point, usually tool gateway. When retries exist in several layers, budget explosion is almost inevitable.
Budget explosion almost never looks like a loud crash. It is a slow financial degradation that is usually visible only in metrics and baseline comparison. So production agents need not only better models, but also strict execution budget control.
Related Pages
If this happens in production, these pages are also useful:
- Why AI agents fail - a general map of production failure modes.
- Token overuse - how context growth becomes cost growth.
- Tool spam - how repeated tool calls inflate budget.
- Tool failure - how error and retry waves increase run cost.
- Agent Runtime - where to place execution budgets and stop reasons.
- Tool Execution Layer - where to keep retries, fan-out, and cost gates.