You set a “token limit”.
The agent still costs $12.
Because the expensive part wasn’t tokens. It was tools:
- browser automation
- vendor APIs
- OCR
- third-party search
Cost limits are governance because they force a hard question: “Is this run worth another $X?”
If the answer is “maybe”, you need an approval gate, not a longer prompt. If you only cap tokens, you’re putting a speed limit on one wheel.
Why this fails in production
1) Cost is multi-dimensional
You pay for:
- model tokens (input + output)
- tool calls (per call, per minute, per document)
- retries (multipliers)
- latency (compute + queue time)
If you only cap one axis, the agent will “escape” via the others.
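To see why, total a run across the axes instead of capping one of them. A sketch with invented placeholder rates (not real vendor pricing):

```python
# Total one run's spend across every axis, not just tokens.
# All rates here are invented placeholders, not real vendor pricing.
MODEL_USD_PER_TOKEN = 0.000002
TOOL_USD_PER_CALL = {"browser.run": 0.20, "ocr.run": 0.10}

def run_cost_usd(tokens: int, tool_calls: dict[str, int], retry_multiplier: float = 1.0) -> float:
    model = tokens * MODEL_USD_PER_TOKEN
    tools = sum(TOOL_USD_PER_CALL.get(t, 0.0) * n for t, n in tool_calls.items())
    # Retries multiply both axes: a retried step re-pays its tokens and its tool calls.
    return (model + tools) * retry_multiplier

# Well under any reasonable token cap, yet roughly $12 because of the browser:
cost = run_cost_usd(tokens=20_000, tool_calls={"browser.run": 40}, retry_multiplier=1.5)
```

The token term contributes four cents here; the browser contributes eight dollars before retries. A token-only cap never sees it.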
2) You don’t know cost unless you meter it
Teams discover spend in invoices because they didn’t log:
- tokens/run
- tool_calls/run
- tool_usd/run
- stop_reason
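A minimal per-run record is enough to catch this before the invoice does. A sketch emitting one JSON log line per run (field names follow the list above; the exact shape is an assumption, adapt it to your logger):

```python
import json

def run_record(run_id: str, tokens: int, tool_calls: int, tool_usd: float, stop_reason: str) -> str:
    """One JSON log line per run; dashboards aggregate tool_usd/run from these."""
    return json.dumps({
        "run_id": run_id,
        "tokens": tokens,
        "tool_calls": tool_calls,
        "tool_usd": round(tool_usd, 4),
        "stop_reason": stop_reason,  # e.g. "ok", "cost_limit", "approval_required"
    })

line = run_record("r-123", tokens=18_000, tool_calls=7, tool_usd=1.40, stop_reason="cost_limit")
```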
3) Expensive tools need gating, not “be careful”
If browser.run costs $0.20 and the agent can call it 40 times, you built a slot machine.
Put a gate in code:
- tiered budgets
- human approval for expensive actions
- safe defaults: no browser unless needed
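One way to sketch “safe defaults”: an allowlist where expensive tools are opt-in per run and everything unknown is denied. `calc.eval` is a hypothetical cheap tool; the other names come from this page.

```python
# Hypothetical tool names; the browser/OCR entries match the priced tools on this page.
CHEAP_TOOLS = {"kb.read", "calc.eval"}
EXPENSIVE_TOOLS = {"browser.run", "ocr.run"}

def tool_allowed(tool: str, *, expensive_opt_in: bool = False) -> bool:
    if tool in CHEAP_TOOLS:
        return True
    # Safe default: expensive tools are denied unless this run opted in explicitly.
    return expensive_opt_in and tool in EXPENSIVE_TOOLS
```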
Implementation example (real code)
This pattern:
- tracks running spend (model + tools)
- stops when `max_usd` is hit
- requires approval above a threshold, before calling expensive tools
```python
from dataclasses import dataclass
from typing import Any

TOOL_USD = {"browser.run": 0.20, "ocr.run": 0.10}

@dataclass(frozen=True)
class CostPolicy:
    max_usd: float = 2.00
    approval_threshold_usd: float = 0.75

class ApprovalRequired(RuntimeError):
    pass

class CostLimitExceeded(RuntimeError):
    pass

class CostMeter:
    def __init__(self, policy: CostPolicy):
        self.policy = policy
        self.usd = 0.0

    def add_model(self, *, tokens_in: int, tokens_out: int) -> None:
        self.usd += (tokens_in + tokens_out) * 0.000002  # placeholder rate
        self._check()

    def add_tool(self, *, tool: str) -> None:
        self.usd += float(TOOL_USD.get(tool, 0.0))
        self._check()

    def gate_tool(self, *, tool: str) -> None:
        # Gate BEFORE the call: project what the run would cost if we proceed.
        projected = self.usd + float(TOOL_USD.get(tool, 0.0))
        if projected >= self.policy.approval_threshold_usd:
            raise ApprovalRequired(
                f"approval required before calling {tool} (projected_usd={projected:.2f})"
            )

    def _check(self) -> None:
        if self.usd >= self.policy.max_usd:
            raise CostLimitExceeded(f"max_usd exceeded ({self.usd:.2f})")

def run(task: str, *, policy: CostPolicy) -> dict[str, Any]:
    meter = CostMeter(policy)
    while True:
        action, tokens_in, tokens_out = llm_decide(task)  # (pseudo)
        meter.add_model(tokens_in=tokens_in, tokens_out=tokens_out)
        if action.kind != "tool":
            return {"status": "ok", "answer": action.final_answer, "usd": meter.usd}
        meter.gate_tool(tool=action.name)  # may raise ApprovalRequired
        obs = call_tool(action.name, action.args)  # (pseudo)
        meter.add_tool(tool=action.name)
        task = update(task, action, obs)  # (pseudo)
```

The same meter in JavaScript:

```javascript
const TOOL_USD = { "browser.run": 0.2, "ocr.run": 0.1 };

export class ApprovalRequired extends Error {}
export class CostLimitExceeded extends Error {}

export class CostMeter {
  constructor({ maxUsd = 2.0, approvalThresholdUsd = 0.75 } = {}) {
    this.maxUsd = maxUsd;
    this.approvalThresholdUsd = approvalThresholdUsd;
    this.usd = 0;
  }

  addModel({ tokensIn, tokensOut }) {
    this.usd += (tokensIn + tokensOut) * 0.000002; // placeholder rate
    this.check();
  }

  addTool({ tool }) {
    this.usd += Number(TOOL_USD[tool] || 0);
    this.check();
  }

  gateTool({ tool }) {
    // Gate BEFORE the call: project what the run would cost if we proceed.
    const projected = this.usd + Number(TOOL_USD[tool] || 0);
    if (projected >= this.approvalThresholdUsd) {
      throw new ApprovalRequired(
        `approval required before calling ${tool} (projected_usd=${projected.toFixed(2)})`
      );
    }
  }

  check() {
    if (this.usd >= this.maxUsd) {
      throw new CostLimitExceeded(`max_usd exceeded (${this.usd.toFixed(2)})`);
    }
  }
}
```

Real failure case (incident-style, with numbers)
We had an agent that “verified information” by browsing. It was correct. It was also expensive.
Someone changed the prompt to “double-check sources”. That turned into more browser calls.
Impact over 3 days:
- browser calls/run: 1.4 → 6.8
- spend: +$980 vs baseline
- nobody noticed until finance asked
Fix:
- cost meter that combines model + tools
- approval gate for `browser.run` above $0.75 projected spend
- a cheap path first: use `kb.read` / cached sources before browsing
- alerting on `tool_usd/run`
“It’s correct” isn’t a budget policy.
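The “cheap path first” part of that fix is a few lines of code. In this sketch, `kb_read` and `browser_run` are hypothetical stand-ins for your actual tool bindings:

```python
def fetch_source(query: str, kb_read, browser_run) -> dict:
    # Cheap path first: the KB / cache costs ~nothing and answers most queries.
    hit = kb_read(query)
    if hit is not None:
        return {"source": "kb", "data": hit}
    # Paid fallback: this is the call you meter, gate, and alert on.
    return {"source": "browser", "data": browser_run(query)}
```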
Trade-offs
- Approval gates reduce automation (that’s the point).
- Projected cost is imperfect (still better than unlimited).
- Some tasks need higher budgets; create explicit tiers with better logging.
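Explicit tiers are just named policies. A sketch, where the numbers are placeholders and the dataclass mirrors the `CostPolicy` shape used earlier on this page:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CostPolicy:  # mirrors the policy shape used by the meter on this page
    max_usd: float
    approval_threshold_usd: float

# Placeholder numbers. The approved tier buys headroom, and should come with
# extra logging/alerting attached, not just a bigger cap.
TIERS = {
    "default":  CostPolicy(max_usd=2.00, approval_threshold_usd=0.75),
    "approved": CostPolicy(max_usd=10.00, approval_threshold_usd=5.00),
}
```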
When NOT to use
- If the agent never calls paid tools, a simple token budget may be enough (still track tokens).
- If your cost model is unknown, start with tool-call caps per tool and add USD later.
- If you can’t build approvals, don’t expose expensive tools to unattended loops.
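If your cost model is unknown, per-tool call caps are a workable first control: a counter, nothing more. A minimal sketch with invented caps:

```python
class ToolCallCap:
    """Per-tool call counter: no USD model needed, just a hard cap per tool."""

    def __init__(self, caps: dict[str, int]):
        self.caps = caps
        self.calls: dict[str, int] = {}

    def allow(self, tool: str) -> bool:
        used = self.calls.get(tool, 0)
        if used >= self.caps.get(tool, 0):  # unknown tools default to cap 0: denied
            return False
        self.calls[tool] = used + 1
        return True

caps = ToolCallCap({"browser.run": 3, "ocr.run": 5})  # invented caps
```

Denying unknown tools by default keeps this consistent with the “safe defaults” posture above.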
Copy-paste checklist
- [ ] Track spend per run (model + tools)
- [ ] Cap max USD per run
- [ ] Gate expensive tools behind approvals
- [ ] Prefer cheap tools first (kb/cache) before paid tools
- [ ] Alert on `tool_usd/run` spikes and drift
- [ ] Return stop reasons users can act on
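For the last item, the caller can translate the meter’s exceptions into stop reasons a user can act on. A sketch reusing the exception names from the example above:

```python
class ApprovalRequired(RuntimeError):  # mirrors the meter's exception types
    pass

class CostLimitExceeded(RuntimeError):
    pass

def safe_run(run_fn, task: str) -> dict:
    try:
        return run_fn(task)  # expected to return {"status": "ok", ...}
    except ApprovalRequired as exc:
        # Actionable: approve and re-run under a higher tier.
        return {"status": "needs_approval", "stop_reason": str(exc)}
    except CostLimitExceeded as exc:
        # Actionable: raise max_usd deliberately, or accept the partial result.
        return {"status": "stopped", "stop_reason": str(exc)}

def _always_over_budget(task):
    raise CostLimitExceeded("max_usd exceeded (2.01)")

result = safe_run(_always_over_budget, "summarize filings")
# result == {"status": "stopped", "stop_reason": "max_usd exceeded (2.01)"}
```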
Safe default config snippet (YAML)
```yaml
cost:
  max_usd_per_run: 2.0
  approval_threshold_usd: 0.75
tools:
  priced:
    browser.run: 0.20
    ocr.run: 0.10
approvals:
  required_when_projected_over_threshold: true
```
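Loading that snippet into a policy object is a few lines. A sketch using the JSON form of the same config with the stdlib (for the YAML form, PyYAML’s `yaml.safe_load` would replace `json.loads`); the dataclass mirrors the `CostPolicy` shape used on this page:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class CostPolicy:  # mirrors the policy shape used on this page
    max_usd: float = 2.00
    approval_threshold_usd: float = 0.75

def load_policy(text: str) -> tuple[CostPolicy, dict[str, float]]:
    cfg = json.loads(text)  # for the YAML form, swap in yaml.safe_load(text)
    policy = CostPolicy(
        max_usd=float(cfg["cost"]["max_usd_per_run"]),
        approval_threshold_usd=float(cfg["cost"]["approval_threshold_usd"]),
    )
    priced = {k: float(v) for k, v in cfg["tools"]["priced"].items()}
    return policy, priced

policy, priced = load_policy(
    '{"cost": {"max_usd_per_run": 2.0, "approval_threshold_usd": 0.75},'
    ' "tools": {"priced": {"browser.run": 0.20, "ocr.run": 0.10}}}'
)
```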
FAQ
Q: Do I need exact pricing to enforce cost limits?
A: No. Approximate is fine to stop runaway behavior. Tighten it later with real tool billing data.
Q: Should approvals be per-tool or per-run?
A: Start per-tool for expensive actions. Later add per-run tiers (default vs approved) for long investigations.
Q: What if users always approve?
A: Then at least the spend is intentional and auditable. “Accidental spend” is what hurts.
Q: Isn’t this just budgets again?
A: Yes, but cost limits force you to count tools. Token budgets don’t.
Related pages
- Foundations: How agents use tools · How LLM limits affect agents
- Failure: Budget explosion · Token overuse incidents
- Governance: Budget controls · Human approval gates
- Production stack: Production agent stack