Problem
The request looks simple: check incident status and provide a short summary.
But traces show something else: in 8 minutes, one run spent more than 38k tokens, while similar tasks previously stayed around 4-6k. Half of the budget went not into the answer, but into context: history, raw logs, and large tool outputs.
For this task class, that can mean roughly $2.80 per run instead of the usual ~$0.20, with barely any gain in answer quality.
The system does not crash.
It just slowly inflates prompt size, latency, and cost.
Analogy: imagine a suitcase where before each trip you add "one more important thing", but never remove anything. At first, this is almost invisible. Then you spend more time, money, and effort just carrying extra weight. Token overuse in agents works exactly like that.
Why this happens
Token overuse usually comes not from the model itself, but from weak context-budget control in runtime.
In production, it usually looks like this:
- every new step adds everything: history, retrieval, tool output;
- raw payloads (logs, HTML, JSON) enter prompt without compression;
- without per-source caps and summarization, context grows each turn;
- tokens are spent on "carrying excess", not on useful progress;
- long reasoning loops without no-progress checks inflate history even more.
In traces, this usually shows up as steady prompt_tokens growth: each new turn becomes more expensive than the previous one, even when the task itself does not get harder.
Without control, this growth starts hurting latency, cost, and quality.
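One way to surface this in monitoring is to compute a per-turn growth factor from the trace. A minimal sketch (the function name and the 1.25x threshold are illustrative, not standard metrics):

```python
def prompt_growth_ratio(prompt_tokens_per_turn: list[int]) -> float:
    """Average per-turn growth factor of prompt size across one run."""
    if len(prompt_tokens_per_turn) < 2:
        return 1.0
    ratios = [
        later / max(earlier, 1)
        for earlier, later in zip(prompt_tokens_per_turn, prompt_tokens_per_turn[1:])
    ]
    return sum(ratios) / len(ratios)

# A run whose prompt grows by ~30% every turn is a strong overuse signal,
# even though no single turn looks alarming on its own.
trace = [1200, 1600, 2100, 2800, 3700]
assert prompt_growth_ratio(trace) > 1.25
```

A flat trace returns 1.0, so alerting on values noticeably above 1 catches runs that "carry the suitcase" from the analogy above.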
Which failures happen most often
In production, teams most often see four recurring token overuse patterns.
Prompt bloat
Too much context is sent to the model "just in case".
Typical cause: no max_prompt_tokens and no chunk prioritization.
Token bombs in tool output
Tools return large payloads (HTML, logs, stack traces), which are pushed into prompt almost as-is.
Typical cause: no caps, and payload does not go through extraction or summarization before being added to prompt.
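A minimal cap sketch, assuming a character-based limit and a head-plus-tail keep strategy (both are assumptions; real pipelines often extract structured fields instead of truncating):

```python
def cap_tool_output(payload: str, max_chars: int = 4000) -> str:
    """Truncate oversized tool output before it enters the prompt.
    Keeps head and tail, since errors often sit at the end of logs."""
    if len(payload) <= max_chars:
        return payload
    half = max_chars // 2
    omitted = len(payload) - max_chars
    return f"{payload[:half]}\n...[truncated {omitted} chars]...\n{payload[-half:]}"
```

The explicit truncation marker matters: the model sees that content was cut instead of silently reasoning over an incomplete log.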
Memory/history inflation
The agent accumulates history turn by turn but never compresses old sections, so each new step gets more expensive. If the run also loops for a long time without progress, this inflation grows even faster.
Typical cause: memory is used as an "archive", not as budgeted context.
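A budgeted-history sketch: keep the latest turns verbatim and collapse older ones into a single summary stub. Here summarize is a naive stand-in (a real summarization tier would call a model or an extractive summarizer):

```python
def summarize(turns: list[str]) -> str:
    """Naive stand-in summarizer: first sentence of each old turn."""
    return " ".join(t.split(".")[0].strip() for t in turns)

def budget_history(turns: list[str], keep_recent: int = 3) -> list[str]:
    """Keep recent turns verbatim; compress the rest into one stub."""
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}", *recent]
```

The point is the shape, not the summarizer: history cost stays roughly constant per turn instead of growing with run length.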
Silent degradation via truncation
When context exceeds the real window, important prompt parts are cut. Often policy constraints or critical instructions disappear first.
Typical cause: no explicit control over what to drop and in what order.
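An explicit drop order makes truncation deterministic instead of silent. A sketch, assuming named chunk tiers (the tier names and their order are illustrative):

```python
# Lowest-value chunks go first; policy and latest facts are never dropped.
DROP_ORDER = ["old_history", "raw_tool_output", "retrieval", "recent_history"]
PROTECTED = {"policy", "latest_facts"}

def chunks_to_drop(chunk_tokens: dict[str, int], budget: int) -> list[str]:
    """Return chunk names to drop, in explicit order, until context fits."""
    total = sum(chunk_tokens.values())
    dropped = []
    for name in DROP_ORDER:
        if total <= budget:
            break
        if name in chunk_tokens and name not in PROTECTED:
            total -= chunk_tokens[name]
            dropped.append(name)
    return dropped
```

With this in place, policy constraints can no longer be the first thing to disappear when the window overflows.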
How to detect these problems
Token overuse is visible through combined cost, latency, and context metrics.
| Metric | Token overuse signal | What to do |
|---|---|---|
| prompt_tokens_per_run | steady growth of tokens per run | add max_prompt_tokens and a budgeted context builder |
| tool_output_tokens | large raw payloads in the prompt | caps + extraction/summarization before the model |
| tokens_per_success | same quality, but growing cost | check unit economics and reduce unnecessary context |
| context_truncation_rate | frequent prompt truncation | prioritize policy and latest facts, compress old context |
| latency_p95 | latency rises with token count | reduce context and limit fan-out in retrieval or tool output |
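The two unit-economics metrics from the table can be computed directly from run records. A sketch, where the record shape is an assumption:

```python
def tokens_per_success(runs: list[dict]) -> float:
    """Total tokens spent divided by the number of successful runs."""
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return float("inf")
    return sum(r["tokens"] for r in runs) / successes

def context_truncation_rate(runs: list[dict]) -> float:
    """Share of runs where the prompt had to be truncated."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if r["truncated"]) / len(runs)
```

Note that failed runs still count toward the token numerator: wasted runs are exactly what makes tokens_per_success drift upward.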
How to distinguish overuse from a truly complex task
Not every large prompt is bad. The key question: do extra tokens create real quality gain?
Normal if:
- a complex request truly needs more sources and checks;
- tokens_per_success grows together with accuracy;
- extra context adds new facts instead of repeating known content.
Dangerous if:
- tokens grow faster than success rate;
- much of the context duplicates old turns or raw technical dumps;
- latency and cost grow, while answers barely change.
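These signals can be folded into a rough heuristic over aggregated stats for one task class. A sketch; the 1.5x token-growth and 2-point success-gain thresholds are arbitrary starting points to tune:

```python
def looks_like_overuse(tokens_trend: list[int], success_trend: list[float]) -> bool:
    """True when tokens grow noticeably while success rate stays flat."""
    token_growth = tokens_trend[-1] / max(tokens_trend[0], 1)
    success_gain = success_trend[-1] - success_trend[0]
    return token_growth > 1.5 and success_gain < 0.02

# Tokens more than doubled, quality flat: dangerous.
assert looks_like_overuse([4000, 6000, 9000], [0.90, 0.90, 0.91])
# Tokens grew, but accuracy grew with them: normal for a harder task mix.
assert not looks_like_overuse([4000, 9000], [0.70, 0.85])
```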
How to stop these failures
In practice, it looks like this:
- set hard limits: max_prompt_tokens plus caps for history, tool output, and retrieval;
- add a context builder with priorities (policy and fresh facts above old logs);
- compress old or large fragments into a summarization tier;
- when the budget is exceeded, return a stop reason or a partial result instead of sending an "overflowed" prompt.
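The priority-based context builder from this list can be sketched as a greedy pass over tiered chunks (the tier numbering is an assumption: 0 = policy, 1 = fresh facts, higher numbers = less important):

```python
def build_context(chunks: list[tuple[int, int, str]], budget: int) -> list[str]:
    """Greedy budgeted builder over (priority_tier, tokens, text) chunks.
    Lower tiers are placed first; whatever does not fit is skipped."""
    used = 0
    kept = []
    for tier, tokens, text in sorted(chunks, key=lambda c: c[0]):
        if used + tokens > budget:
            continue  # skip this chunk; higher-priority ones already landed
        used += tokens
        kept.append(text)
    return kept
```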
Minimal guard for token budget:
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLimits:
    max_prompt_tokens: int = 7000
    max_history_tokens: int = 1800
    max_tool_tokens: int = 2500
    max_retrieval_tokens: int = 2200


class TokenBudgetGuard:
    def __init__(self, limits: TokenLimits = TokenLimits()):
        self.limits = limits
        self.total_prompt_tokens = 0
        self.by_source = {
            "history": 0,
            "tool": 0,
            "retrieval": 0,
        }

    def _cap_for(self, source: str) -> int:
        if source == "history":
            return self.limits.max_history_tokens
        if source == "tool":
            return self.limits.max_tool_tokens
        if source == "retrieval":
            return self.limits.max_retrieval_tokens
        return self.limits.max_prompt_tokens

    def add_chunk(self, source: str, tokens: int) -> str | None:
        # Per-source cap first, then the global prompt budget.
        if self.by_source.get(source, 0) + tokens > self._cap_for(source):
            return f"token_overuse:{source}_cap"
        if self.total_prompt_tokens + tokens > self.limits.max_prompt_tokens:
            return "token_overuse:prompt_budget_exceeded"
        self.by_source[source] = self.by_source.get(source, 0) + tokens
        self.total_prompt_tokens += tokens
        return None
This is a baseline guard. In production, it is usually extended with provider-accurate token counting, a summarization tier for old chunks, and separate stop reasons for truncation.
add_chunk(...) is called before a fragment is added to the prompt, so the budget works as a gate, not as a post-fact check.
Where this is implemented in architecture
In production, token overuse control is almost always split across three system layers.
The Memory Layer controls what is stored long-term and what is passed into the current prompt. If memory means "show everything", costs will grow.
The Tool Execution Layer normalizes and compresses large payloads before they enter model context: output caps are applied here, needed facts are extracted, and content is condensed before it reaches the prompt.
The Agent Runtime holds execution budgets: max_prompt_tokens, stop reasons, controlled completion, and fallback on limit breach.
This is where token budget becomes a production rule, not a suggestion.
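At the runtime layer, the rule reduces to a gate in the step loop: on breach, return a controlled stop instead of sending the prompt. A sketch where the field names and fallback value are illustrative:

```python
def gate_step(prompt_tokens: int, max_prompt_tokens: int = 7000) -> dict:
    """Runtime gate: a budget breach yields a stop reason and a fallback,
    never an oversized model call."""
    if prompt_tokens > max_prompt_tokens:
        return {
            "status": "stopped",
            "stop_reason": "token_overuse:prompt_budget_exceeded",
            "fallback": "return_partial_result",
        }
    return {"status": "ok"}
```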
Self-check
A quick pre-release sanity check, not a formal audit: verify the controls from the sections above (prompt budget, per-source caps, output compression, explicit stop reasons) before release.
FAQ
Q: Can we just switch to a model with bigger context window?
A: You can, but it is usually slower and more expensive. Without budget control, the problem does not disappear; it just moves to a larger limit.
Q: What is better to start with: tokens or chars?
A: Provider-accurate token counting is best. If it is unavailable, start with conservative char caps and move to tokens later.
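A conservative char-based fallback might look like this (the 3.0 chars-per-token ratio is a deliberate safety margin, not a provider constant):

```python
def estimate_tokens(text: str, chars_per_token: float = 3.0) -> int:
    """Rough token estimate when no provider tokenizer is available.
    English averages roughly 4 chars/token; dividing by 3.0 overestimates
    token counts on purpose, so char-based caps trip early, not late."""
    return max(1, round(len(text) / chars_per_token))
```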
Q: What should be compressed first on budget overflow?
A: Usually old turns and large raw tool outputs. Policy and latest facts should stay.
Q: What should the user see when the prompt budget is exhausted?
A: A stop reason, what has already been processed, and a safe next step: a partial result, a narrower request, or a rerun with smaller context.
Token overuse almost never looks like a loud outage. It is a slow degradation that inflates latency and cost without obvious service crash. That is why production agents need not only better models, but strict context-budget control.
Related pages
If this issue appears in production, it also helps to review:
- Why AI agents fail - general map of production failures.
- Budget explosion - how token overuse becomes a financial incident.
- Tool spam - how unnecessary tool calls inflate context and cost.
- Context poisoning - how poor context degrades agent decisions.
- Memory Layer - where to separate long-term memory from prompt context.
- Agent Runtime - where to set token limits, stop reasons, and fallback.