Token Overuse: When Agents Spend Too Many Tokens

Token overuse happens when agents waste tokens on long reasoning loops or unnecessary context. Learn how to control token usage in production.
On this page
  1. Problem
  2. Why this happens
  3. Which failures happen most often
  4. Prompt bloat
  5. Token bombs in tool output
  6. Memory/history inflation
  7. Silent degradation via truncation
  8. How to detect these problems
  9. How to distinguish overuse from a truly complex task
  10. How to stop these failures
  11. Where this is implemented in architecture
  12. Self-check
  13. FAQ
  14. Related pages

Problem

The request looks simple: check incident status and provide a short summary.

But traces show something else: in 8 minutes, one run spent more than 38k tokens, while similar tasks previously stayed around 4-6k. Half of the budget went not into the answer, but into context: history, raw logs, and large tool outputs.

For this task class, that can mean roughly $2.80 per run instead of the usual $0.20, with barely any improvement in answer quality.

The system does not crash.

It just slowly inflates prompt size, latency, and cost.

Analogy: imagine a suitcase where, before each trip, you add "one more important thing" but never remove anything. At first this is almost invisible. Then you spend more time, money, and effort just carrying the extra weight. Token overuse in agents works exactly like that.

Why this happens

Token overuse usually comes not from the model itself, but from weak context-budget control in the runtime.

In production, it usually looks like this:

  1. every new step adds everything: history, retrieval results, tool output;
  2. raw payloads (logs, HTML, JSON) enter the prompt without compression;
  3. without per-source caps and summarization, context grows each turn;
  4. tokens are spent on carrying excess, not on useful progress;
  5. long reasoning loops without no-progress checks inflate history even further.

In traces, this usually shows up as steady prompt_tokens growth: each new turn is more expensive than the previous one, even when the task itself does not become harder.

Without control, this growth starts hurting latency, cost, and quality.
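The compounding effect of "append everything" can be sketched with a toy simulation. The numbers below are illustrative, not taken from a real trace:

```python
def simulate_prompt_growth(turns: int, base_tokens: int = 500,
                           added_per_turn: int = 400) -> list[int]:
    """Toy model: every turn re-sends all accumulated history plus new output."""
    history = 0
    per_turn = []
    for _ in range(turns):
        prompt = base_tokens + history   # system prompt + accumulated history
        history += added_per_turn        # nothing is ever compressed or dropped
        per_turn.append(prompt)
    return per_turn


growth = simulate_prompt_growth(10)
# Each turn's prompt is strictly larger than the previous one,
# even though the task itself has not changed.
```

This is exactly the "suitcase" pattern: the cost of each turn is dominated by everything carried over from earlier turns, not by the new work.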

Which failures happen most often

In production, teams most often see four recurring token overuse patterns.

Prompt bloat

Too much context is sent to the model "just in case".

Typical cause: no max_prompt_tokens and no chunk prioritization.
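A minimal sketch of chunk prioritization under a hard max_prompt_tokens. The chunk names and priority values are illustrative assumptions, not a fixed scheme:

```python
def build_context(chunks: list[tuple[int, int, str]],
                  max_prompt_tokens: int) -> list[str]:
    """Select chunks by priority (lower number = more important) within a budget.

    chunks: (priority, tokens, text) tuples.
    """
    selected, used = [], 0
    for priority, tokens, text in sorted(chunks, key=lambda c: c[0]):
        if used + tokens > max_prompt_tokens:
            continue  # skip anything that would blow the budget
        selected.append(text)
        used += tokens
    return selected


chunks = [
    (0, 300, "policy"),        # must-keep instructions
    (1, 400, "latest_facts"),
    (2, 5000, "raw_logs"),     # the "just in case" payload
]
# With max_prompt_tokens=1000, raw_logs is dropped and policy survives.
```

The key property is that "just in case" material competes for budget last, instead of crowding out instructions.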

Token bombs in tool output

Tools return large payloads (HTML, logs, stack traces) that are pushed into the prompt almost as-is.

Typical cause: no caps, and the payload does not go through extraction or summarization before being added to the prompt.
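A minimal pre-prompt cap for tool output. The 4-chars-per-token estimate is a rough assumption, not provider-accurate counting:

```python
def cap_tool_output(payload: str, max_tokens: int = 2500,
                    chars_per_token: int = 4) -> str:
    """Hard-cap a raw tool payload before it can enter the prompt.

    Real systems would extract or summarize instead of blind truncation;
    this only guarantees the payload cannot act as a token bomb.
    """
    max_chars = max_tokens * chars_per_token
    if len(payload) <= max_chars:
        return payload
    return payload[:max_chars] + "\n[tool output truncated: cap reached]"
```

Even this crude version converts an unbounded failure (a 200k-character stack trace) into a bounded one, which is what makes extraction and summarization worth layering on top.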

Memory/history inflation

The agent accumulates history turn by turn but never compresses old sections, so each new step gets more expensive. If the run also loops for a long time without progress, this inflation grows even faster.

Typical cause: memory is used as an "archive", not as budgeted context.
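A sketch of treating memory as budgeted context rather than an archive: old turns collapse into one compact summary slot. The summary string below is a placeholder; a production system would generate it with a model:

```python
def compress_history(turns: list[str], keep_recent: int = 3) -> list[str]:
    """Keep the last N turns verbatim; collapse everything older into one line.

    The '[summary ...]' placeholder stands in for a model-generated summary,
    but the budgeting shape is the same either way.
    """
    if len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = f"[summary of {len(old)} earlier turns]"
    return [summary] + recent
```

With this shape, the history slot of the prompt stops growing linearly with turn count: it is bounded by `keep_recent` turns plus one summary.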

Silent degradation via truncation

When context exceeds the real window, important prompt parts are cut. Often policy constraints or critical instructions disappear first.

Typical cause: no explicit control over what to drop and in what order.

How to detect these problems

Token overuse is visible through combined cost, latency, and context metrics.

| Metric | Token overuse signal | What to do |
| --- | --- | --- |
| prompt_tokens_per_run | steady growth of tokens per run | add max_prompt_tokens and a budgeted context builder |
| tool_output_tokens | large raw payloads in the prompt | caps + extraction/summarization before the model |
| tokens_per_success | same quality, but growing cost | check unit economics and cut unnecessary context |
| context_truncation_rate | frequent prompt truncation | prioritize policy and latest facts, compress old context |
| latency_p95 | latency rises with token count | reduce context and limit fan-out in retrieval or tool output |

How to distinguish overuse from a truly complex task

Not every large prompt is bad. The key question: do extra tokens create real quality gain?

Normal if:

  • complex request truly needs more sources and checks;
  • tokens_per_success grows together with accuracy;
  • extra context adds new facts instead of repeating known content.

Dangerous if:

  • tokens grow faster than success rate;
  • much of the context duplicates old turns or raw technical dumps;
  • latency and cost grow, while answers barely change.
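The "dangerous" signals above reduce to one hedged heuristic: flag a task class when token spend grows much faster than success rate. The threshold and growth encoding here are illustrative choices, not a standard:

```python
def looks_like_overuse(tokens_growth: float, success_growth: float,
                       ratio_threshold: float = 2.0) -> bool:
    """Growth values are fractional changes per period (0.5 == +50%).

    Overuse signal: tokens grow while success does not, or tokens grow
    more than `ratio_threshold` times faster than success.
    """
    if success_growth <= 0:
        return tokens_growth > 0  # cost grows while quality does not
    return tokens_growth / success_growth > ratio_threshold
```

This keeps genuinely complex tasks unflagged: when extra tokens buy proportional accuracy, the ratio stays below the threshold.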

How to stop these failures

In practice, it looks like this:

  1. set hard limits: max_prompt_tokens plus caps for history/tool/retrieval;
  2. add a context builder with priorities (policy and fresh facts above old logs);
  3. compress old or large fragments into a summarization tier;
  4. when the budget is exceeded, return a stop reason or a partial result instead of sending an "overflowed" prompt.

Minimal guard for token budget:

PYTHON
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLimits:
    max_prompt_tokens: int = 7000
    max_history_tokens: int = 1800
    max_tool_tokens: int = 2500
    max_retrieval_tokens: int = 2200


class TokenBudgetGuard:
    def __init__(self, limits: TokenLimits = TokenLimits()):
        self.limits = limits
        self.total_prompt_tokens = 0
        self.by_source = {
            "history": 0,
            "tool": 0,
            "retrieval": 0,
        }

    def _cap_for(self, source: str) -> int:
        if source == "history":
            return self.limits.max_history_tokens
        if source == "tool":
            return self.limits.max_tool_tokens
        if source == "retrieval":
            return self.limits.max_retrieval_tokens
        return self.limits.max_prompt_tokens

    def add_chunk(self, source: str, tokens: int) -> str | None:
        """Gate a fragment before it enters the prompt; return a stop reason or None."""
        # Per-source cap first: one noisy source must not eat the whole budget.
        if self.by_source.get(source, 0) + tokens > self._cap_for(source):
            return f"token_overuse:{source}_cap"

        # Then the global prompt budget.
        if self.total_prompt_tokens + tokens > self.limits.max_prompt_tokens:
            return "token_overuse:prompt_budget_exceeded"

        self.by_source[source] = self.by_source.get(source, 0) + tokens
        self.total_prompt_tokens += tokens
        return None

This is a baseline guard. In production, it is usually extended with provider-accurate token counting, a summarization tier for old chunks, and separate stop reasons for truncation. add_chunk(...) is called before a fragment is added to the prompt, so the budget works as a gate, not as an after-the-fact check.

Where this is implemented in architecture

In production, token overuse control is almost always split across three system layers.

Memory Layer controls what is stored long-term and what is passed into the current prompt. If memory means "show everything", costs will grow.

Tool Execution Layer is responsible for normalizing and compressing large payloads before they enter model context. This is where output caps are applied, needed facts are extracted, and content is compressed before it enters the prompt.

Agent Runtime holds execution budgets: max_prompt_tokens, stop reasons, controlled completion, and fallback on limit breach. This is where token budget becomes a production rule, not a suggestion.

Self-check

Quick pre-release check. Tick the items and see the status below.
This is a short sanity check, not a formal audit.


FAQ

Q: Can we just switch to a model with bigger context window?
A: You can, but it is usually slower and more expensive. Without budget control, the problem does not disappear, it just moves to a larger limit.

Q: What is better to start with: tokens or chars?
A: Provider-accurate token counting is best. If it is unavailable, start with conservative character caps and migrate to tokens later.
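A conservative character-based fallback can look like this. The 3-chars-per-token divisor is a deliberate assumption: common rules of thumb use ~4 characters per token for English, so dividing by 3 over-estimates and makes caps fail safe:

```python
def rough_token_estimate(text: str, chars_per_token: float = 3.0) -> int:
    """Conservative token estimate for when provider counting is unavailable.

    Over-estimates on purpose (divisor below the usual ~4 chars/token
    rule of thumb) so that budget caps trip early rather than late.
    """
    return max(1, round(len(text) / chars_per_token))
```

Once a provider-accurate counter is wired in, only this function needs to change; the caps and guards stay the same.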

Q: What should be compressed first on budget overflow?
A: Usually old turns and large raw tool outputs. Policy and latest facts should stay.

Q: What should user see when prompt budget is exhausted?
A: Stop reason, what is already processed, and a safe next step: partial result, narrower request, or rerun with smaller context.


Token overuse almost never looks like a loud outage. It is a slow degradation that inflates latency and cost without an obvious service crash. That is why production agents need not only better models, but strict context-budget control.

If this issue appears in production, it also helps to review:

⏱️ 7 min read • Updated March 12, 2026 • Difficulty: ★★☆

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.