Problem
03:07 a.m.
The on-call engineer sees that the agent got a normal request, called a tool several times, received results, but still could not finish the task.
In the logs, the same chain repeats:
plan → call_tool → analyze → plan → call_tool → analyze
Many steps, many tokens, no result.
The agent tries to find an email in the CRM. Search returns 404, but instead of replying "not found," the agent starts changing the query:
john@example.com → John@example.com → JOHN@example.com → john@company.com
In 2 minutes, the agent made 47 API calls, spent around $5 on tokens, and still got no closer to an answer.
A runaway loop can burn through a budget planned for a full week in just 30-40 minutes.
Analogy: imagine a cashier who keeps scanning the same item forever, ignoring the customer's card limit. The cashier looks "busy," but every extra action only increases the loss. For AI agents, runtime is that control layer: stop conditions, budgets, and policy gates.
Why this happens
In production:
- The LLM proposes the next step;
- a tool call is executed;
- the result goes back into the reasoning loop;
- runtime does not check real progress and does not stop the loop in time.
The issue is not the model itself. The issue is that runtime does not know when to stop.
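To make the failure concrete, here is a minimal sketch of such a loop. The `llm` and `tools` arguments are hypothetical stand-ins for a model client and a tool registry; the point is that nothing in the runtime itself ever forces the run to stop.

```python
def run_agent(task, llm, tools):
    """Naive agent loop with no runtime control.

    `llm(history)` returns the model's proposed step as a dict;
    `tools` maps tool names to callables. Both are assumptions
    for illustration, not a specific framework's API.
    """
    history = [task]
    while True:                        # no max_steps, no budget, no timeout
        step = llm(history)            # the LLM proposes the next step
        if step["action"] == "final_answer":
            return step["content"]
        result = tools[step["action"]](step["args"])  # unconditional tool call
        history.append(result)         # result feeds the next iteration
```

If the model never proposes `final_answer`, this loop runs until an external resource gives out: exactly the 03:07 scenario.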
Which failures happen most often
To keep this practical, production teams usually start with three main failure types.
Loop failures
The agent repeats the same steps without new observations. From the outside it may look like it is still working, but in reality the system just spins in circles and burns time and money.
Typical cause: missing max_steps, no progress check, or no clear stop reason.
In production, this usually looks like an infinite loop.
Tool failures
The agent calls tools too often or incorrectly. Latency and API load grow, and failures start to spread across service chains.
Typical cause: a too-permissive Tool Execution Layer and weak argument validation.
This often turns into a tool failure.
Budget failures
Token and time budget grows without visible progress. As a result, the system gets more expensive, and dependent services hit timeouts more often.
Typical cause: no execution budgets for steps, tokens, time, and number of tool calls.
Without limits, this often escalates into budget explosion.
Context drift
When an agent runs for too long, message history grows. New tokens can push out the system prompt, and the agent starts to "forget" its role or original task. This is context drift. It is usually mitigated with summarization and context window limits; a close symptom pattern is context poisoning.
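A common mitigation can be sketched as follows: pin the system prompt and drop the oldest turns until the history fits the budget. The message shape and the injected `count_tokens` function are illustrative assumptions; the real counter is model-specific.

```python
def trim_context(messages, max_tokens, count_tokens):
    """Fit chat history into a token budget without losing the system prompt.

    `messages` is a list of {"role": ..., "content": ...} dicts;
    `count_tokens` is an injected, model-specific token counter.
    Both names are assumptions for this sketch.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while rest and total(system + rest) > max_tokens:
        rest.pop(0)  # drop the oldest turn first, never the system prompt
    return system + rest
```

In practice the dropped turns are usually summarized rather than discarded outright, but the invariant is the same: the system prompt must never be the thing that gets pushed out.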
It is also worth tracking two more classes:
- Security failures: prompt injection and unauthorized access to write tools.
- Data failures: incorrect or unvalidated intermediate data that breaks the final answer.
How to detect these problems
To catch these failures before they become incidents, production systems usually monitor a small set of key metrics.
| Metric | Signal | What to do |
|---|---|---|
| `steps_per_task` | sudden spike in iterations | review stop conditions, add progress check |
| `tool_calls_per_task` | suspiciously many repeats | add tool+args dedupe and call limits |
| `tokens_per_task` | usage grows without progress | limit context window size, add summarization and tool output caps |
| `runtime_duration` | latency rises, task stalls | set timeout and force run termination |
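These per-task counters are cheap to keep alongside the run. A minimal sketch, with purely illustrative thresholds (tune them to your own workload):

```python
class RunMetrics:
    """Per-task counters for the signals in the table above."""

    def __init__(self):
        self.steps_per_task = 0
        self.tool_calls_per_task = 0
        self.tokens_per_task = 0
        self.runtime_duration_ms = 0

    def alerts(self, *, max_steps=20, max_tool_calls=30,
               max_tokens=20_000, max_ms=60_000):
        """Return the names of metrics that crossed their threshold.

        The defaults are illustrative, not recommendations.
        """
        fired = []
        if self.steps_per_task > max_steps:
            fired.append("steps_per_task")
        if self.tool_calls_per_task > max_tool_calls:
            fired.append("tool_calls_per_task")
        if self.tokens_per_task > max_tokens:
            fired.append("tokens_per_task")
        if self.runtime_duration_ms > max_ms:
            fired.append("runtime_duration_ms")
        return fired
```

Anything returned by `alerts()` is a candidate for a dashboard alert before it becomes a 03:07 page.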
How to tell failure from a genuinely hard task
Not every long run is a failure. The key signal is not step count, but the absence of real progress.
Normal case:
- tool steps change observations;
- new data appears;
- the result gets closer to `final_answer`.
Dangerous case:
- repeats without new observations;
- same `tool_call` with unchanged arguments;
- cost rises, but result quality does not improve.
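One cheap way to catch the dangerous case mechanically is to fingerprint each tool call (name plus canonicalized arguments) and flag exact repeats. A sketch:

```python
import hashlib
import json


class CallDeduper:
    """Flags repeated tool calls with identical name and arguments."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen: dict[str, int] = {}

    def is_looping(self, tool: str, args: dict) -> bool:
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1}
        # produce the same fingerprint.
        key = hashlib.sha256(
            (tool + json.dumps(args, sort_keys=True)).encode()
        ).hexdigest()
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] > self.max_repeats
```

This would not catch the email-mutation loop from the opening story (the arguments change each time), which is why a progress check is still needed on top; dedupe handles only literal repeats.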
How to stop these failures
The simplest way to control an execution loop is runtime limits. Usually these are max_steps, max_tool_calls, max_tokens, and timeout.
max_steps is the first emergency brake against runaway loops. A more advanced option is a semantic progress check: a separate small model (for example, Gemini Flash or Claude Haiku) analyzes the last 3 agent steps and checks whether a new signal appeared or the system is just circling. The output can look like this:
```json
{
  "is_progressing": true,
  "is_looping": false
}
```
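A sketch of how that verdict could be wired in. The model call is injected as a plain callable so any small model can back it; the prompt text and function names are illustrative assumptions, not a specific vendor API.

```python
import json

PROMPT = (
    "Here are the agent's last 3 steps:\n{steps}\n"
    'Reply with JSON only: {{"is_progressing": bool, "is_looping": bool}}'
)


def check_progress(last_steps, ask_model):
    """Return True if the run should continue.

    `ask_model` is an injected callable wrapping a small LLM
    (e.g. Gemini Flash or Claude Haiku) that returns a JSON string.
    """
    raw = ask_model(PROMPT.format(steps="\n".join(last_steps)))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return True  # fail open: a malformed verdict should not kill the run
    return (verdict.get("is_progressing", True)
            and not verdict.get("is_looping", False))
```

Failing open on a parse error is a deliberate choice here: the checker is advisory, and the hard limits below remain the actual emergency brake.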
A basic runtime skeleton that blocks most runaway loops:
```python
class RunLimits:
    """Hard execution limits for a single agent run."""

    def __init__(self):
        self.max_steps = 8
        self.max_tool_calls = 12
        self.max_tokens = 4000
        self.max_seconds = 30
        self.steps = 0
        self.tool_calls = 0
        self.tokens_used = 0

    def check(self, step_tokens: int, elapsed_ms: int) -> str | None:
        """Call once per loop iteration; returns a stop reason or None."""
        self.steps += 1
        self.tokens_used += step_tokens   # track usage, keep the limit intact
        if self.steps > self.max_steps:
            return "max_steps_reached"
        if self.tool_calls > self.max_tool_calls:
            return "max_tool_calls_reached"
        if self.tokens_used >= self.max_tokens:
            return "max_tokens_reached"
        if elapsed_ms > self.max_seconds * 1000:
            return "timeout"
        return None

    def register_tool(self) -> None:
        """Call before each tool execution."""
        self.tool_calls += 1
```
In production, these limits are often kept in Redis to enforce them across stateless workers.
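The shared-counter pattern behind that looks like this: each worker atomically increments a per-run key and stops once the cap is hit. In this sketch an in-memory dict plus a lock stands in for Redis so the example is self-contained; with redis-py the increment would be a single atomic `INCR` (with an `EXPIRE` on the key), and the lock would be unnecessary.

```python
import threading


class SharedLimiter:
    """Cross-worker tool-call limiter; the dict emulates a Redis counter."""

    def __init__(self, max_calls: int = 12):
        self.max_calls = max_calls
        self._counts: dict[str, int] = {}
        self._lock = threading.Lock()  # Redis INCR is atomic on its own

    def allow(self, run_id: str) -> bool:
        """Increment the counter for this run; False once the cap is exceeded."""
        with self._lock:
            n = self._counts.get(run_id, 0) + 1
            self._counts[run_id] = n
        return n <= self.max_calls
```

The key point is that the counter lives outside the worker process, so restarting or scaling out stateless workers cannot reset a run's budget.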
But limits alone do not guarantee correct behavior. They only stop runaway loops. For stable agent behavior, you also need tool output validation, policy boundaries, and control over write actions.
Self-check
Quick pre-release check. This is a short sanity check, not a formal audit:
- [ ] max_steps is set
- [ ] max_tool_calls is set
- [ ] a token budget (max_tokens) is set
- [ ] a run timeout is set
- [ ] a progress check catches repeats without new observations
- [ ] tool+args dedupe is enabled
- [ ] every stopped run logs an explicit stop reason
- [ ] write actions go through policy boundaries
FAQ
Q: Why do AI agents fail more often than regular workflows?
A: In workflows, steps are fixed in advance. In agents, the LLM proposes the next step dynamically, and without runtime boundaries the loop quickly gets out of control.
Q: Will switching to a stronger model solve this?
A: It can help partly, but it does not solve the root issue. Without runtime control, even a strong model can loop, exceed budget, or spam tools.
If the agent from the start of this article had max_steps = 8 and tool+args dedupe, the 03:07 incident would have ended in seconds.
In production, agent stability is defined not by the model, but by the boundaries runtime puts around the execution loop.
Related pages
To better understand how to prevent these failures, look at the system layers that control agent behavior:
- Agent Runtime - controls the agent loop, limits, and stop reasons.
- Tool Execution Layer - executes `tool_call` safely via validation, policy, and timeout.
- Policy Boundaries - defines which actions are allowed and which are blocked by default.
- Memory Layer - helps keep state clean so the agent does not repeat steps without progress.
It is also useful to jump to focused failure scenarios:
- Infinite loop - how to detect and stop repeats without progress.
- Tool spam - how to limit duplicated tool calls.
- Budget explosion - how to control token spend and API budget.