Problem
The request looks simple: find order status and return a short answer.
Logs show the agent repeating the same cycle:
plan → call_tool → analyze → plan → call_tool → analyze
A week ago this task type closed in 3-4 steps.
Now the same request can spin for 20+ steps and end with timeout.
In 15 minutes, the agent can take 60+ steps and spend around $12 on a task that usually costs about $0.08.
The system does not fail immediately.
It just slowly burns time, tokens, and money.
Analogy: imagine a navigator that says "turn around" at every intersection, even after you already have. The car is moving, but you are not getting closer to the goal. An infinite loop in an agent works the same way: actions exist, progress does not.
Why this happens
LLM agents are stochastic systems. Even a small change in the prompt, tool output, or context can shift the step order. If the runtime does not check for real progress, the loop gets stuck easily.
In production, it usually looks like this:
- the LLM proposes the next action;
- the agent calls a tool;
- it gets an observation, but no new signal;
- it returns to the same reasoning loop again.
An infinite loop appears not when the agent "thinks too long", but when the runtime cannot distinguish useful work from repetition without progress.
Which failures happen most often
In practice, teams usually see four patterns in infinite-loop scenarios.
Hard loop
The agent calls the same tool with the same arguments many times.
Typical cause: no tool+args dedupe, or unlimited repeats.
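One way to implement the dedupe is to hash the tool name together with canonically serialized arguments. This is a minimal sketch; `tool_signature` is a hypothetical helper, not part of any specific framework.

```python
import hashlib
import json

def tool_signature(tool_name: str, args: dict) -> str:
    """Stable signature for a tool call: name plus canonicalized arguments."""
    payload = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool_name}:{payload}".encode()).hexdigest()

# Identical tool+args pairs collide regardless of argument order:
sig_a = tool_signature("get_order_status", {"order_id": "A-42", "full": True})
sig_b = tool_signature("get_order_status", {"full": True, "order_id": "A-42"})
assert sig_a == sig_b
```

The runtime can then count occurrences of each signature and refuse the call once a repeat limit is hit.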
Soft loop
The agent performs the same action with minimal argument changes: for example, adds one word in search and retries.
Typical cause: no check for "did anything new appear".
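A minimal check is to normalize arguments before computing the call signature, so cosmetic rewrites collapse into the same key. This is a sketch: the filler-word list and word sorting are illustrative assumptions, and a real normalizer would be task-specific.

```python
def normalize_query(text: str) -> str:
    # Lowercase, split, drop trivial filler words, and sort, so that
    # cosmetic rewrites of the same query produce the same key.
    filler = {"the", "a", "an", "please"}
    words = [w for w in text.lower().split() if w not in filler]
    return " ".join(sorted(words))

# Two cosmetically different retries collide on the same key:
assert normalize_query("Order status please") == normalize_query("the order STATUS")
```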
Retry storm
A tool fails, and retries happen both in gateway and in the agent itself. As a result, call count multiplies.
Typical cause: retry logic spread across multiple layers without a single policy.
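A sketch of a single retry policy owned by one layer, assuming tool errors expose an HTTP-like status code (`ToolError` and the status set are illustrative):

```python
import time

class ToolError(Exception):
    """Illustrative tool failure carrying an HTTP-like status code."""
    def __init__(self, status: int):
        super().__init__(f"tool failed with status {status}")
        self.status = status

# Auth and validation failures repeat deterministically; retrying them feeds the loop.
NON_RETRYABLE = {400, 401, 403, 422}

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """One retry policy for the whole stack: bounded attempts, exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolError as err:
            if err.status in NON_RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

If this function is the only place retries happen, the gateway and the agent cannot multiply each other's attempts.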
Semantic loop
The agent looks active, but does not move forward: rephrases plan, re-summarizes the same data, or asks again what is already known.
Typical cause: no clear progress criterion in runtime.
How to detect these problems
An infinite loop is easier to detect through a combination of signals, not a single metric.
| Metric | Looping signal | What to do |
|---|---|---|
| `steps_per_task` | sharp growth in steps without completion | add a hard `max_steps` and a stop reason |
| `repeated_tool_signature_rate` | repeated tool+args within one run | enable dedupe and a repeat limit |
| `no_progress_steps` | several steps without new facts/artifacts | stop the run by a no-progress window rule |
| `stop_reason_distribution` | many `timeout` and `max_steps_reached` | review retry policy and runtime gates |
| `tokens_per_task` | cost rises while quality is flat | limit context/tool output and add a progress check |
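For example, `repeated_tool_signature_rate` can be computed per run from the list of tool-call signatures. This is a sketch; the log format (a flat list of signatures per run) is an assumption.

```python
from collections import Counter

def repeated_tool_signature_rate(signatures: list[str]) -> float:
    """Share of tool calls in a run that repeat an already-seen tool+args signature."""
    if not signatures:
        return 0.0
    counts = Counter(signatures)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(signatures)

# 5 calls, of which 2 repeat the same search signature -> rate 0.4
run = ["search:a", "fetch:b", "search:a", "search:a", "final:c"]
```

A rate near zero is normal; a sudden jump on one task type is a strong looping signal.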
How to tell failure from a truly hard task
A long run does not always mean a loop. The key question is whether a new useful signal appears.
Normal if:
- every 1-2 steps add new facts or artifacts;
- `tool_call`s change meaningfully, not cosmetically;
- the agent gradually approaches `final_answer`.
Dangerous if:
- 3-5 steps in a row add nothing new;
- the same `tool_call` repeats (or the same intent repeats);
- cost grows while answer quality does not improve.
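One way to make "adds nothing new" checkable is to compare the facts extracted from each observation against what the run already knows. This is a sketch; how facts are extracted from observations is task-specific and assumed here.

```python
def has_new_signal(known_facts: set, observation_facts: set) -> bool:
    # Progress means the observation contains at least one fact we did not have.
    return not observation_facts <= known_facts

known = {"order_id=A-42"}
assert has_new_signal(known, {"order_id=A-42", "status=shipped"})  # new fact: progress
assert not has_new_signal(known, {"order_id=A-42"})                # nothing new: flat step
```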
How to stop these failures
The goal is simple: do not continue the run at any cost; finish it in a controlled way.
In practice:
- set hard runtime limits: `max_steps`, `timeout`, `max_tool_calls`, `max_tokens`;
- add `tool+args` dedupe and a repeat limit;
- stop the run if there is no progress for N steps;
- return a controlled stop reason and a partial result, not a "silent" failure.
Minimal loop guard in runtime:
```python
class LoopGuard:
    def __init__(self):
        self.max_steps = 12
        self.max_repeat = 3
        self.max_flat_steps = 4
        self.steps = 0
        self.flat_steps = 0
        self.seen = {}

    def on_step(self):
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_reached"
        return None

    def on_tool_call(self, signature: str):
        self.seen[signature] = self.seen.get(signature, 0) + 1
        if self.seen[signature] >= self.max_repeat:
            return "loop_detected:repeated_tool_signature"
        return None

    def on_progress(self, has_new_signal: bool):
        self.flat_steps = 0 if has_new_signal else self.flat_steps + 1
        if self.flat_steps >= self.max_flat_steps:
            return "loop_detected:no_progress"
        return None
```
Important: on each iteration call `on_step()` first, then `on_tool_call(...)`, and after result analysis call `on_progress(...)`.
This guard does not "heal" the agent. It prevents the loop from becoming a production incident.
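Wired together, the call order looks roughly like this. A sketch: `run_steps` and the step tuples are illustrative, and the guard class is repeated here only so the snippet runs standalone.

```python
class LoopGuard:  # same guard as in the section above, repeated to keep the sketch self-contained
    def __init__(self):
        self.max_steps = 12
        self.max_repeat = 3
        self.max_flat_steps = 4
        self.steps = 0
        self.flat_steps = 0
        self.seen = {}

    def on_step(self):
        self.steps += 1
        return "max_steps_reached" if self.steps > self.max_steps else None

    def on_tool_call(self, signature: str):
        self.seen[signature] = self.seen.get(signature, 0) + 1
        if self.seen[signature] >= self.max_repeat:
            return "loop_detected:repeated_tool_signature"
        return None

    def on_progress(self, has_new_signal: bool):
        self.flat_steps = 0 if has_new_signal else self.flat_steps + 1
        return "loop_detected:no_progress" if self.flat_steps >= self.max_flat_steps else None


def run_steps(steps):
    """Drive the guard in the documented order: on_step -> on_tool_call -> on_progress."""
    guard = LoopGuard()
    for signature, made_progress in steps:
        reason = (guard.on_step()
                  or guard.on_tool_call(signature)
                  or guard.on_progress(made_progress))
        if reason:
            return {"status": "stopped", "stop_reason": reason}
    return {"status": "finished", "stop_reason": "final_answer"}


# A run that repeats the same tool+args signature is stopped on the third call:
result = run_steps([("search:q=order A-42", True)] * 5)
# result["stop_reason"] == "loop_detected:repeated_tool_signature"
```

Note that the run ends with a structured stop reason and can still return a partial result, rather than failing silently.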
Where this is implemented in architecture
In production systems, loop control usually lives not in the agent itself, but in separate architecture layers.
Agent Runtime handles the agent execution loop: limits (`max_steps`, `timeout`, `max_tokens`), stop reasons, and forced run termination. This is where LoopGuard and progress checks are usually implemented.
Tool Execution Layer handles safe `tool_call` execution: call dedupe, retry policy, and error normalization. Many loops - retry storms, repeated tool calls, and tool spam - originate here when there is no unified retry policy or deduplication.
FAQ
Q: Does switching to a stronger model solve the infinite-loop problem?
A: Sometimes it helps partially, but it does not solve the root issue. Without runtime gates, even a strong model can loop.
Q: How do I choose `max_steps` initially?
A: Start with a conservative low limit and increase only where you see confirmed quality gain.
Q: Should retries always be used?
A: No. For 401/403 and stable validation errors, retries usually make the loop worse.
Q: What should users see when a run is stopped?
A: Stop reason, what was already tried, and a partial result. This reduces repeat runs without changes.
An infinite loop almost never looks like a big outage. It is a slow degradation that eats budget and time. So production agents need not only a "smart" model, but also strict runtime control.
Related pages
To close this problem deeper, see:
- Why AI agents fail - general map of production failures.
- Tool spam - how to limit duplicated tool calls.
- Budget explosion - how loop turns into uncontrolled spend.
- Agent Runtime - where to implement loop guards and stop reasons.
- Tool Execution Layer - where to keep retries, timeouts, and call validation.