Problem
The request looks simple: find order status and return a short answer.
Logs show the agent repeating the same cycle:
plan → call_tool → analyze → plan → call_tool → analyze
A week ago this task type closed in 3-4 steps.
Now the same request can spin for 20+ steps and end with timeout.
In 15 minutes, the agent can take 60+ steps and spend around $12 on a task that usually costs about $0.08.
The system does not fail immediately.
It just slowly burns time, tokens, and money.
Analogy: imagine a navigator that says "turn around" at every intersection, even after you already have. The car is moving, but you are not getting closer to the goal. An infinite loop in an agent works the same way: actions exist, progress does not.
Why this happens
LLM agents are stochastic systems. Even a small change in the prompt, tool output, or context can shift the step order. If the runtime does not check for real progress, the loop gets stuck easily.
In production, it usually looks like this:
- the LLM proposes the next action;
- the agent calls a tool;
- it gets an observation, but no new signal;
- it returns to the same reasoning loop again.
An infinite loop appears not when the agent "thinks too long", but when the runtime cannot distinguish useful work from repetition without progress.
Which failures happen most often
In practice, teams usually see four patterns in infinite-loop scenarios.
Hard loop
The agent calls the same tool with the same arguments many times.
Typical cause: no tool+args dedupe, or unlimited repeats.
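One way to implement the dedupe is to hash the tool name together with canonically serialized arguments. This is a minimal sketch; `tool_signature` is a hypothetical helper, not part of any specific framework.

```python
import hashlib
import json

def tool_signature(tool_name: str, args: dict) -> str:
    """Stable signature for a tool call: name plus canonicalized arguments."""
    payload = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool_name}:{payload}".encode()).hexdigest()

# Identical tool+args pairs collide regardless of argument order:
sig_a = tool_signature("get_order_status", {"order_id": "A-42", "full": True})
sig_b = tool_signature("get_order_status", {"full": True, "order_id": "A-42"})
assert sig_a == sig_b
```

The runtime can then count occurrences of each signature and refuse the call once a repeat limit is hit.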
Soft loop
The agent performs the same action with minimal argument changes: for example, adds one word in search and retries.
Typical cause: no check for "did anything new appear".
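A minimal check is to normalize arguments before computing the call signature, so cosmetic rewrites collapse into the same key. This is a sketch: the filler-word list and word sorting are illustrative assumptions, and a real normalizer would be task-specific.

```python
def normalize_query(text: str) -> str:
    # Lowercase, split, drop trivial filler words, and sort, so that
    # cosmetic rewrites of the same query produce the same key.
    filler = {"the", "a", "an", "please"}
    words = [w for w in text.lower().split() if w not in filler]
    return " ".join(sorted(words))

# Two cosmetically different retries collide on the same key:
assert normalize_query("Order status please") == normalize_query("the order STATUS")
```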
Retry storm
A tool fails, and retries happen both in gateway and in the agent itself. As a result, call count multiplies.
Typical cause: retry logic spread across multiple layers without a single policy.
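A sketch of a single retry policy owned by one layer, assuming tool errors expose an HTTP-like status code (`ToolError` and the status set are illustrative):

```python
import time

class ToolError(Exception):
    """Illustrative tool failure carrying an HTTP-like status code."""
    def __init__(self, status: int):
        super().__init__(f"tool failed with status {status}")
        self.status = status

# Auth and validation failures repeat deterministically; retrying them feeds the loop.
NON_RETRYABLE = {400, 401, 403, 422}

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """One retry policy for the whole stack: bounded attempts, exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolError as err:
            if err.status in NON_RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

If this function is the only place retries happen, the gateway and the agent cannot multiply each other's attempts.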
Semantic loop
The agent looks active, but does not move forward: rephrases plan, re-summarizes the same data, or asks again what is already known.
Typical cause: no clear progress criterion in runtime.
How to detect these problems
An infinite loop is easier to detect through a combination of signals, not a single metric.
| Metric | Looping signal | What to do |
|---|---|---|
| `steps_per_task` | sharp growth in steps without completion | add a hard `max_steps` and a stop reason |
| `repeated_tool_signature_rate` | repeated tool+args within one run | enable dedupe and a repeat limit |
| `no_progress_steps` | several steps without new facts/artifacts | stop the run by a no-progress window rule |
| `stop_reason_distribution` | many `timeout` and `max_steps_reached` | review retry policy and runtime gates |
| `tokens_per_task` | cost rises while quality is flat | limit context/tool output and add a progress check |
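For example, `repeated_tool_signature_rate` can be computed per run from the list of tool-call signatures. This is a sketch; the log format (a flat list of signatures per run) is an assumption.

```python
from collections import Counter

def repeated_tool_signature_rate(signatures: list[str]) -> float:
    """Share of tool calls in a run that repeat an already-seen tool+args signature."""
    if not signatures:
        return 0.0
    counts = Counter(signatures)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(signatures)

# 5 calls, of which 2 repeat the same search signature -> rate 0.4
run = ["search:a", "fetch:b", "search:a", "search:a", "final:c"]
```

A rate near zero is normal; a sudden jump on one task type is a strong looping signal.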
How to tell failure from a truly hard task
A long run does not always mean a loop. The key question is whether a new useful signal appears.
Normal if:
- every 1-2 steps add new facts or artifacts;
- `tool_call`s change meaningfully, not cosmetically;
- the agent gradually approaches `final_answer`.
Dangerous if:
- 3-5 steps in a row add nothing new;
- the same `tool_call` repeats (or the same intent repeats);
- cost grows while answer quality does not improve.
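One way to make "adds nothing new" checkable is to compare the facts extracted from each observation against what the run already knows. This is a sketch; how facts are extracted from observations is task-specific and assumed here.

```python
def has_new_signal(known_facts: set, observation_facts: set) -> bool:
    # Progress means the observation contains at least one fact we did not have.
    return not observation_facts <= known_facts

known = {"order_id=A-42"}
assert has_new_signal(known, {"order_id=A-42", "status=shipped"})  # new fact: progress
assert not has_new_signal(known, {"order_id=A-42"})                # nothing new: flat step
```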
How to stop these failures
The goal is simple: do not continue the run at any cost; finish it in a controlled way.
In practice:
- set hard runtime limits: `max_steps`, `timeout`, `max_tool_calls`, `max_tokens`;
- add `tool+args` dedupe and a repeat limit;
- stop the run if there is no progress for N steps;
- return a controlled stop reason and a partial result, not a "silent" failure.
Minimal loop guard in runtime:
```python
class LoopGuard:
    def __init__(self):
        self.max_steps = 12
        self.max_repeat = 3
        self.max_flat_steps = 4
        self.steps = 0
        self.flat_steps = 0
        self.seen = {}

    def on_step(self):
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_reached"
        return None

    def on_tool_call(self, signature: str):
        self.seen[signature] = self.seen.get(signature, 0) + 1
        if self.seen[signature] >= self.max_repeat:
            return "loop_detected:repeated_tool_signature"
        return None

    def on_progress(self, has_new_signal: bool):
        self.flat_steps = 0 if has_new_signal else self.flat_steps + 1
        if self.flat_steps >= self.max_flat_steps:
            return "loop_detected:no_progress"
        return None
```
Important: on each iteration call `on_step()` first, then `on_tool_call(...)`, and after result analysis call `on_progress(...)`.
This guard does not "heal" the agent. It prevents the loop from becoming a production incident.
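Wired together, the call order looks roughly like this. A sketch: `run_steps` and the step tuples are illustrative, and the guard class is repeated here only so the snippet runs standalone.

```python
class LoopGuard:  # same guard as in the section above, repeated to keep the sketch self-contained
    def __init__(self):
        self.max_steps = 12
        self.max_repeat = 3
        self.max_flat_steps = 4
        self.steps = 0
        self.flat_steps = 0
        self.seen = {}

    def on_step(self):
        self.steps += 1
        return "max_steps_reached" if self.steps > self.max_steps else None

    def on_tool_call(self, signature: str):
        self.seen[signature] = self.seen.get(signature, 0) + 1
        if self.seen[signature] >= self.max_repeat:
            return "loop_detected:repeated_tool_signature"
        return None

    def on_progress(self, has_new_signal: bool):
        self.flat_steps = 0 if has_new_signal else self.flat_steps + 1
        return "loop_detected:no_progress" if self.flat_steps >= self.max_flat_steps else None


def run_steps(steps):
    """Drive the guard in the documented order: on_step -> on_tool_call -> on_progress."""
    guard = LoopGuard()
    for signature, made_progress in steps:
        reason = (guard.on_step()
                  or guard.on_tool_call(signature)
                  or guard.on_progress(made_progress))
        if reason:
            return {"status": "stopped", "stop_reason": reason}
    return {"status": "finished", "stop_reason": "final_answer"}


# A run that repeats the same tool+args signature is stopped on the third call:
result = run_steps([("search:q=order A-42", True)] * 5)
# result["stop_reason"] == "loop_detected:repeated_tool_signature"
```

Note that the run ends with a structured stop reason and can still return a partial result, rather than failing silently.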
Where this is implemented in architecture
In production systems, loop control usually lives not in the agent itself, but in separate architecture layers.
Agent Runtime handles the agent execution loop: limits (`max_steps`, `timeout`, `max_tokens`), stop reasons, and forced run termination. This is where LoopGuard and progress checks are usually implemented.
Tool Execution Layer handles safe `tool_call` execution: call dedupe, retry policy, and error normalization. Many loops - retry storms, repeated tool calls, and tool spam - originate here when there is no unified retry policy or deduplication.
FAQ
Q: Does switching to a stronger model solve the infinite-loop problem?
A: Sometimes it helps partially, but it does not solve the root issue. Without runtime gates, even a strong model can loop.
Q: How do I choose `max_steps` initially?
A: Start with a conservative low limit and increase only where you see confirmed quality gain.
Q: Should retries always be used?
A: No. For 401/403 and stable validation errors, retries usually make the loop worse.
Q: What should users see when a run is stopped?
A: Stop reason, what was already tried, and a partial result. This reduces repeat runs without changes.
An infinite loop almost never looks like a big outage. It is a slow degradation that eats budget and time. So production agents need not only a "smart" model, but also strict runtime control.
Related pages
To close this problem deeper, see:
- Why AI agents fail - general map of production failures.
- Tool spam - how to limit duplicated tool calls.
- Budget explosion - how loop turns into uncontrolled spend.
- Agent Runtime - where to implement loop guards and stop reasons.
- Tool Execution Layer - where to keep retries, timeouts, and call validation.