Infinite Agent Loop: when an AI agent does not stop

An infinite loop happens when an agent keeps generating new steps without making real progress. This page explains why it happens and how production systems stop it.
On this page
  1. Problem
  2. Why this happens
  3. Which failures happen most often
  4. Hard loop
  5. Soft loop
  6. Retry storm
  7. Semantic loop
  8. How to detect these problems
  9. How to tell failure from a truly hard task
  10. How to stop these failures
  11. Where this is implemented in architecture
  12. Self-check
  13. FAQ
  14. Related pages

Problem

The request looks simple: find order status and return a short answer.

Logs show the agent repeating the same cycle:

plan → call_tool → analyze → plan → call_tool → analyze

A week ago this task type closed in 3-4 steps. Now the same request can spin for 20+ steps and end in a timeout. In 15 minutes, the agent can take 60+ steps and spend around $12 on a task that usually costs about $0.08.

The system does not fail immediately.

It just slowly burns time, tokens, and money.

Analogy: imagine a navigator that says "turn around" at every intersection, even when you already did. The car is moving, but you are not getting closer to the goal. An infinite loop in an agent works the same way: there is action, but no progress.

Why this happens

LLM agents are stochastic systems. Even a small change in the prompt, tool output, or context can shift the order of steps. If the runtime does not check for real progress, the loop gets stuck easily.

In production, it usually looks like this:

  1. LLM proposes the next action;
  2. agent calls a tool;
  3. it gets an observation that carries no new signal;
  4. it returns to the same reasoning loop again.

An infinite loop appears not when the agent "thinks too long", but when the runtime cannot distinguish useful work from repetition without progress.
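The four steps above can be sketched as a minimal runtime loop. Everything here is illustrative (`llm_plan` and `call_tool` are hypothetical helpers); the point is that nothing in this loop checks whether an observation adds new information, so it only ends if the model happens to emit a final answer:

```python
def run_agent(task, llm_plan, call_tool):
    """Naive agent loop with NO progress check (illustrative sketch)."""
    history = []
    while True:  # no max_steps, no dedupe, no progress gate
        action = llm_plan(task, history)       # 1. LLM proposes the next action
        if action["type"] == "final_answer":
            return action["text"]
        observation = call_tool(action)        # 2-3. call a tool, get an observation
        history.append((action, observation))  # 4. back to the same reasoning loop
```

If `llm_plan` keeps proposing the same tool call, this loop runs until an external timeout kills it.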

Which failures happen most often

In practice, teams usually see four patterns in infinite-loop scenarios.

Hard loop

The agent calls the same tool with the same arguments many times.

Typical cause: no tool+args dedupe, or unlimited repeats.
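A dedupe key for hard loops can be built from the tool name plus a hash of canonicalized arguments, so that the same call is recognized regardless of dict key order. This is a sketch; the exact signature format is an assumption:

```python
import hashlib
import json

def tool_signature(tool: str, args: dict) -> str:
    # Canonicalize args so {"a": 1, "b": 2} and {"b": 2, "a": 1}
    # produce the same signature and dedupe together.
    payload = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return f"{tool}:{hashlib.sha256(payload.encode()).hexdigest()[:16]}"
```

A repeat counter keyed on this signature (as in the guard below) is enough to catch hard loops.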

Soft loop

The agent performs the same action with minimal argument changes: for example, adds one word in search and retries.

Typical cause: no check for "did anything new appear".
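A minimal "did anything new appear" check can compare each observation against previous ones. The sketch below uses crude word-set (Jaccard) similarity; that choice is an assumption, and production systems often use embeddings instead:

```python
def is_new_signal(observation: str, seen: list, threshold: float = 0.9) -> bool:
    """Return True only if the observation is not near-identical to a previous one."""
    words = set(observation.lower().split())
    for prev in seen:
        prev_words = set(prev.lower().split())
        union = words | prev_words
        # Jaccard similarity: 1.0 means identical word sets.
        if union and len(words & prev_words) / len(union) >= threshold:
            return False
    return True
```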

Retry storm

A tool fails, and retries happen both in gateway and in the agent itself. As a result, call count multiplies.

Typical cause: retry logic spread across multiple layers without a single policy.
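A way to avoid retry storms is to apply one retry policy in exactly one layer, and never retry non-retryable statuses. The sketch below is an assumed policy (status sets and backoff values are illustrative):

```python
import time

# Transient statuses worth retrying; 401/403 and validation errors are not.
RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(fn, max_retries: int = 2, backoff_s=(0.2, 0.8)):
    """Single retry policy, applied in ONE layer. fn returns (status, result)."""
    for attempt in range(max_retries + 1):
        status, result = fn()
        if status < 400:
            return result
        if status not in RETRYABLE or attempt == max_retries:
            raise RuntimeError(f"tool_failed:{status}")
        time.sleep(backoff_s[min(attempt, len(backoff_s) - 1)])
```

If the gateway already retries, the agent must not: otherwise the effective call count is the product of both layers.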

Semantic loop

The agent looks active, but does not move forward: rephrases plan, re-summarizes the same data, or asks again what is already known.

Typical cause: no clear progress criterion in runtime.
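One minimal progress criterion is to track a set of known facts or artifacts and count a step as progress only if it adds something new. The fact-extraction step itself is assumed here; the sketch only shows the criterion:

```python
def made_progress(known_facts: set, step_facts: set) -> bool:
    """Progress = at least one fact not already known. Mutates known_facts."""
    new = step_facts - known_facts
    known_facts |= new
    return bool(new)
```

Re-summarizing the same data yields an empty `new` set, so semantic loops show up as a run of `False` results.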

How to detect these problems

An infinite loop is easier to see via a combination of signals, not a single metric.

Metric | Looping signal | What to do
steps_per_task | sharp growth in steps without completion | add a hard max_steps and a stop reason
repeated_tool_signature_rate | repeated tool+args within one run | enable dedupe and a repeat limit
no_progress_steps | several steps without new facts/artifacts | stop the run by a no-progress window rule
stop_reason_distribution | many timeout and max_steps_reached | review retry policy and runtime gates
tokens_per_task | cost rises while quality is flat | limit context/tool output and add a progress check
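As an example, repeated_tool_signature_rate can be computed per run from the ordered list of tool-call signatures in the logs (a sketch; the signature format is an assumption):

```python
from collections import Counter

def repeated_signature_rate(signatures: list) -> float:
    """Fraction of tool calls whose (tool, args) signature already appeared in the run."""
    if not signatures:
        return 0.0
    counts = Counter(signatures)
    repeats = sum(c - 1 for c in counts.values())  # every occurrence after the first
    return repeats / len(signatures)
```

A healthy run stays near 0; a hard loop pushes this toward 1.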

How to tell failure from a truly hard task

A long run does not always mean a loop. The key question is whether a new useful signal appears.

Normal if:

  • every 1-2 steps adds new facts or artifacts;
  • tool calls change meaningfully, not cosmetically;
  • agent gradually approaches final_answer.

Dangerous if:

  • 3-5 steps in a row add nothing new;
  • the same tool repeats (or the same intent repeats);
  • cost grows and answer quality does not improve.

How to stop these failures

The goal is simple: not to continue the run at any cost, but to finish it in a controlled way.

In practice:

  1. set hard runtime limits: max_steps, timeout, max_tool_calls, max_tokens;
  2. add tool+args dedupe and repeat limit;
  3. stop run if there is no progress for N steps;
  4. return a controlled stop reason and partial result, not a "silent" failure.

Minimal loop guard in runtime:

PYTHON
class LoopGuard:
    """Runtime gate: hard step limit plus repetition and no-progress detection."""

    def __init__(self):
        self.max_steps = 12        # hard cap on steps per run
        self.max_repeat = 3        # max identical tool+args calls
        self.max_flat_steps = 4    # max consecutive steps without a new signal
        self.steps = 0
        self.flat_steps = 0
        self.seen = {}             # tool-call signature -> occurrence count

    def on_step(self):
        # Call first on every iteration.
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_reached"
        return None

    def on_tool_call(self, signature: str):
        # signature = tool name + canonicalized arguments.
        self.seen[signature] = self.seen.get(signature, 0) + 1
        if self.seen[signature] >= self.max_repeat:
            return "loop_detected:repeated_tool_signature"
        return None

    def on_progress(self, has_new_signal: bool):
        # Call after analyzing each tool result.
        self.flat_steps = 0 if has_new_signal else self.flat_steps + 1
        if self.flat_steps >= self.max_flat_steps:
            return "loop_detected:no_progress"
        return None
Important: on each iteration call on_step() first, then on_tool_call(...), and after result analysis call on_progress(...).

This guard does not "heal" the agent. It prevents the loop from becoming a production incident.
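For illustration, here is one way the guard can be wired into an agent loop, following the call order above. The `llm_plan` and `call_tool` helpers and the `new_signal` field are hypothetical; the guard is any object with the three methods shown earlier:

```python
def run_with_guard(task, llm_plan, call_tool, guard):
    """Agent loop with controlled stops: returns a stop reason plus a partial result."""
    history = []
    while True:
        if (reason := guard.on_step()):          # 1. step budget first
            return {"stop_reason": reason, "partial": history}
        action = llm_plan(task, history)
        if action["type"] == "final_answer":
            return {"stop_reason": "final_answer", "answer": action["text"]}
        # Signature = tool name + canonicalized args (format is an assumption).
        sig = f'{action["name"]}:{sorted(action["args"].items())}'
        if (reason := guard.on_tool_call(sig)):  # 2. dedupe before calling
            return {"stop_reason": reason, "partial": history}
        obs = call_tool(action)
        history.append(obs)
        if (reason := guard.on_progress(obs.get("new_signal", False))):  # 3. progress gate
            return {"stop_reason": reason, "partial": history}
```

Note that every exit path carries an explicit stop reason and the work done so far, never a silent failure.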

Where this is implemented in architecture

In production systems, loop control usually lives not in the agent itself, but in separate architecture layers.

Agent Runtime handles the agent execution loop: limits (max_steps, timeout, max_tokens), stop reasons, and forced run termination. This is where LoopGuard and progress checks usually live.

Tool Execution Layer handles safe tool_call execution: call dedupe, retry policy, and error normalization. Many loops (retry storms, repeated tool calls, tool spam) originate here when there is no unified retry policy or deduplication.

Self-check

A quick pre-release sanity check, not a formal audit:

  • hard limits are set: max_steps, timeout, max_tool_calls, max_tokens;
  • tool+args dedupe and a repeat limit are enabled;
  • the run stops after N steps without progress;
  • retries follow a single policy in one layer;
  • every forced stop returns an explicit stop reason;
  • users get a partial result instead of a silent failure;
  • steps_per_task and tokens_per_task are monitored;
  • stop_reason distribution is reviewed regularly.

FAQ

Q: Does switching to a stronger model solve infinite loops?
A: Sometimes it helps partially, but it does not solve the root issue. Without runtime gates, even a strong model can loop.

Q: How to choose max_steps initially?
A: Start with a conservative low limit and increase only where you see confirmed quality gain.

Q: Should retries always be used?
A: No. For 401/403 and stable validation errors, retries usually make the loop worse.

Q: What should users see when a run is stopped?
A: Stop reason, what was already tried, and a partial result. This reduces repeat runs without changes.


An infinite loop almost never looks like a big outage. It is a slow degradation that eats budget and time. So production agents need not only a "smart" model but also strict runtime control.

To dig deeper into this problem, see:

⏱️ 6 min read • Updated March 12, 2026 • Difficulty: ★★☆
Implement in OnceOnly
Guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python β€” conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.