Agent Deadlocks: When Agents Block Each Other

A deadlock appears when multiple agents wait on each other and the system cannot move forward. This page explains why it happens in multi-agent systems and how to prevent it.
On this page
  1. Problem
  2. Why this happens
  3. Which failures happen most often
  4. Circular wait
  5. Lock without lease (no TTL)
  6. Unbounded waiting
  7. Cross-agent retry loop
  8. How to detect these problems
  9. How to tell deadlock from a genuinely long task
  10. How to stop these failures
  11. Where this is implemented in architecture
  12. Self-check
  13. FAQ
  14. Related pages

Problem

In traces from one run, you can see a waiting cycle: Agent A -> Agent B -> Agent C -> Agent A.

Within 20 minutes, the number of runs stuck in waiting can exceed 40. The queue backlog grows, workers stay busy, and useful work drops to almost zero.

From the outside, everything looks "quiet": no explicit error and no service crash. But the run never finishes because all three agents are waiting for each other.

The system does not crash.

It just hangs and quietly burns resources.

Analogy: imagine three people standing in a doorway and politely letting each other pass. Nobody argues and nobody makes a "mistake", but nobody gets through. Deadlock in multi-agent systems looks exactly like this.

Why this happens

Deadlock appears not because agents "think too long", but because the system has no clear owner responsible for moving state forward.

In production, it usually looks like this:

  1. agents exchange messages and depend on each other;
  2. one wait is delayed (tool, approval, lock);
  3. other agents also move into waiting;
  4. without timeout and workflow state ownership, the workflow gets stuck.

The problem is not one specific agent. The problem is uncontrolled coordination between agents.

Which failures happen most often

In practice, deadlocks usually appear in one of four patterns.

Circular wait

Agent A waits for B, B waits for C, C waits for A. Everyone is "busy", but there is no progress.

Typical cause: the dependency graph has a cycle, and there is no single orchestrator.

Lock without lease (no TTL)

An agent acquires a lock on a document/ticket and crashes. Other agents wait for this lock forever.

Typical cause: lock has no lease/TTL and no owner recovery mechanism.

Unbounded waiting

You have an HTTP timeout, but no timeout on internal agent waits. The workflow can wait "forever".

Typical cause: timeouts exist at the transport level, but not at the orchestration state level.

Cross-agent retry loop

Agents hand off the task to each other with "check again", which becomes infinite ping-pong.

Typical cause: no retry cap and no stop reason for blocked-state scenarios.
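One way to cap the ping-pong is a handoff budget carried in the task itself. Field names such as `handoffs_left` and `stop_reason` are illustrative, not from a specific framework.

```python
# Handoff budget carried in the task. Field names are illustrative.

MAX_HANDOFFS = 3

def handoff(task: dict, to_agent: str) -> dict:
    """Pass the task on, or stop the run once the budget is exhausted."""
    hops = task.get("handoffs_left", MAX_HANDOFFS)
    if hops <= 0:
        return {**task, "status": "stopped", "stop_reason": "handoff_budget_exhausted"}
    return {**task, "assignee": to_agent, "handoffs_left": hops - 1}

task = {"id": "t1"}
for agent in ["reviewer", "author", "reviewer", "author"]:  # attempted ping-pong
    task = handoff(task, agent)

print(task["stop_reason"])  # handoff_budget_exhausted: the loop ends with a reason
```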

How to detect these problems

Deadlock is visible through a combined view of workflow and runtime metrics.

  • waiting_runs: number of runs in waiting keeps growing. Action: add wait timeout and stop reason for blocked state.
  • wait_duration_p95: wait duration is above the normal range. Action: bound waiting time on every state transition.
  • blocked_transition_rate: frequent blocking between the same agents. Action: inspect the dependency graph for cycles.
  • lease_conflict_rate: frequent conflicts or expired leases. Action: add TTL, renewal, and a recovery policy.
  • queue_backlog: queue grows under normal incoming traffic. Action: clear stuck runs and enable fallback mode.
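These signals can be combined into a single risk check. The sketch below uses the metric names from the list above; the thresholds are purely illustrative and should come from your own baselines.

```python
# Combine the signals above into one verdict. Metric names follow the list;
# thresholds are illustrative and should come from your own baselines.

THRESHOLDS = {
    "waiting_runs": 40,            # absolute count of runs stuck in waiting
    "wait_duration_p95": 120.0,    # seconds
    "blocked_transition_rate": 0.2,
    "lease_conflict_rate": 0.1,
    "queue_backlog": 500,
}

def deadlock_risk(metrics: dict[str, float]) -> list[str]:
    """Names of the metrics that crossed their threshold in this snapshot."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

snapshot = {"waiting_runs": 55, "wait_duration_p95": 300.0, "queue_backlog": 120}
print(deadlock_risk(snapshot))  # ['waiting_runs', 'wait_duration_p95']
```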

How to tell deadlock from a genuinely long task

Not every long run is a deadlock. The key criterion: are there state transitions and useful progress?

Normal if:

  • workflow status changes as expected;
  • after waiting, a new artifact or step appears;
  • there is a clear owner of the current transition.

Dangerous if:

  • run stays in the same waiting state for too long;
  • multiple agents are simultaneously waiting for each other;
  • there is no clear stop reason for why the system is not moving.

How to stop these failures

In practice, it looks like this:

  1. introduce one owner for transitions (orchestrator or leader);
  2. set timeout on every waiting state;
  3. use lease/TTL for shared resources;
  4. when there is no progress, end the run in a controlled way: stop reason + fallback.

Minimal guard for wait-state:

PYTHON
import time


class WaitGuard:
    """Tracks how long each run has been in a waiting state."""

    def __init__(self, wait_timeout_s: int = 30):
        self.wait_timeout_s = wait_timeout_s
        self.wait_started_at: dict[str, float] = {}

    def mark_wait_start(self, run_id: str) -> None:
        """Record the moment a run enters a waiting state."""
        self.wait_started_at[run_id] = time.time()

    def check_wait(self, run_id: str) -> str | None:
        """Return a stop reason if the run has waited too long, else None."""
        started = self.wait_started_at.get(run_id)
        if started is None:
            return None
        if time.time() - started > self.wait_timeout_s:
            return "deadlock_risk:wait_timeout"
        return None

In production, mark_wait_start(...) is usually called when transitioning to waiting, and check_wait(...) is called in a scheduler or heartbeat loop to terminate a stuck run in time.
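That heartbeat pass can be sketched as follows. Here `check_wait` is any callable with the `WaitGuard.check_wait` contract (run_id in, stop reason or None out), and `stop_run` stands in for your runtime's termination hook; both are assumptions for illustration.

```python
# Sketch of one heartbeat pass. `check_wait` follows the WaitGuard.check_wait
# contract (run_id -> stop reason or None); `stop_run` is a stand-in for the
# runtime's termination hook.

def heartbeat_tick(waiting_run_ids, check_wait, stop_run):
    """One scheduler pass: force-stop runs whose wait exceeded its timeout."""
    stopped = []
    for run_id in list(waiting_run_ids):
        reason = check_wait(run_id)
        if reason is not None:
            stop_run(run_id, reason)   # record the stop reason, free the worker
            stopped.append(run_id)
    return stopped

# Toy tick with stubbed verdicts: r2 has timed out, r1 has not.
verdicts = {"r1": None, "r2": "deadlock_risk:wait_timeout"}
log = []
heartbeat_tick(["r1", "r2"], verdicts.get, lambda rid, why: log.append((rid, why)))
print(log)  # [('r2', 'deadlock_risk:wait_timeout')]
```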

Where this is implemented in architecture

Deadlock control in production is usually split across several layers.

Agent Runtime manages run lifecycle: timeouts, stop reasons, forced termination of stuck states, and fallback transitions. This is where deadlock_risk:* rules are usually applied.

Orchestration Topologies defines who owns state transitions and how agents interact without circular waiting. If topology has no clear state owner, deadlock becomes a matter of time.

Tool Execution Layer covers the technical part: lease/TTL for shared resources, one retry policy, and waiting control at tool level.

Self-check

A quick pre-release sanity check rather than a formal audit: before shipping, verify the core controls from this page (wait timeouts on every waiting state, leases with TTL on shared resources, a single state owner per workflow, retry and handoff caps, explicit stop reasons and fallback paths).


FAQ

Q: Do deadlocks happen only in large multi-agent systems?
A: No. Even 2-3 agents can create a waiting cycle if there is no explicit state owner.

Q: Is adding timeouts enough?
A: Timeouts limit hanging, but do not remove the root cause. You still need orchestrator and explicit state transitions.

Q: Do leases fully solve deadlock?
A: No. Leases solve lock problems after crashes, but do not fix logical cycles between agents.

Q: What should I do if a run is already stuck in waiting?
A: Force-stop the run with a stop reason, release lease, switch workflow to fallback, and inspect waiting chain in traces.


Deadlock almost never looks like a loud outage. More often, it is a silent progress stop that consumes workers and budget. That is why production multi-agent systems need not only role separation, but strict orchestration discipline.

To go deeper on this problem:

⏱️ 6 min read • Updated March 12, 2026 • Difficulty: ★★☆
Implement in OnceOnly

Guardrails for loops, retries, and spend escalation.

YAML
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
Example policy (concept)

PYTHON
# Example (Python, conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.