Problem
In traces from one run, you can see a waiting cycle:
Agent A -> Agent B -> Agent C -> Agent A.
In 20 minutes, the number of runs in waiting can reach 40+.
Queue backlog grows, workers stay busy, and useful work is almost zero.
From the outside, everything looks "quiet": no explicit error and no service crash. But the run never finishes because all three agents are waiting for each other.
The system does not crash.
It just hangs and quietly burns resources.
Analogy: imagine three people standing in a doorway and politely letting each other pass. Nobody argues and nobody makes a "mistake", but nobody gets through. Deadlock in multi-agent systems looks exactly like this.
Why this happens
Deadlock appears not because agents "think too long", but because the system has no clear owner that should move state forward.
In production, it usually looks like this:
- agents exchange messages and depend on each other;
- one wait is delayed (tool, approval, lock);
- other agents also move into waiting;
- without timeout and workflow state ownership, the workflow gets stuck.
The problem is not one specific agent. The problem is uncontrolled coordination between agents.
Which failures happen most often
In practice, deadlocks usually appear in one of four patterns.
Circular wait
Agent A waits for B, B waits for C, C waits for A. Everyone is "busy", but there is no progress.
Typical cause: the dependency graph has a cycle, and there is no single orchestrator.
Lock without lease (no TTL)
An agent acquires a lock on a document/ticket and crashes. Other agents wait for this lock forever.
Typical cause: lock has no lease/TTL and no owner recovery mechanism.
Unbounded waiting
You have an HTTP timeout, but no timeout on internal agent waits. The workflow can wait "forever".
Typical cause: timeouts exist at the transport level, but not at the orchestration state level.
Cross-agent retry loop
Agents hand off the task to each other with "check again", which becomes infinite ping-pong.
Typical cause: no retry cap and no stop reason for blocked-state scenarios.
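Breaking the ping-pong takes only a counter and an explicit stop reason. A sketch under assumed names (`MAX_HANDOFFS` and the stop-reason string are illustrative values to tune per workflow):

```python
# Sketch: cap cross-agent "check again" handoffs per run.
MAX_HANDOFFS = 5  # illustrative threshold

def next_handoff(handoff_count: int) -> str:
    if handoff_count >= MAX_HANDOFFS:
        # End the ping-pong in a controlled way instead of
        # looping between agents forever.
        return "stop:retry_cap_exceeded"
    return "continue"
```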
How to detect these problems
Deadlock is visible through a combined view of workflow and runtime metrics.
| Metric | Deadlock signal | What to do |
|---|---|---|
| waiting_runs | number of runs in waiting keeps growing | add wait timeout and stop reason for blocked state |
| wait_duration_p95 | wait duration is above normal range | bound waiting time on every state transition |
| blocked_transition_rate | frequent blocking between the same agents | inspect dependency graph for cycles |
| lease_conflict_rate | frequent conflicts or expired leases | add TTL, renew, and recovery policy |
| queue_backlog | queue grows with normal incoming traffic | clear stuck runs, enable fallback mode |
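No single metric is conclusive on its own; the deadlock signal comes from combining them. A sketch with illustrative thresholds (tune them to your workload; the metric names follow the table above):

```python
# Sketch: combine workflow and runtime metrics into one deadlock signal.
# Thresholds are illustrative examples, not recommended defaults.
def deadlock_signal(metrics: dict[str, float]) -> bool:
    return (
        metrics.get("waiting_runs", 0) > 40
        and metrics.get("wait_duration_p95", 0) > 300  # seconds
        and metrics.get("queue_backlog", 0) > 0
    )
```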
How to tell deadlock from a genuinely long task
Not every long run is a deadlock. The key criterion: are there state transitions and useful progress?
Normal if:
- workflow status changes as expected;
- after waiting, a new artifact or step appears;
- there is a clear owner of the current transition.
Dangerous if:
- run stays in the same waiting state for too long;
- multiple agents are simultaneously waiting for each other;
- there is no clear stop reason for why the system is not moving.
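These criteria can be folded into a simple heuristic, assuming the run exposes its status, time in the current state, and the number of peers waiting on each other (field names here are illustrative):

```python
# Sketch: distinguish a genuinely long task from a stuck run by
# checking whether state transitions still happen.
def looks_stuck(status: str, seconds_in_state: float,
                waiting_peers: int, max_wait_s: float = 300.0) -> bool:
    return (
        status == "waiting"
        and seconds_in_state > max_wait_s
        and waiting_peers >= 2  # multiple agents waiting on each other
    )
```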
How to stop these failures
In practice, it looks like this:
- introduce one owner for transitions (orchestrator or leader);
- set timeout on every waiting state;
- use lease/TTL for shared resources;
- when there is no progress, end the run in a controlled way: stop reason + fallback.
Minimal guard for wait-state:
```python
import time

class WaitGuard:
    """Tracks how long each run has been in the waiting state."""

    def __init__(self, wait_timeout_s: int = 30):
        self.wait_timeout_s = wait_timeout_s
        self.wait_started_at: dict[str, float] = {}

    def mark_wait_start(self, run_id: str):
        # Called when a run transitions into waiting.
        self.wait_started_at[run_id] = time.time()

    def check_wait(self, run_id: str):
        # Returns a deadlock-risk marker if the run has waited too long.
        started = self.wait_started_at.get(run_id)
        if started is None:
            return None
        if time.time() - started > self.wait_timeout_s:
            return "deadlock_risk:wait_timeout"
        return None
```
In production, mark_wait_start(...) is usually called when transitioning to waiting,
and check_wait(...) is called in a scheduler or heartbeat loop to terminate a stuck run in time.
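That heartbeat loop can be sketched as follows, assuming the WaitGuard above plus a hypothetical `terminate_run` callback (both the loop shape and the callback are illustrative):

```python
import time

# Sketch of a scheduler heartbeat that terminates stuck runs.
# `guard` is expected to expose check_wait(run_id); `terminate_run`
# is a hypothetical callback that records the stop reason.
def heartbeat(guard, active_runs: set, terminate_run, interval_s: float = 5.0):
    while active_runs:
        for run_id in list(active_runs):
            reason = guard.check_wait(run_id)
            if reason is not None:
                # Controlled termination: stop reason + removal
                # from the active set.
                terminate_run(run_id, stop_reason=reason)
                active_runs.discard(run_id)
        time.sleep(interval_s)
```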
Where this is implemented in architecture
Deadlock control in production is usually split across several layers.
Agent Runtime manages run lifecycle:
timeouts, stop reasons, forced termination of stuck states, and fallback transitions.
This is where deadlock_risk:* rules are usually applied.
Orchestration Topologies defines who owns state transitions and how agents interact without circular waiting. If topology has no clear state owner, deadlock becomes a matter of time.
Tool Execution Layer covers the technical part: lease/TTL for shared resources, one retry policy, and waiting control at tool level.
Self-check
Quick pre-release check. This is a short sanity check, not a formal audit:
- every waiting state has a timeout;
- shared resources use lease/TTL with a recovery policy;
- cross-agent retries have a cap and an explicit stop reason;
- state transitions have a single owner (orchestrator or leader);
- stuck runs end in a controlled way: stop reason + fallback.
If any of these basic controls are missing, close them before release.
FAQ
Q: Do deadlocks happen only in large multi-agent systems?
A: No. Even 2-3 agents can create a waiting cycle if there is no explicit state owner.
Q: Is adding timeouts enough?
A: Timeouts limit hanging, but do not remove the root cause. You still need orchestrator and explicit state transitions.
Q: Do leases fully solve deadlock?
A: No. Leases solve lock problems after crashes, but do not fix logical cycles between agents.
Q: What should I do if a run is already stuck in waiting?
A: Force-stop the run with a stop reason, release lease, switch workflow to fallback, and inspect waiting chain in traces.
Deadlock almost never looks like a loud outage. More often, it is a silent progress stop that consumes workers and budget. That is why production multi-agent systems need not only role separation, but strict orchestration discipline.
Related pages
To go deeper on this problem:
- Why AI agents fail - general map of production failures.
- Multi-agent chaos - how uncontrolled agent interaction destroys stability.
- Partial outage - how partial dependency degradation triggers waiting states.
- Agent Runtime - where to manage stop reasons, timeouts, and run lifecycle.
- Orchestration Topologies - how to design controlled coordination between agents.