Problem
In traces from one run, you can see a waiting cycle:
Agent A -> Agent B -> Agent C -> Agent A.
In 20 minutes, the number of runs in waiting can reach 40+.
Queue backlog grows, workers stay busy, and useful work is almost zero.
From the outside, everything looks "quiet": no explicit error and no service crash. But the run never finishes because all three agents are waiting for each other.
The system does not crash.
It just hangs and quietly burns resources.
Analogy: imagine three people standing in a doorway and politely letting each other pass. Nobody argues and nobody makes a "mistake", but nobody gets through. Deadlock in multi-agent systems looks exactly like this.
Why this happens
Deadlock appears not because agents "think too long", but because the system has no clear owner that should move state forward.
In production, it usually looks like this:
- agents exchange messages and depend on each other;
- one wait is delayed (tool, approval, lock);
- other agents also move into waiting;
- without timeout and workflow state ownership, the workflow gets stuck.
The problem is not one specific agent. The problem is uncontrolled coordination between agents.
Which failures happen most often
In practice, deadlocks usually appear in one of four patterns.
Circular wait
Agent A waits for B, B waits for C, C waits for A. Everyone is "busy", but there is no progress.
Typical cause: the dependency graph has a cycle, and there is no single orchestrator.
Lock without lease (no TTL)
An agent acquires a lock on a document/ticket and crashes. Other agents wait for this lock forever.
Typical cause: lock has no lease/TTL and no owner recovery mechanism.
Unbounded waiting
You have an HTTP timeout, but no timeout on internal agent waits. The workflow can wait "forever".
Typical cause: timeouts exist at the transport level, but not at the orchestration state level.
Cross-agent retry loop
Agents hand off the task to each other with "check again", which becomes infinite ping-pong.
Typical cause: no retry cap and no stop reason for blocked-state scenarios.
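Breaking the ping-pong takes only a counter and an explicit stop reason. A sketch under assumed names (`MAX_HANDOFFS` and the stop-reason string are illustrative values to tune per workflow):

```python
# Sketch: cap cross-agent "check again" handoffs per run.
MAX_HANDOFFS = 5  # illustrative threshold

def next_handoff(handoff_count: int) -> str:
    if handoff_count >= MAX_HANDOFFS:
        # End the ping-pong in a controlled way instead of
        # looping between agents forever.
        return "stop:retry_cap_exceeded"
    return "continue"
```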
How to detect these problems
Deadlock is visible through a combined view of workflow and runtime metrics.
| Metric | Deadlock signal | What to do |
|---|---|---|
| waiting_runs | number of runs in waiting keeps growing | add wait timeout and stop reason for blocked state |
| wait_duration_p95 | wait duration is above normal range | bound waiting time on every state transition |
| blocked_transition_rate | frequent blocking between the same agents | inspect dependency graph for cycles |
| lease_conflict_rate | frequent conflicts or expired leases | add TTL, renew, and recovery policy |
| queue_backlog | queue grows with normal incoming traffic | clear stuck runs, enable fallback mode |
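No single metric is conclusive on its own; the deadlock signal comes from combining them. A sketch with illustrative thresholds (tune them to your workload; the metric names follow the table above):

```python
# Sketch: combine workflow and runtime metrics into one deadlock signal.
# Thresholds are illustrative examples, not recommended defaults.
def deadlock_signal(metrics: dict[str, float]) -> bool:
    return (
        metrics.get("waiting_runs", 0) > 40
        and metrics.get("wait_duration_p95", 0) > 300  # seconds
        and metrics.get("queue_backlog", 0) > 0
    )
```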
How to tell deadlock from a genuinely long task
Not every long run is a deadlock. The key criterion: are there state transitions and useful progress?
Normal if:
- workflow status changes as expected;
- after waiting, a new artifact or step appears;
- there is a clear owner of the current transition.
Dangerous if:
- run stays in the same waiting state for too long;
- multiple agents are simultaneously waiting for each other;
- there is no clear stop reason for why the system is not moving.
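These criteria can be folded into a simple heuristic, assuming the run exposes its status, time in the current state, and the number of peers waiting on each other (field names here are illustrative):

```python
# Sketch: distinguish a genuinely long task from a stuck run by
# checking whether state transitions still happen.
def looks_stuck(status: str, seconds_in_state: float,
                waiting_peers: int, max_wait_s: float = 300.0) -> bool:
    return (
        status == "waiting"
        and seconds_in_state > max_wait_s
        and waiting_peers >= 2  # multiple agents waiting on each other
    )
```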
How to stop these failures
In practice, it looks like this:
- introduce one owner for transitions (orchestrator or leader);
- set timeout on every waiting state;
- use lease/TTL for shared resources;
- when there is no progress, end the run in a controlled way: stop reason + fallback.
Minimal guard for wait-state:
```python
import time

class WaitGuard:
    """Tracks how long each run has been in the waiting state."""

    def __init__(self, wait_timeout_s: int = 30):
        self.wait_timeout_s = wait_timeout_s
        self.wait_started_at: dict[str, float] = {}

    def mark_wait_start(self, run_id: str):
        # Called when a run transitions into waiting.
        self.wait_started_at[run_id] = time.time()

    def check_wait(self, run_id: str):
        # Returns a deadlock-risk marker if the run has waited too long.
        started = self.wait_started_at.get(run_id)
        if started is None:
            return None
        if time.time() - started > self.wait_timeout_s:
            return "deadlock_risk:wait_timeout"
        return None
```
In production, mark_wait_start(...) is usually called when transitioning to waiting,
and check_wait(...) is called in a scheduler or heartbeat loop to terminate a stuck run in time.
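That heartbeat loop can be sketched as follows, assuming the WaitGuard above plus a hypothetical `terminate_run` callback (both the loop shape and the callback are illustrative):

```python
import time

# Sketch of a scheduler heartbeat that terminates stuck runs.
# `guard` is expected to expose check_wait(run_id); `terminate_run`
# is a hypothetical callback that records the stop reason.
def heartbeat(guard, active_runs: set, terminate_run, interval_s: float = 5.0):
    while active_runs:
        for run_id in list(active_runs):
            reason = guard.check_wait(run_id)
            if reason is not None:
                # Controlled termination: stop reason + removal
                # from the active set.
                terminate_run(run_id, stop_reason=reason)
                active_runs.discard(run_id)
        time.sleep(interval_s)
```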
Where this is implemented in architecture
Deadlock control in production is usually split across several layers.
Agent Runtime manages run lifecycle:
timeouts, stop reasons, forced termination of stuck states, and fallback transitions.
This is where deadlock_risk:* rules are usually applied.
Orchestration Topologies defines who owns state transitions and how agents interact without circular waiting. If topology has no clear state owner, deadlock becomes a matter of time.
Tool Execution Layer covers the technical part: lease/TTL for shared resources, one retry policy, and waiting control at tool level.
Self-check
Quick pre-release check. This is a short sanity check, not a formal audit:
- every waiting state has a timeout;
- shared resources use lease/TTL with a recovery policy;
- cross-agent retries have a cap and an explicit stop reason;
- state transitions have a single owner (orchestrator or leader);
- stuck runs end in a controlled way: stop reason + fallback.
If any of these basic controls are missing, close them before release.
FAQ
Q: Do deadlocks happen only in large multi-agent systems?
A: No. Even 2-3 agents can create a waiting cycle if there is no explicit state owner.
Q: Is adding timeouts enough?
A: Timeouts limit hanging, but do not remove the root cause. You still need orchestrator and explicit state transitions.
Q: Do leases fully solve deadlock?
A: No. Leases solve lock problems after crashes, but do not fix logical cycles between agents.
Q: What should I do if a run is already stuck in waiting?
A: Force-stop the run with a stop reason, release lease, switch workflow to fallback, and inspect waiting chain in traces.
Deadlock almost never looks like a loud outage. More often, it is a silent progress stop that consumes workers and budget. That is why production multi-agent systems need not only role separation, but strict orchestration discipline.
Related pages
To go deeper on this problem:
- Why AI agents fail - general map of production failures.
- Multi-agent chaos - how uncontrolled agent interaction destroys stability.
- Partial outage - how partial dependency degradation triggers waiting states.
- Agent Runtime - where to manage stop reasons, timeouts, and run lifecycle.
- Orchestration Topologies - how to design controlled coordination between agents.