The Problem
The request looks simple: check payment status and return a short answer to the customer.
Traces show something else: in 11 minutes one run made 33 calls,
10 returned 200, 14 timed out, and 9 returned 502/503.
For a task of this class, that is around $1.90 instead of the usual $0.14.
The service is formally "alive": some calls succeed, and there is no full outage. But the run queue grows, latency jumps, and users get unstable results.
The system does not crash.
It slowly gets stuck between rare successes and repeated failures.
Analogy: imagine a checkout where the terminal sometimes accepts cards and sometimes hangs. The store is not closed, but the line grows every minute. Partial outage in agent systems works the same way: the infrastructure looks available, but a stable path to an answer no longer exists.
Why This Happens
In production, it usually goes like this:
- a dependency becomes unstable (`timeout`, `5xx`, sometimes `200`);
- retries start in several layers at once;
- each run holds workers longer, and the queue grows;
- other workflows also slow down due to shared resources;
- without fail-fast and a safe mode, the system multiplies spend instead of isolating the failure.
In traces this appears as a mixed pattern: `tool_2xx_rate` is still nonzero,
but `timeout_rate`, `retry_attempts_per_run`, and `queue_backlog` increase at the same time.
The problem is not one timeout: it is that the runtime does not switch the unstable dependency into degraded mode while the failure is still local.
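The multi-layer retry effect can be checked with simple arithmetic: when each layer retries independently, per-layer attempt counts multiply. A minimal sketch (the three-layer counts are illustrative, not taken from the trace above):

```python
def amplification_factor(attempts_per_layer: list[int]) -> int:
    """Worst-case number of calls that reach the dependency when
    every layer retries independently: the product of per-layer attempts."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# HTTP client, gateway, and runtime each allowing 3 attempts:
# one logical request can become 27 calls to the dependency.
print(amplification_factor([3, 3, 3]))  # 27
```

This multiplication is why centralizing retries in one gateway matters: a single budget caps the product instead of letting each layer compound it.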
Most Common Failure Patterns
In production, four partial-outage patterns appear most often.
Intermittent success trap
The tool sometimes returns 200, and this masks the degradation.
The agent keeps pushing the same channel instead of making a controlled switch.
Typical cause: no dependency health threshold at the run level.
Retry amplification across layers
The HTTP client, the gateway, and the runtime each run their own retries. Even a small error spike quickly turns into a wave of extra calls.
Typical cause: the retry policy is not centralized.
Queue starvation by "noisy" runs
Problematic runs hang for a long time, occupy the worker pool, and push healthy tasks out.
Typical cause: no run-duration limits and no budget gates.
Waiting for "perfect" answer without degrade path
The system tries to wait for an "ideal" result even though the dependency is clearly degraded.
Typical cause: no partial/fallback contract for the user.
How To Detect These Problems
Partial outage is best visible from combined health, runtime, and queue metrics.
| Metric | Partial-outage signal | What to do |
|---|---|---|
| `degraded_dependency_rate` | one dependency often returns timeout/5xx | enable degraded mode and reduce fan-out |
| `tool_2xx_with_high_timeout_rate` | 200s and a high timeout share appear together | add a health threshold, do not rely on 200 alone |
| `retry_attempts_per_run` | suspiciously many retries per run | centralize retries and cap the retry budget |
| `run_duration_p95` | long "hanging" runs | add a fail-fast timeout and stop reasons |
| `queue_backlog` | queue grows under normal traffic | isolate the degraded path and enable fallback |
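One way to compute a signal like `degraded_dependency_rate` is a sliding window of recent results per dependency. A sketch, assuming an illustrative window size and the 0.35 threshold used later on this page:

```python
from collections import deque

class DependencyHealth:
    """Sliding-window error rate for one dependency."""

    def __init__(self, window: int = 20, threshold: float = 0.35):
        self.results: deque[bool] = deque(maxlen=window)  # True = success
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_degraded(self) -> bool:
        # A 200 here and there does not reset the window:
        # the rate stays high until errors actually stop.
        return len(self.results) >= 5 and self.error_rate >= self.threshold
```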
How To Distinguish Partial Outage From Full Outage
Not every degradation is a full outage. The key question: is there a stable execution path, or only random "successes"?
Normal for a full outage if:
- almost all calls fail uniformly (`5xx` or complete unavailability);
- the system quickly switches to fail-fast;
- there is no illusion of "sometimes works".
Dangerous for a partial outage if:
- the same run mixes `200`, `timeout`, and `5xx`;
- the agent repeats calls because it sees rare successes;
- queue and latency rise even without a clear global incident.
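The same distinction can be expressed as a rough heuristic over a recent window of status codes (with timeouts mapped to 408 here); both thresholds are illustrative, not canonical:

```python
# Status codes worth treating as failures, timeouts mapped to 408.
RETRYABLE = {408, 429, 500, 502, 503, 504}

def classify_outage(
    status_codes: list[int],
    full_threshold: float = 0.95,
    partial_threshold: float = 0.35,
) -> str:
    """Label a recent window of calls as healthy, partial, or full outage."""
    if not status_codes:
        return "healthy"
    failures = sum(1 for code in status_codes if code in RETRYABLE)
    failure_rate = failures / len(status_codes)
    if failure_rate >= full_threshold:
        return "full_outage"     # fail fast: no illusion of "sometimes works"
    if failure_rate >= partial_threshold:
        return "partial_outage"  # mixed 200/timeout/5xx: switch to degraded mode
    return "healthy"
```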
How To Stop These Failures
In practice, this means:
- capture a dependency health snapshot at run start;
- on a threshold breach, immediately switch the workflow to degraded mode;
- keep retries in one tool gateway with a strict budget;
- return a partial/fallback answer with an explicit stop reason instead of "infinite" waiting.
Minimal guard for partial outage:
```python
from dataclasses import dataclass
import time

# Status codes worth retrying; anything else is a success or a hard failure.
RETRYABLE = {408, 429, 500, 502, 503, 504}

@dataclass(frozen=True)
class OutageLimits:
    max_retry_per_call: int = 2
    max_retry_total: int = 6
    max_run_seconds: int = 45
    max_tool_calls: int = 14
    degraded_error_threshold: float = 0.35
    min_sample_size: int = 5

class PartialOutageGuard:
    def __init__(self, limits: OutageLimits = OutageLimits()):
        self.limits = limits
        self.started_at = time.time()
        self.tool_calls = 0
        self.retry_count = 0
        self.total_calls = 0
        self.error_calls = 0

    def before_tool_call(self) -> str | None:
        self.tool_calls += 1
        if self.tool_calls > self.limits.max_tool_calls:
            return "partial_outage:tool_call_budget"
        if (time.time() - self.started_at) > self.limits.max_run_seconds:
            return "partial_outage:run_timeout"
        return None

    def on_tool_result(self, status_code: int, attempt: int) -> str | None:
        self.total_calls += 1
        if status_code in RETRYABLE:
            self.error_calls += 1
            error_rate = self.error_calls / max(1, self.total_calls)
            if (
                self.total_calls >= self.limits.min_sample_size
                and error_rate >= self.limits.degraded_error_threshold
            ):
                return "partial_outage:degraded_mode"
            self.retry_count += 1
            if self.retry_count > self.limits.max_retry_total:
                return "partial_outage:retry_budget"
            if attempt >= self.limits.max_retry_per_call:
                return "partial_outage:retry_exhausted"
            return "partial_outage:retry_allowed"
        # Success path: still check the run-level error rate, because
        # rare 200s must not hide an already degraded dependency.
        error_rate = self.error_calls / max(1, self.total_calls)
        if (
            self.total_calls >= self.limits.min_sample_size
            and error_rate >= self.limits.degraded_error_threshold
        ):
            return "partial_outage:degraded_mode"
        return None
```
In this version, `retry_count` tracks all retryable responses inside a run,
while `attempt` is the retry count of one specific call.
This is a baseline guard.
In production, it is usually extended with per-tool health probes,
a circuit breaker, and a dedicated safe-mode path for degraded runs.
`before_tool_call(...)` and `on_tool_result(...)` are called in the tool gateway,
so the degradation decision is centralized instead of duplicated in each layer.
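A gateway loop around such a guard might look like the sketch below; `call_tool_with_guard` and `call_fn` are hypothetical names, and the guard can be any object with the two methods described above:

```python
def call_tool_with_guard(guard, call_fn, max_attempts: int = 3):
    """Run one tool call through the guard: the guard, not the tool code,
    decides whether to retry, degrade, or stop.
    `call_fn() -> int` returns an HTTP-style status code."""
    reason = guard.before_tool_call()
    if reason is not None:
        return None, reason  # call budget or run timeout already exhausted
    for attempt in range(1, max_attempts + 1):
        status = call_fn()
        verdict = guard.on_tool_result(status, attempt)
        if status == 200:
            # A 200 can still flip the run into degraded mode
            # if the overall error rate is already too high.
            return status, verdict
        if verdict != "partial_outage:retry_allowed":
            return None, verdict  # degraded mode or retry budget hit
    return None, "partial_outage:retry_exhausted"
```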
Where This Is Implemented In Architecture
In production, partial-outage control is almost always split across three system layers.
The Tool Execution Layer provides health signals:
error rate, timeout patterns, retry budget, and a circuit breaker.
This is where you see dependency instability even if some calls still return 200.
The Agent Runtime makes run-level decisions: switching to degraded mode, stop reasons, and a controlled finish with fallback. Without this layer, the system keeps waiting for "one more successful call".
Orchestration Topologies define how to isolate a degraded workflow from the rest of the system (bulkheads, queues, priorities). This prevents local degradation from turning into a shared incident.
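Bulkhead-style isolation from the orchestration layer can be sketched as a capped admission check; the class name and capacities are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Bulkhead:
    """Toy bulkhead: degraded runs get a small, capped worker slice,
    so local degradation cannot occupy the whole pool."""
    main_capacity: int = 8
    degraded_capacity: int = 2
    main_active: int = 0
    degraded_active: int = 0

    def try_admit(self, degraded: bool) -> bool:
        if degraded:
            if self.degraded_active < self.degraded_capacity:
                self.degraded_active += 1
                return True
            return False  # shed the degraded run instead of starving healthy ones
        if self.main_active < self.main_capacity:
            self.main_active += 1
            return True
        return False
```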
Self-check
Quick pre-release check: a short sanity check, not a formal audit.
If the basic controls are missing (retry budget, run timeout, degraded mode, fallback contract), close those checklist points before release.
FAQ
Q: Why is partial outage often worse than full outage?
A: Because it is masked as "sometimes works".
The system does not stop and keeps burning time, tokens, and the worker pool.
Q: Should we immediately disable a degraded tool?
A: Not always. Usually you switch to degraded mode: cap retries, reduce fan-out, and move to a partial/fallback path.
Q: Where should retry and degradation decisions be made?
A: In one tool gateway. Otherwise each layer runs its own retries and the partial outage scales quickly.
Q: What should be shown to the user when a dependency degrades?
A: An explicit stop reason, what exactly failed, and a controlled next step: a partial response or a retry after the dependency recovers.
Partial outage almost never looks like a loud crash. It is a quiet degradation where the system still moves but no longer holds quality and pace. That is why production agents need not only retries, but also a strict degraded mode and dependency isolation.
Related Pages
If this issue appears in production, these pages are also useful:
- Why AI agents fail: a general map of failures in production.
- Tool failure: how a local tool error becomes an incident.
- Deadlocks: how waiting states grow during dependency degradation.
- Cascading failures: how a partial outage spreads into other workflows.
- Agent Runtime: where to manage degraded mode, stop reasons, and fallback.
- Tool Execution Layer: where retries, health signals, and the circuit breaker live.