The Problem
The request looks simple: check payment status and return a short answer to the customer.
Traces show something else: in 11 minutes one run made 33 calls,
10 returned 200, 14 timed out, and 9 returned 502/503.
For a task of this class, that is around $1.90 instead of the usual $0.14.
The service is formally "alive": some calls succeed, and there is no full outage. But the run queue grows, latency jumps, and users get unstable results.
The system does not crash.
It slowly gets stuck between rare successes and repeated failures.
Analogy: imagine a checkout where the terminal sometimes accepts cards and sometimes hangs. The store is not closed, but the line grows every minute. Partial outage in agent systems works the same way: the infrastructure looks available, but a stable path to an answer no longer exists.
Why This Happens
In production, it usually goes like this:
- a dependency becomes unstable (`timeout`, `5xx`, sometimes `200`);
- retries start in several layers at once;
- each run holds workers longer, and the queue grows;
- other workflows also slow down due to shared resources;
- without fail-fast and a safe mode, the system multiplies spend instead of isolating the failure.
In traces this appears as a mixed pattern: `tool_2xx_rate` is still nonzero,
but `timeout_rate`, `retry_attempts_per_run`, and `queue_backlog` increase at the same time.
The problem is not one timeout: it is that the runtime does not switch the unstable dependency into degraded mode while the failure is still local.
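The multi-layer retry effect can be checked with simple arithmetic: when each layer retries independently, per-layer attempt counts multiply. A minimal sketch (the three-layer counts are illustrative, not taken from the trace above):

```python
def amplification_factor(attempts_per_layer: list[int]) -> int:
    """Worst-case number of calls that reach the dependency when
    every layer retries independently: the product of per-layer attempts."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# HTTP client, gateway, and runtime each allowing 3 attempts:
# one logical request can become 27 calls to the dependency.
print(amplification_factor([3, 3, 3]))  # 27
```

This multiplication is why centralizing retries in one gateway matters: a single budget caps the product instead of letting each layer compound it.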
Most Common Failure Patterns
In production, four partial-outage patterns appear most often.
Intermittent success trap
The tool sometimes returns 200, and this masks the degradation.
The agent keeps pushing the same channel instead of making a controlled switch.
Typical cause: no dependency health threshold at the run level.
Retry amplification across layers
The HTTP client, the gateway, and the runtime each run their own retries. Even a small error spike quickly turns into a wave of extra calls.
Typical cause: the retry policy is not centralized.
Queue starvation by "noisy" runs
Problematic runs hang for a long time, occupy the worker pool, and push healthy tasks out.
Typical cause: no run-duration limits and no budget gates.
Waiting for "perfect" answer without degrade path
The system tries to wait for an "ideal" result even though the dependency is clearly degraded.
Typical cause: no partial/fallback contract for the user.
How To Detect These Problems
Partial outage is best visible from combined health, runtime, and queue metrics.
| Metric | Partial-outage signal | What to do |
|---|---|---|
| `degraded_dependency_rate` | one dependency often returns timeout/5xx | enable degraded mode and reduce fan-out |
| `tool_2xx_with_high_timeout_rate` | 200s and a high timeout share appear together | add a health threshold, do not rely on 200 alone |
| `retry_attempts_per_run` | suspiciously many retries per run | centralize retries and cap the retry budget |
| `run_duration_p95` | long "hanging" runs | add a fail-fast timeout and stop reasons |
| `queue_backlog` | queue grows under normal traffic | isolate the degraded path and enable fallback |
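One way to compute a signal like `degraded_dependency_rate` is a sliding window of recent results per dependency. A sketch, assuming an illustrative window size and the 0.35 threshold used later on this page:

```python
from collections import deque

class DependencyHealth:
    """Sliding-window error rate for one dependency."""

    def __init__(self, window: int = 20, threshold: float = 0.35):
        self.results: deque[bool] = deque(maxlen=window)  # True = success
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_degraded(self) -> bool:
        # A 200 here and there does not reset the window:
        # the rate stays high until errors actually stop.
        return len(self.results) >= 5 and self.error_rate >= self.threshold
```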
How To Distinguish Partial Outage From Full Outage
Not every degradation is a full outage. The key question: is there a stable execution path, or only random "successes"?
Normal for a full outage if:
- almost all calls fail uniformly (`5xx` or complete unavailability);
- the system quickly switches to fail-fast;
- there is no illusion of "sometimes works".
Dangerous for a partial outage if:
- the same run mixes `200`, `timeout`, and `5xx`;
- the agent repeats calls because it sees rare successes;
- queue and latency rise even without a clear global incident.
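The same distinction can be expressed as a rough heuristic over a recent window of status codes (with timeouts mapped to 408 here); both thresholds are illustrative, not canonical:

```python
# Status codes worth treating as failures, timeouts mapped to 408.
RETRYABLE = {408, 429, 500, 502, 503, 504}

def classify_outage(
    status_codes: list[int],
    full_threshold: float = 0.95,
    partial_threshold: float = 0.35,
) -> str:
    """Label a recent window of calls as healthy, partial, or full outage."""
    if not status_codes:
        return "healthy"
    failures = sum(1 for code in status_codes if code in RETRYABLE)
    failure_rate = failures / len(status_codes)
    if failure_rate >= full_threshold:
        return "full_outage"     # fail fast: no illusion of "sometimes works"
    if failure_rate >= partial_threshold:
        return "partial_outage"  # mixed 200/timeout/5xx: switch to degraded mode
    return "healthy"
```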
How To Stop These Failures
In practice, this means:
- capture a dependency health snapshot at run start;
- on a threshold breach, immediately switch the workflow to degraded mode;
- keep retries in one tool gateway with a strict budget;
- return a partial/fallback answer with an explicit stop reason instead of "infinite" waiting.
Minimal guard for partial outage:
```python
from dataclasses import dataclass
import time

# Status codes worth retrying; anything else is a success or a hard failure.
RETRYABLE = {408, 429, 500, 502, 503, 504}

@dataclass(frozen=True)
class OutageLimits:
    max_retry_per_call: int = 2
    max_retry_total: int = 6
    max_run_seconds: int = 45
    max_tool_calls: int = 14
    degraded_error_threshold: float = 0.35
    min_sample_size: int = 5

class PartialOutageGuard:
    def __init__(self, limits: OutageLimits = OutageLimits()):
        self.limits = limits
        self.started_at = time.time()
        self.tool_calls = 0
        self.retry_count = 0
        self.total_calls = 0
        self.error_calls = 0

    def before_tool_call(self) -> str | None:
        self.tool_calls += 1
        if self.tool_calls > self.limits.max_tool_calls:
            return "partial_outage:tool_call_budget"
        if (time.time() - self.started_at) > self.limits.max_run_seconds:
            return "partial_outage:run_timeout"
        return None

    def on_tool_result(self, status_code: int, attempt: int) -> str | None:
        self.total_calls += 1
        if status_code in RETRYABLE:
            self.error_calls += 1
            error_rate = self.error_calls / max(1, self.total_calls)
            if (
                self.total_calls >= self.limits.min_sample_size
                and error_rate >= self.limits.degraded_error_threshold
            ):
                return "partial_outage:degraded_mode"
            self.retry_count += 1
            if self.retry_count > self.limits.max_retry_total:
                return "partial_outage:retry_budget"
            if attempt >= self.limits.max_retry_per_call:
                return "partial_outage:retry_exhausted"
            return "partial_outage:retry_allowed"
        # Success path: still check the run-level error rate, because
        # rare 200s must not hide an already degraded dependency.
        error_rate = self.error_calls / max(1, self.total_calls)
        if (
            self.total_calls >= self.limits.min_sample_size
            and error_rate >= self.limits.degraded_error_threshold
        ):
            return "partial_outage:degraded_mode"
        return None
```
In this version, `retry_count` tracks all retryable responses inside a run,
while `attempt` is the retry count of one specific call.
This is a baseline guard.
In production, it is usually extended with per-tool health probes,
a circuit breaker, and a dedicated safe-mode path for degraded runs.
`before_tool_call(...)` and `on_tool_result(...)` are called in the tool gateway,
so the degradation decision is centralized instead of duplicated in each layer.
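A gateway loop around such a guard might look like the sketch below; `call_tool_with_guard` and `call_fn` are hypothetical names, and the guard can be any object with the two methods described above:

```python
def call_tool_with_guard(guard, call_fn, max_attempts: int = 3):
    """Run one tool call through the guard: the guard, not the tool code,
    decides whether to retry, degrade, or stop.
    `call_fn() -> int` returns an HTTP-style status code."""
    reason = guard.before_tool_call()
    if reason is not None:
        return None, reason  # call budget or run timeout already exhausted
    for attempt in range(1, max_attempts + 1):
        status = call_fn()
        verdict = guard.on_tool_result(status, attempt)
        if status == 200:
            # A 200 can still flip the run into degraded mode
            # if the overall error rate is already too high.
            return status, verdict
        if verdict != "partial_outage:retry_allowed":
            return None, verdict  # degraded mode or retry budget hit
    return None, "partial_outage:retry_exhausted"
```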
Where This Is Implemented In Architecture
In production, partial-outage control is almost always split across three system layers.
The Tool Execution Layer provides health signals:
error rate, timeout patterns, retry budget, and a circuit breaker.
This is where you see dependency instability even if some calls still return 200.
The Agent Runtime makes run-level decisions: switching to degraded mode, stop reasons, and a controlled finish with fallback. Without this layer, the system keeps waiting for "one more successful call".
Orchestration Topologies define how to isolate a degraded workflow from the rest of the system (bulkheads, queues, priorities). This prevents local degradation from turning into a shared incident.
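Bulkhead-style isolation from the orchestration layer can be sketched as a capped admission check; the class name and capacities are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Bulkhead:
    """Toy bulkhead: degraded runs get a small, capped worker slice,
    so local degradation cannot occupy the whole pool."""
    main_capacity: int = 8
    degraded_capacity: int = 2
    main_active: int = 0
    degraded_active: int = 0

    def try_admit(self, degraded: bool) -> bool:
        if degraded:
            if self.degraded_active < self.degraded_capacity:
                self.degraded_active += 1
                return True
            return False  # shed the degraded run instead of starving healthy ones
        if self.main_active < self.main_capacity:
            self.main_active += 1
            return True
        return False
```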
Self-check
Quick pre-release check: a short sanity check, not a formal audit.
If the basic controls are missing (retry budget, run timeout, degraded mode, fallback contract), close those checklist points before release.
FAQ
Q: Why is partial outage often worse than full outage?
A: Because it is masked as "sometimes works".
The system does not stop and keeps burning time, tokens, and the worker pool.
Q: Should we immediately disable a degraded tool?
A: Not always. Usually you switch to degraded mode: cap retries, reduce fan-out, and move to a partial/fallback path.
Q: Where should retry and degradation decisions be made?
A: In one tool gateway. Otherwise each layer runs its own retries and the partial outage scales quickly.
Q: What should be shown to the user when a dependency degrades?
A: An explicit stop reason, what exactly failed, and a controlled next step: a partial response or a retry after the dependency recovers.
Partial outage almost never looks like a loud crash. It is a quiet degradation where the system still moves but no longer holds quality and pace. That is why production agents need not only retries, but also a strict degraded mode and dependency isolation.
Related Pages
If this issue appears in production, these pages are also useful:
- Why AI agents fail: a general map of failures in production.
- Tool failure: how a local tool error becomes an incident.
- Deadlocks: how waiting states grow during dependency degradation.
- Cascading failures: how a partial outage spreads into other workflows.
- Agent Runtime: where to manage degraded mode, stop reasons, and fallback.
- Tool Execution Layer: where retries, health signals, and the circuit breaker live.