Partial Outage: When Part of the Agent System Fails

Partial outages happen when only part of an agent system stops working while the rest remains available. Learn how this breaks pipelines and user flows.
On this page
  1. The Problem
  2. Why This Happens
  3. Most Common Failure Patterns
  4. Intermittent success trap
  5. Retry amplification across layers
  6. Queue starvation by "noisy" runs
  7. Waiting for "perfect" answer without degrade path
  8. How To Detect These Problems
  9. How To Distinguish Partial Outage From Full Outage
  10. How To Stop These Failures
  11. Where This Is Implemented In Architecture
  12. Self-check
  13. FAQ
  14. Related Pages

The Problem

The request looks simple: check payment status and return a short answer to the customer.

Traces show something else: in 11 minutes, one run made 33 calls: 10 returned 200, 14 timed out, and 9 returned 502/503. A task of this class can end up costing around $1.90 instead of the usual $0.14.

The service is formally "alive": some calls succeed, and there is no full outage. But the run queue grows, latency jumps, and users get unstable results.

The system does not crash.

It slowly gets stuck between rare successes and repeated failures.

Analogy: imagine a checkout where the terminal sometimes accepts cards and sometimes hangs. The store is not closed, but the line grows every minute. Partial outage in agent systems works the same way: the infrastructure looks available, but a stable path to an answer no longer exists.
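The arithmetic behind the example trace is easy to sanity-check. The counts and dollar figures are the illustrative ones from this section, not real tariffs:

```python
# Illustrative trace from the example above: one run, 33 calls in 11 minutes.
calls = {"200": 10, "timeout": 14, "5xx": 9}

total = sum(calls.values())
success_rate = calls["200"] / total
wasted_share = 1 - success_rate

# The service looks "alive" (about 30% of calls succeed),
# yet roughly 70% of the run's spend buys no usable result.
print(f"calls={total} success={success_rate:.0%} wasted={wasted_share:.0%}")
```

This is exactly why partial outage is deceptive: the success rate is nonzero, so naive health checks stay green.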

Why This Happens

In production, it usually goes like this:

  1. a dependency becomes unstable (timeouts, 5xx, occasional 200s);
  2. retries start in several layers at once;
  3. runs hold workers longer, and the queue grows;
  4. other workflows slow down too because of shared resources;
  5. without fail-fast and a safe mode, the system multiplies spend instead of isolating the failure.

In traces this appears as a mixed pattern: tool_2xx_rate is still nonzero, but timeout_rate, retry_attempts_per_run, and queue_backlog all increase at the same time.

The problem is not a single timeout.

The runtime does not switch the unstable dependency into degraded mode while the failure is still local.
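The mixed pattern above can be turned into a simple detector. The thresholds here are illustrative assumptions, not universal defaults:

```python
def looks_like_partial_outage(
    tool_2xx_rate: float,
    timeout_rate: float,
    retry_attempts_per_run: float,
    queue_backlog_growth: float,
) -> bool:
    """Flag the mixed pattern: successes still happen, but timeouts,
    retries, and backlog all climb at the same time.

    All four thresholds are assumptions for this sketch; in production
    they would be tuned per dependency.
    """
    return (
        tool_2xx_rate > 0.1             # not a full outage: some calls succeed
        and timeout_rate > 0.2          # but timeouts are no longer rare
        and retry_attempts_per_run > 3  # retries are piling up
        and queue_backlog_growth > 0    # and the queue keeps growing
    )
```

Note that a zero 2xx rate deliberately does not match: that is a full outage, which calls for fail-fast rather than degraded mode.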

Most Common Failure Patterns

In production, four partial-outage patterns appear most often.

Intermittent success trap

A tool sometimes returns 200, and this masks the degradation. The agent keeps pushing the same channel instead of making a controlled switch.

Typical cause: no dependency "health" threshold at the run level.
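A run-level health threshold can be as small as a sliding window over recent outcomes. This is a sketch with assumed window and threshold values:

```python
from collections import deque


class DependencyHealth:
    """Sliding-window health check for one dependency (illustrative sketch).

    Unlike trusting the latest 200, this looks at the last N outcomes,
    so intermittent successes cannot mask sustained degradation.
    Window size and error-rate cutoff are assumptions.
    """

    def __init__(self, window: int = 10, max_error_rate: float = 0.4):
        self.outcomes = deque(maxlen=window)
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def healthy(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return True  # not enough samples yet: assume healthy
        errors = self.outcomes.count(False)
        return errors / len(self.outcomes) < self.max_error_rate
```

With this in place, a 200 that arrives after a burst of timeouts no longer resets the picture: the window still shows an unhealthy dependency.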

Retry amplification across layers

The HTTP client, the gateway, and the runtime each run their own retries. Even a small error spike quickly turns into a wave of extra calls.

Typical cause: retry policy is not centralized.
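To see why uncentralized retries amplify so fast, multiply the per-layer attempt counts. The retry settings here are assumptions for illustration:

```python
# Each layer retries independently, so worst-case attempts
# multiply across layers instead of adding up.
http_client_attempts = 3  # 1 call + 2 client-side retries
gateway_attempts = 2      # 1 call + 1 gateway retry
runtime_attempts = 3      # 1 step + 2 runtime re-runs

worst_case_backend_calls = (
    http_client_attempts * gateway_attempts * runtime_attempts
)
print(worst_case_backend_calls)  # 18 backend calls for one logical request
```

Centralizing retries in one layer turns this product back into a single capped number.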

Queue starvation by "noisy" runs

Problematic runs hang for a long time, occupy the worker pool, and push healthy tasks out.

Typical cause: no run duration limits and no budget gates.

Waiting for "perfect" answer without degrade path

The system tries to wait for an "ideal" result even though the dependency is clearly degraded.

Typical cause: no partial/fallback contract for the user.

How To Detect These Problems

Partial outage is best visible from combined health, runtime, and queue metrics.

| Metric | Partial-outage signal | What to do |
| --- | --- | --- |
| degraded_dependency_rate | one dependency often returns timeout/5xx | enable degraded mode and reduce fan-out |
| tool_2xx_with_high_timeout_rate | 200s and a high timeout share appear together | add a health threshold; do not rely on 200 alone |
| retry_attempts_per_run | suspiciously many retries per run | centralize retries and cap the retry budget |
| run_duration_p95 | long "hanging" runs | add a fail-fast timeout and stop reasons |
| queue_backlog | the queue grows under normal traffic | isolate the degraded path and enable fallback |
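The per-run signals in the table can be folded out of raw tool-call events. The event shape here is an assumption for the sketch, with status 408 standing in for a client-observed timeout:

```python
def run_metrics(events: list[dict]) -> dict:
    """Fold raw tool-call events into per-run partial-outage signals.

    Assumed event shape: {"status": 200, "attempt": 1, "duration_s": 0.4}.
    """
    total = len(events) or 1
    ok = sum(1 for e in events if 200 <= e["status"] < 300)
    timeouts = sum(1 for e in events if e["status"] == 408)
    retries = sum(1 for e in events if e["attempt"] > 1)
    return {
        "tool_2xx_rate": ok / total,
        "timeout_rate": timeouts / total,
        "retry_attempts_per_run": retries,
        "run_duration_s": round(sum(e["duration_s"] for e in events), 3),
    }
```

Aggregating these per-run values across runs (rates, p95 duration, backlog growth) gives the dashboard-level metrics from the table.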

How To Distinguish Partial Outage From Full Outage

Not every degradation is a full outage. The key question: is there a stable execution path, or only random "successes"?

It behaves like a full outage if:

  • almost all calls fail uniformly (5xx or complete unavailability);
  • the system quickly switches to fail-fast;
  • there is no illusion of "sometimes works".

It is a dangerous partial outage if:

  • the same run mixes 200, timeout, and 5xx;
  • the agent repeats calls because it sees rare successes;
  • queue depth and latency grow even without a clear global incident.
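The distinction can be sketched as a rough classifier over per-run status counts; the 0.8 cutoff is an illustrative assumption:

```python
def classify_outage(ok: int, timeout: int, error_5xx: int) -> str:
    """Rough classifier for the distinction above (thresholds are assumptions)."""
    total = ok + timeout + error_5xx
    if total == 0:
        return "no_traffic"
    success_rate = ok / total
    if success_rate == 0:
        return "full_outage"     # uniform failure: switch to fail-fast
    if success_rate < 0.8:
        return "partial_outage"  # mixed 200/timeout/5xx: the dangerous zone
    return "healthy"
```

Running it on the trace from the opening example (10 / 14 / 9) lands squarely in the partial-outage zone.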

How To Stop These Failures

In practice, this means:

  1. lock a dependency health snapshot at run start;
  2. on a threshold breach, immediately switch the workflow to degraded mode;
  3. keep retries in one tool gateway with a strict budget;
  4. return a partial/fallback answer with an explicit stop reason instead of "infinite" waiting.

Minimal guard for partial outage:

PYTHON
from dataclasses import dataclass
import time


RETRYABLE = {408, 429, 500, 502, 503, 504}


@dataclass(frozen=True)
class OutageLimits:
    max_retry_per_call: int = 2
    max_retry_total: int = 6
    max_run_seconds: int = 45
    max_tool_calls: int = 14
    degraded_error_threshold: float = 0.35
    min_sample_size: int = 5


class PartialOutageGuard:
    def __init__(self, limits: OutageLimits = OutageLimits()):
        self.limits = limits
        self.started_at = time.time()
        self.tool_calls = 0
        self.retry_count = 0
        self.total_calls = 0
        self.error_calls = 0

    def before_tool_call(self) -> str | None:
        self.tool_calls += 1
        if self.tool_calls > self.limits.max_tool_calls:
            return "partial_outage:tool_call_budget"
        if (time.time() - self.started_at) > self.limits.max_run_seconds:
            return "partial_outage:run_timeout"
        return None

    def on_tool_result(self, status_code: int, attempt: int) -> str | None:
        self.total_calls += 1

        if status_code in RETRYABLE:
            self.error_calls += 1
            error_rate = self.error_calls / max(1, self.total_calls)
            if (
                self.total_calls >= self.limits.min_sample_size
                and error_rate >= self.limits.degraded_error_threshold
            ):
                return "partial_outage:degraded_mode"
            self.retry_count += 1
            if self.retry_count > self.limits.max_retry_total:
                return "partial_outage:retry_budget"
            if attempt >= self.limits.max_retry_per_call:
                return "partial_outage:retry_exhausted"
            return "partial_outage:retry_allowed"

        error_rate = self.error_calls / max(1, self.total_calls)
        if (
            self.total_calls >= self.limits.min_sample_size
            and error_rate >= self.limits.degraded_error_threshold
        ):
            return "partial_outage:degraded_mode"

        return None

In this version, retry_count tracks all retryable responses inside the run, while attempt is the retry counter of one specific call.

This is a baseline guard. In production it is usually extended with per-tool health probes, a circuit breaker, and a dedicated safe-mode path for degraded runs. before_tool_call(...) and on_tool_result(...) are called in the tool gateway, so the degradation decision is centralized instead of being duplicated in each layer.

Where This Is Implemented In Architecture

In production, partial-outage control is almost always split across three system layers.

Tool Execution Layer provides health signals: error rate, timeout patterns, retry budget, and circuit breaker. This is where you see dependency instability even if some calls still return 200.

Agent Runtime makes run decisions: switching to degraded mode, stop reasons, and a controlled finish with fallback. Without this layer, the system keeps waiting for "one more successful call".

Orchestration Topologies defines how to isolate a degraded workflow from the rest of the system (bulkheads, queues, priorities). This prevents local degradation from turning into a shared incident.
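A bulkhead at this layer can be as simple as a bounded semaphore around degraded runs. The slot count and reject marker here are assumptions for the sketch:

```python
import threading

# Illustrative bulkhead: the degraded workflow may hold at most 2 of the
# shared workers; extra runs are rejected fast instead of queueing.
DEGRADED_SLOTS = threading.BoundedSemaphore(2)


def run_in_bulkhead(task):
    """Run a degraded-workflow task only if a bulkhead slot is free."""
    if not DEGRADED_SLOTS.acquire(blocking=False):
        return "rejected:bulkhead_full"  # fail fast, keep the pool healthy
    try:
        return task()
    finally:
        DEGRADED_SLOTS.release()
```

Healthy workflows never touch this semaphore, so a noisy dependency cannot push them out of the worker pool.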

Self-check

Quick pre-release check: a short sanity list, not a formal audit.

  • retries are centralized in one tool gateway;
  • the retry budget per run is capped;
  • every run has a duration limit and a fail-fast timeout;
  • dependency health is judged by a threshold, not by the latest 200;
  • a threshold breach switches the workflow to degraded mode;
  • a partial/fallback answer contract exists for the user;
  • every early stop carries an explicit stop reason;
  • degraded workflows are isolated from the shared worker pool.

FAQ

Q: Why is partial outage often worse than full outage?
A: Because it is masked as "sometimes works". The system does not stop and keeps burning time, tokens, and the worker pool.

Q: Should we immediately disable a degraded tool?
A: Not always. Usually you switch to degraded mode: cap retries, reduce fan-out, and move to the partial/fallback path.

Q: Where should retries and degradation decisions be made?
A: In one tool gateway. Otherwise each layer runs its own retries, and the partial outage scales quickly.

Q: What should be shown to user when dependency degrades?
A: An explicit stop reason, what exactly failed, and a controlled next step: a partial response or a retry after the dependency recovers.


Partial outage almost never looks like a loud crash. It is a quiet degradation in which the system still moves but no longer holds quality and pace. That is why production agents need not only retries, but a strict degradation mode and dependency isolation.

If this issue appears in production, these pages are also useful:

⏱️ 7 min read • Updated March 12, 2026 • Difficulty: ★★☆
Implement in OnceOnly
Guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.