Partial Outage Handling (Agent Failure + Degrade Mode + Code)

  • Spot the failure early before the bill climbs.
  • Learn what breaks in production and why.
  • Copy guardrails: budgets, stop reasons, validation.
  • Know when this isn’t the real root cause.
Detection signals
  • Tool calls per run spike (or the same call repeats with an identical args hash).
  • Spend or tokens per request climbs without better outputs.
  • Retries shift from rare to constant (429/5xx).
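The "repeats with the same args hash" signal can be computed directly from a tool-call log. A minimal sketch, with illustrative function names and a hypothetical call-log shape:

```python
import hashlib
import json
from collections import Counter


def args_hash(tool: str, args: dict) -> str:
    """Stable hash of a tool call, so identical retries collapse to one key."""
    payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def repeated_calls(calls: list[tuple[str, dict]], threshold: int = 3) -> list[str]:
    """Return hashes of calls seen at least `threshold` times in one run."""
    counts = Counter(args_hash(tool, args) for tool, args in calls)
    return [h for h, n in counts.items() if n >= threshold]
```

Storing the hash rather than the raw args also keeps sensitive arguments out of logs while still letting you detect loops.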
Some tools are down, some are up. Agents that keep trying will thrash and burn budgets. Here’s how to degrade safely with partial results and clear stop reasons.
Quick take

  • Take a tool health snapshot at run start (breaker state + recent errors).
  • If a critical dependency is degraded, disable it for the run and switch to degrade mode.
  • Return partial results + explicit stop reason (don’t spin until timeout).
  • Budgets still apply (time/tool calls/spend) — outages amplify loops.

Problem-first intro

It’s not a full outage. It’s worse.

One tool is flaky:

  • sometimes 200
  • sometimes timeout
  • sometimes 502

Your agent keeps trying to “finish the task”. Users keep retrying because they get timeouts. Budgets keep burning because every retry is a new run.

Partial outages are where you learn whether your agent is an engineer or a gambler.

Why this fails in production

Partial outages are hard because success is intermittent. That tempts loops.

1) The agent treats intermittent success as “keep trying”

LLMs are optimistic. If they get one partial result, they’ll often keep going to “complete it”.

That’s fine in a notebook. In prod it’s runaway spend.

2) No concept of tool health

If the agent doesn’t know “tool X is degraded”, it will:

  • keep calling it
  • retry it
  • replan around it and call it again

You need a shared health signal:

  • circuit breaker state
  • recent error rate
  • latency spikes
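A shared health signal can start as small as a per-tool circuit breaker. A minimal sketch that tracks only consecutive failures (a real breaker would also watch error rate and latency; the class and its thresholds are illustrative):

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; half-opens after
    `cooldown_s`. Tracks consecutive failures only, for brevity."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def is_degraded(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cooldown, report healthy so one probe call gets through.
        return (time.monotonic() - self.opened_at) < self.cooldown_s
```

One breaker instance per external tool, shared across runs, is usually enough to feed the health snapshot described below.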

3) No safe-mode behavior

When a tool is degraded, you need a plan that doesn’t depend on it:

  • use cached data
  • return partial results
  • stop with a reason and let the user decide

4) “All or nothing” outputs force bad behavior

If your API contract is “always return the full answer”, your agent will thrash during partial outages.

Better contract:

  • return partial results + confidence + stop reason
  • optionally: an async continuation
Diagram: partial outage decision (normal vs degrade mode)

Implementation example (real code)

This pattern uses a “health snapshot” taken at the start of a run. If a critical tool is degraded, we:

  • disable it for the run
  • switch to safe-mode behavior
  • return partial results with an explicit stop reason
PYTHON
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Health:
    degraded_tools: set[str]


def get_degraded_tools() -> set[str]:
    # Stub: in real code, read breaker states + recent error rates
    # from your tool gateway or metrics store.
    return set()


def snapshot_health() -> Health:
    return Health(degraded_tools=get_degraded_tools())


def safe_tools_for_run(health: Health) -> set[str]:
    allow = {"search.read", "kb.read", "http.get"}
    # During outages: be conservative.
    for t in health.degraded_tools:
        allow.discard(t)
    return allow


def run(task: str) -> dict[str, Any]:
    health = snapshot_health()
    allow = safe_tools_for_run(health)

    if "kb.read" not in allow:
        return {
            "status": "degraded",
            "reason": "kb.read degraded",
            "partial": "I can’t reliably read the KB right now. Here’s what I can do without it…",
        }

    # Normal loop runs here with the tool gateway allowlist = allow.
    return agent_loop(task, allow=allow)  # (pseudo: your agent loop)
JAVASCRIPT
export function snapshotHealth() {
  // Real code: breaker states + recent error rates.
  return { degradedTools: new Set(getDegradedTools()) }; // (pseudo)
}

export function safeToolsForRun(health) {
  const allow = new Set(["search.read", "kb.read", "http.get"]);
  for (const t of health.degradedTools) allow.delete(t);
  return allow;
}

export function run(task) {
  const health = snapshotHealth();
  const allow = safeToolsForRun(health);

  if (!allow.has("kb.read")) {
    return {
      status: "degraded",
      reason: "kb.read degraded",
      partial: "I can’t reliably read the KB right now. Here’s what I can do without it…",
    };
  }

  return agentLoop(task, { allow }); // (pseudo)
}

This is intentionally conservative. During partial outages, your goal is not “succeed at all costs”. Your goal is “don’t turn a partial outage into a full outage”.

Example incident (numbers are illustrative)

Example: an agent that answered support questions using kb.read.

The KB service degraded (p95 latency from ~300ms → 9s, intermittent timeouts). Our agent kept trying because sometimes it worked.

Impact:

  • average run time: 8s → 52s
  • client retries doubled traffic
  • on-call got paged for “agent timeouts”, not for the real KB issue
  • spend increased ~$180/day just on retries + longer prompts

Fix:

  1. health snapshot + degrade mode
  2. fail fast after breaker opens
  3. return partial results + clear stop reason
  4. a “retry later” hint instead of silent timeouts
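Steps 2-4 of that fix can be sketched as a single wrapper: fail fast while the breaker is open, and always attach a "retry later" hint instead of timing out silently. The breaker interface (`is_degraded`, `record_failure`, `record_success`) and the field names are assumptions:

```python
from typing import Any, Callable


def call_with_breaker(breaker, fn: Callable[[], Any],
                      *, retry_after_s: int = 120) -> dict[str, Any]:
    """Wrap one tool call with fail-fast + an explicit retry-later hint."""
    if breaker.is_degraded():
        # Fail fast: don't even attempt the call while the breaker is open.
        return {"status": "degraded", "stop_reason": "breaker open",
                "retry_after_s": retry_after_s}
    try:
        result = fn()
    except Exception as exc:
        breaker.record_failure()
        return {"status": "error", "stop_reason": str(exc),
                "retry_after_s": retry_after_s}
    breaker.record_success()
    return {"status": "ok", "result": result}
```

The key property: a run against a degraded tool returns in milliseconds with a stop reason, instead of holding a connection open until the client times out.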

Partial outages are where we learned: user-visible stop reasons are a feature.

Trade-offs

  • Degrade mode answers are less complete.
  • Failing fast reduces success rate in the moment.
  • Health signals can be wrong (false positives); tolerating that is still better than thrashing.

When NOT to use

  • If you need strict completeness, run async and report progress instead of looping synchronously.
  • If you can’t define partial output semantics, you’ll be forced into timeouts (bad).
  • If you don’t have tool health signals, start with budgets and breaker defaults.

Copy-paste checklist

  • [ ] Tool health snapshot at run start
  • [ ] Degrade mode policy (tools disabled, read-only, cached)
  • [ ] Fail fast when breaker is open
  • [ ] Return partial results + explicit stop reason
  • [ ] Budget caps (time/tool calls/spend) still apply
  • [ ] Alerting on degraded runs vs normal runs

Safe default config snippet (JSON/YAML)

YAML
degrade_mode:
  enabled: true
  disable_tools_when_degraded: true
  allow_partial_results: true
health:
  breaker_open_means_degraded: true
budgets:
  max_seconds: 60
  max_tool_calls: 12
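A sketch of how those budget caps could be enforced at runtime, using a plain dict that mirrors the YAML above (the loader and key names are assumptions):

```python
import time


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its configured caps."""


class RunBudget:
    """Enforces the `budgets` section of the config against a live run."""

    def __init__(self, config: dict):
        self.max_seconds = config["budgets"]["max_seconds"]
        self.max_tool_calls = config["budgets"]["max_tool_calls"]
        self.started = time.monotonic()
        self.tool_calls = 0

    def check_tool_call(self) -> None:
        # Call this before every tool invocation in the agent loop.
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("max_tool_calls")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("max_seconds")
```

Catching `BudgetExceeded` at the top of the loop is a natural place to emit the degrade-mode partial result with a stop reason, so the budget and the outage handling share one exit path.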

FAQ

Why not just keep retrying until it works?
Because intermittent failures + retries amplify outages. Your agent becomes a load generator.
What should I return in degrade mode?
Partial results, cached data, or a clear ‘can’t do this right now’ with a stop reason.
Do I need per-tool health?
Yes for external dependencies. Start with breaker state and recent error rates.
How do users handle partial results?
Better than timeouts. Give them a stop reason and an option to retry later.

⏱️ 6 min read · Updated Mar 2026 · Difficulty: ★★☆
Implement in OnceOnly
Guardrails for loops, retries, and spend escalation.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
controls:
  loop_detection:
    enabled: true
    dedupe_by: [tool, args_hash]
  retries:
    max: 2
    backoff_ms: [200, 800]
stop_reasons:
  enabled: true
logging:
  tool_calls: { enabled: true, store_args: false, store_args_hash: true }
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Kill switch & incident stop
  • Audit logs & traceability
  • Idempotency & dedupe
  • Tool permissions (allowlist / blocklist)
Integrated mention: OnceOnly is a control layer for production agent systems.
Example policy (concept)
# Example (Python — conceptual)
policy = {
  "budgets": {"steps": 20, "seconds": 60, "usd": 1.0},
  "controls": {"kill_switch": True, "audit": True},
}
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.