Problem
The request looks standard: check payment and confirm order status.
But traces show something else: in 9 minutes, one run made 29 tool calls
(18 to billing.get_invoice, 11 to payments.verify), and most of them ended in a timeout or a 5xx.
A run in this task class can end up costing ~$2.50 instead of the usual ~$0.12.
Formally, the service is not "dead": some calls still return 200.
But the user never gets a final answer.
The system does not crash.
It just gets stuck between tool errors and retries, slowly accumulating latency and run backlog.
Analogy: imagine a courier arriving at a closed warehouse, calling again, waiting, calling again, and returning to the same door over and over. They are always "in progress", but the order does not move. Tool failure in agents looks exactly the same: actions exist, result does not.
Why this happens
Tool failure is not only about an unstable API.
Usually the core issue is that the runtime has no clear strategy for classifying and handling tool errors.
In production, it usually looks like this:
- an external service returns `timeout`, `5xx`, or an unstable payload;
- the runtime or tool gateway retries without clear error classification;
- non-retryable errors also enter retry loops;
- without a circuit breaker and fallback, the run hangs or burns budget.
The problem is not one random API error. The problem is that the system does not stop the failure wave before it becomes an incident.
This incident class is usually called agent tool failure -
when an agent system cannot operate reliably because of instability
or errors in external tools.
Which failures happen most often
In practice, production teams usually see four tool failure patterns.
Transient failures
The tool occasionally returns 408/429/5xx.
With weak retry control, a short outage becomes a retry storm.
Typical cause: missing backoff+jitter and retry budget.
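The standard antidote to a retry storm is exponential backoff with jitter. A minimal sketch of the idea (the function name and defaults here are illustrative, not from any specific library):

```python
import random

def backoff_with_jitter(attempt: int, base_s: float = 0.5, cap_s: float = 10.0) -> float:
    """Exponential backoff with full jitter: pick a random delay in
    [0, min(cap_s, base_s * 2**attempt)] so synchronized clients spread out
    instead of hammering a recovering dependency in lockstep."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

The cap keeps a long fail streak from producing multi-minute sleeps, and the randomization breaks the synchronization that turns a short outage into a storm.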
Wrong retry classification
401, 403, 404, 409, schema validation errors, or policy denials go into retries,
although they should stop immediately.
Typical cause: retryable and non-retryable are not split in one place.
Tool contract drift
The tool changes response format or error structure. The agent cannot interpret results reliably and starts "asking" the same service again.
Typical cause: no contract versioning and no payload validation in gateway.
Cascading failure
One problematic tool raises whole-system latency: workers are busy waiting, queue grows, other runs slow down too.
Typical cause: missing circuit breaker and fallback for degraded dependencies.
How to detect these problems
Tool failure is visible through combined runtime and gateway metrics.
| Metric | Tool failure signal | What to do |
|---|---|---|
| `tool_error_rate` | sharp increase in 4xx/5xx/timeouts | enable degraded mode and inspect the dependency |
| `retry_attempts_per_call` | too many retries per call | limit the retry budget, add backoff+jitter |
| `non_retryable_retry_rate` | retries on 401/403/404/409/422 | stop the run immediately with an explicit stop reason |
| `circuit_open_rate` | circuit breaker opens frequently | check the tool SLA and fallback scenario |
| `queue_backlog` | queue grows under normal traffic | clear stuck runs and reduce fan-out |
How to distinguish tool failure from agent logic failure
Not every failed run means the agent "thinks badly". The key criterion: where exactly the loop breaks.
Normal if:
- the error is localized to one external tool;
- the stop reason points directly to the dependency (`tool_timeout`, `tool_5xx`, `circuit_open`);
- after fallback, the user still gets a partial but correct result.
Dangerous if:
- agent retries non-retryable errors as retryable;
- there are no clear stop reasons at tool gateway level;
- one tool failure drags the whole workflow down.
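The "where does the loop break" criterion can be encoded directly over stop reasons, so dashboards and triage scripts agree on it. A sketch, assuming the stop-reason strings used in this article:

```python
# Stop-reason prefixes that point at an external dependency, not agent logic.
# The list mirrors the examples in this article; adjust it to your runtime.
DEPENDENCY_PREFIXES = ("tool_timeout", "tool_5xx", "tool_unavailable", "circuit_open")

def is_dependency_failure(stop_reason: str) -> bool:
    """True if the run stopped because of an external tool."""
    return stop_reason.startswith(DEPENDENCY_PREFIXES)
```

Runs outside this set deserve a look at the agent logic itself rather than at the dependency.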
How to stop these failures
In practice, it looks like this:
- classify tool errors as retryable and non-retryable;
- keep the retry policy in one tool gateway (backoff+jitter plus a retry budget);
- use a circuit breaker for failure waves;
- when a tool is unavailable, return a fallback/partial result and a stop reason.
Minimal guard for tool errors:
```python
from dataclasses import dataclass
import time

# Status codes worth retrying vs. those that must stop the run immediately.
RETRYABLE = {408, 429, 500, 502, 503, 504}
NON_RETRYABLE = {400, 401, 403, 404, 409, 422}

@dataclass(frozen=True)
class ToolFailureLimits:
    max_retry: int = 2
    open_circuit_after: int = 3
    circuit_cooldown_s: int = 20

class ToolFailureGuard:
    def __init__(self, limits: ToolFailureLimits = ToolFailureLimits()):
        self.limits = limits
        self.fail_streak = 0
        self.circuit_open_until = 0.0

    def before_call(self) -> str | None:
        """Refuse the call outright while the circuit is open."""
        if time.time() < self.circuit_open_until:
            return "tool_unavailable:circuit_open"
        return None

    def on_result(self, status_code: int, attempt: int) -> str | None:
        """Classify the result: None means success, any string is a verdict."""
        if status_code in NON_RETRYABLE:
            self.fail_streak = 0
            return "tool_failure:non_retryable"
        if status_code in RETRYABLE:
            self.fail_streak += 1
            if self.fail_streak >= self.limits.open_circuit_after:
                self.circuit_open_until = time.time() + self.limits.circuit_cooldown_s
                return "tool_unavailable:circuit_open"
            if attempt >= self.limits.max_retry:
                return "tool_failure:retry_exhausted"
            return "tool_retry:allowed"
        self.fail_streak = 0
        return None
```
This is a baseline guard.
In production, it is usually extended with per-tool limits and exponential backoff with jitter.
`attempt` is usually 1-based (1, 2, 3, ...), and guard state is typically tracked per tool or per run.
Where this is implemented in architecture
In production, tool failure control is almost always split across three system layers.
The Tool Execution Layer is the core control point: argument and payload validation, retry policy, error classification, circuit breaker. If this layer is weak, even a simple API issue quickly turns into a cascade.
The Agent Runtime owns the run lifecycle: stop reasons, timeouts, controlled completion, and the fallback response. This is where it is critical not to continue the run at any cost.
Policy Boundaries define which tools are allowed and when a run must fail closed. This is especially important for write-tools and permission errors.
Checklist
Before shipping an agent to production:
- [ ] retryable/non-retryable errors are explicitly separated;
- [ ] retries are implemented in one gateway, not across multiple layers;
- [ ] `max_retry`, backoff+jitter, and a retry budget are defined;
- [ ] circuit breaker and cooldown are set for every critical tool;
- [ ] stop reasons cover `timeout`, `5xx`, `non_retryable`, `circuit_open`;
- [ ] fallback/partial response is defined before an incident;
- [ ] alerts exist on `tool_error_rate`, `retry_attempts_per_call`, `queue_backlog`;
- [ ] a runbook exists for degraded mode and dependency rollback.
FAQ
Q: Is it enough to just increase timeout for a problematic tool?
A: No. This often only masks the issue and increases latency. You need error classification, retry budget, and circuit breaker.
Q: Where should retries live?
A: In one choke point, usually tool gateway. Retries in multiple layers almost always create amplification.
Q: Which errors are usually non-retryable?
A: 401, 403, 404, 409, 422, schema validation errors, and policy denials. Such runs should usually stop immediately with an explicit stop reason.
Q: What should users see when a tool is unavailable?
A: The stop reason, what has already been checked, and a safe next step: fallback, partial result, or manual escalation.
Tool failure almost never looks like one large outage. More often, it is a series of small failures accumulating into retry loops and queue growth. That is why production agents need not only tools, but also strict execution discipline.
Related pages
If this issue appears in production, it also helps to review:
- Why AI agents fail - general map of production failures.
- Tool spam - how repeated calls turn tool errors into incidents.
- Partial outage - how partial dependency degradation breaks workflow.
- Budget explosion - how retry storms silently inflate costs.
- Agent Runtime - where to control stop reasons and run lifecycle.
- Tool Execution Layer - where to keep retries, validation, and circuit breaker.