Idea in 30 seconds
Rate limiting is a runtime control that limits the frequency of an agent's external calls, so request spikes and retry storms do not escalate in production.
When you need it: when an agent frequently calls model/tool APIs, has retry logic, and runs under peak load.
Problem
Without rate limiting, one unstable service quickly creates a chain: retry → call → retry again.
In demo environments this is barely visible. In production, the same behavior creates waves of 429/5xx responses, queue growth, and rising latency.
The worst part is that the incident self-amplifies:
- one overly active user consumes quotas
- one tenant "chokes" others
- a global spike breaks dependencies for everyone at once
And every minute without call-frequency control adds more retries, queues, and latency, until the system effectively DDoSes itself.
Analogy: it is like a metered ramp onto a highway. Without flow control, even a good road becomes a traffic jam within minutes.
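The amplification is easy to quantify. A minimal sketch of the arithmetic (the function name and numbers are illustrative, not from any real system):

```python
# Hypothetical sketch: how naive retries amplify load during an outage.
# With max_retries attempts per failed call and no backoff, each client
# request can hit a failing dependency (1 + max_retries) times.

def amplified_calls(requests_per_sec: float, max_retries: int,
                    failure_rate: float) -> float:
    """Approximate outbound call rate when every failed call is retried.

    Each request makes 1 initial call; each failure triggers up to
    max_retries follow-up calls (worst case: the retries also fail).
    """
    retries = requests_per_sec * failure_rate * max_retries
    return requests_per_sec + retries

# 100 req/s, 3 retries, dependency failing 80% of the time:
# the dependency now sees 340 calls per second instead of 100.
print(amplified_calls(100, 3, 0.8))  # 340.0
```

The struggling dependency receives more than triple its normal load exactly when it can least afford it, which is what makes the incident self-amplifying.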
Solution
The solution is to add a centralized rate-limit policy layer to the runtime and tool gateway.
Every external agent call is checked against per_user, per_tenant, global, and burst_tokens limits.
Policy returns a technical decision: allow, or stop with an explicit reason:
- `rate_limited_user`
- `rate_limited_tenant`
- `rate_limited_global`
- `burst_limited`
When stop is returned, runtime sends retry_after_ms to the client and does not execute the call.
This is a separate system layer, not part of prompt or model logic.
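The decision contract described above can be sketched as a small data model. The names (`Decision`, `stop_decision`) are assumptions for illustration, not a real library API:

```python
# Minimal sketch of the policy decision contract: a single outcome/reason
# model shared by all stop reasons. Names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    outcome: str                       # "allow" or "stop"
    reason: Optional[str] = None       # e.g. "rate_limited_tenant"
    scope: Optional[str] = None        # "user" | "tenant" | "global" | "burst"
    retry_after_ms: Optional[int] = None

ALLOW = Decision(outcome="allow")

def stop_decision(reason: str, scope: str, retry_after_ms: int) -> Decision:
    return Decision("stop", reason, scope, retry_after_ms)

d = stop_decision("rate_limited_tenant", "tenant", 1200)
print(d.outcome, d.reason, d.retry_after_ms)  # stop rate_limited_tenant 1200
```

Keeping the decision as data, separate from prompt and model logic, is what lets the runtime, audit log, and alerting all consume the same record.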
Rate limiting ≠ step limits
These are different control levels:
- Rate limiting limits frequency of external calls.
- Step limits limit runtime-loop length and behavior.
One without the other is insufficient:
- without rate limiting, external APIs fail under spikes and retry storms
- without step limits, a run may still spin too long even at moderate call frequency
Example:
- rate limiting: no more than `per_user=6` calls per 10 seconds
- step limits: `max_steps=18`, `max_repeat_action=3`
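A minimal sketch of the two levels side by side, using the limit values from the example above (all helper names are assumptions; real systems enforce these in separate layers):

```python
# Illustrative sketch of the two control levels: a sliding-window check
# for call frequency, and a counter-based check for runtime-loop length.
from collections import Counter, deque

PER_USER_CALLS = 6          # rate limiting: 6 calls per 10 seconds
PER_USER_WINDOW_S = 10.0
MAX_STEPS = 18              # step limits: total loop length
MAX_REPEAT_ACTION = 3       # step limits: identical-action repeats

def rate_limit_check(call_times: deque, now: float) -> str:
    """Sliding window: drop timestamps outside the window, then count."""
    while call_times and now - call_times[0] > PER_USER_WINDOW_S:
        call_times.popleft()
    if len(call_times) >= PER_USER_CALLS:
        return "stop:rate_limited_user"
    call_times.append(now)
    return "allow"

def step_limit_check(steps_taken: int, action_counts: Counter,
                     action_key: str) -> str:
    """Runtime-loop guard: total steps and per-action repeats."""
    if steps_taken >= MAX_STEPS:
        return "stop:max_steps"
    if action_counts[action_key] >= MAX_REPEAT_ACTION:
        return "stop:max_repeat_action"
    return "allow"
```

A run can pass one check and fail the other, which is exactly why one without the other is insufficient.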
Rate-limiting control components
These components work together on every external agent call.
| Component | What it controls | Key mechanics | Why |
|---|---|---|---|
| Per-user limit | Behavior of one user | `per_user` quota, sliding window | Prevents one user from consuming all capacity |
| Per-tenant limit | Load of one tenant | `per_tenant` quota, tenant-scoped keys | Isolates spikes between customers |
| Global limit | Total system load | global cap, shared limiter | Protects external dependencies from mass spikes |
| Burst control | Short peak spikes | token bucket, refill rate | Absorbs immediate jumps without a full system stop |
| Rate-limit observability | Visibility into policy decisions | audit logs, alerts on stop spikes | Does not limit calls directly, but helps identify spike sources fast |
Example alert:
Slack: Support-Agent hit rate_limited_tenant. retry_after=1200ms, tenant=t_42.
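The burst-control row above can be sketched as a token bucket with a refill rate. In production this state usually lives in shared storage (e.g. Redis) with atomic updates; this in-process version, with illustrative names, only shows the mechanics:

```python
# Sketch of burst control: a token bucket with capacity and refill rate.
# The lock makes the check atomic within one process; a real deployment
# needs atomicity across instances (e.g. Redis with a Lua script).
import threading

class TokenBucket:
    def __init__(self, capacity: int, refill_per_second: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_second = refill_per_second
        self.last_refill = 0.0
        self._lock = threading.Lock()

    def try_acquire(self, now: float) -> bool:
        """Take one token if available; refill based on elapsed time."""
        with self._lock:
            elapsed = max(0.0, now - self.last_refill)
            self.tokens = min(self.capacity,
                              self.tokens + elapsed * self.refill_per_second)
            self.last_refill = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

bucket = TokenBucket(capacity=5, refill_per_second=2)
# A burst of 7 calls at t=0: the first 5 are absorbed, the rest rejected.
results = [bucket.try_acquire(now=0.0) for _ in range(7)]
print(results)  # [True, True, True, True, True, False, False]
```

This is the "absorbs immediate jumps without a full system stop" behavior: the burst drains the bucket, and the refill rate restores capacity over time.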
How it looks in architecture
The rate-limit policy layer sits between the runtime and external model/tool APIs.
Every decision (allow or stop) is recorded in the audit log.
Every external agent call passes through this flow before execution: runtime does not execute calls directly, it first asks the policy layer for a decision.
Flow summary:
- Runtime forms an external agent call.
- Policy checks `per_user`, `per_tenant`, `global`, and `burst_tokens`.
- `allow` → the call is executed.
- `stop` → `retry_after_ms` and a partial response are returned.
- Both decisions are written to the audit log.
Example
A support agent processes many requests at once and retries crm.search multiple times.
With rate limiting:
- `per_user = 6 / 10s`
- `per_tenant = 120 / min`
- `global = 50 / s`
- `burst_tokens = 5`

→ the spike is stopped at policy level before dependencies and queues fail.
Rate limiting stops the incident directly before external calls, not after a wave of 429.
In code it looks like this
The simplified scheme below shows the main flow.
Critical point: rate-limit check must be O(1) and atomic (usually via Redis/Lua or equivalent), otherwise it becomes a bottleneck under spikes.
After stop(...), runtime typically returns a partial response to the client with explicit reason and retry_after_ms.
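A hedged sketch of what that stop response might look like to the client. The field names here are assumptions, not a fixed protocol:

```python
# Illustrative shape of the partial response returned on stop(...):
# explicit reason plus retry_after_ms, so the client can back off
# deliberately instead of retrying blindly.
def stop(reason: str, retry_after_ms: int, partial=None) -> dict:
    return {
        "status": "stopped",
        "reason": reason,             # e.g. "rate_limited_tenant"
        "retry_after_ms": retry_after_ms,
        "partial_response": partial,  # whatever the run produced so far
    }

resp = stop("rate_limited_tenant", 1200, partial="Found 3 of ~12 tickets.")
print(resp["reason"], resp["retry_after_ms"])  # rate_limited_tenant 1200
```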
Example rate-limit config:
```yaml
rate_limits:
  per_user_10s: 6
  per_tenant_min: 120
  global_rps: 50
  burst_tokens: 5
  refill_per_second: 2
```
```python
def run_step(state, planner, executor, rate_limit, audit, alerts, clock, run_id):
    action = planner.next(state)
    action_key = make_action_key(action.name, action.args)

    if not action.is_external_call():
        # execute_local: conditional helper for local actions without external API.
        # Decision.allow: conditional helper to keep a single outcome/reason model.
        local_result = execute_local(action)
        local_decision = Decision.allow(reason=None)
        audit.log(
            run_id,
            decision=local_decision.outcome,
            reason=local_decision.reason,
            scope="local",
            action=action.name,
            action_key=action_key,
            result=local_result.status,
        )
        return local_result

    # External call: ask the policy layer before executing anything.
    decision = rate_limit.check(
        user_id=state.user_id,
        tenant_id=state.tenant_id,
        action=action.name,
        now_ms=clock.now_ms(),
    )

    if decision.outcome == "stop":
        audit.log(
            run_id,
            decision=decision.outcome,
            reason=decision.reason,
            scope=decision.scope,
            retry_after_ms=decision.retry_after_ms,
            action=action.name,
            action_key=action_key,
        )
        alerts.notify_if_needed(run_id, decision.reason, scope=decision.scope)
        return stop(decision.reason, retry_after_ms=decision.retry_after_ms)

    # Allowed: execute the external call and record decision plus result.
    result = executor.execute(action)
    audit.log(
        run_id,
        decision=decision.outcome,
        reason=decision.reason,
        scope=decision.scope,
        action=action.name,
        action_key=action_key,
        result=result.status,
    )
    return result
```
How it looks during execution
Scenario 1: stopped by tenant limit
- Runtime forms external call `crm.search`.
- Policy sees the `per_tenant` quota exceeded.
- Decision: `stop` (reason=`rate_limited_tenant`).
- Runtime returns `retry_after_ms`.
- The call is not executed; the event is written to the audit log.
Scenario 2: burst spike
- Several runs create a short call spike at the same time.
- Policy exhausts `burst_tokens`.
- Decision: `stop` (reason=`burst_limited`).
- Some calls are rejected with `retry_after_ms`.
- The system remains stable without cascade failure.
Scenario 3: normal execution
- Runtime forms an external call.
- Policy checks limits: all within bounds.
- Decision: `allow`.
- The call is executed.
- Decision and result are written to the audit log.
Common mistakes
- setting only a global limit without `per_user`/`per_tenant` isolation
- not returning `retry_after` on `stop`
- retrying without backoff and jitter
- checking rate limits in only one layer (only runtime or only gateway)
- not logging stop decisions (`reason`, `scope`, `retry_after_ms`)
- no alerting on `rate_limited_*` spikes
Result: system looks controlled, but degrades quickly under real spikes.
Self-check
Quick rate-limiting check before production launch:
Before production, you need at least access control, limits, audit logs, and an emergency stop.
FAQ
Q: Which limits should we start with?
A: At minimum: per_user, per_tenant, global + a small burst control. Then tune based on real stop events.
Q: If external API already returns 429, do we still need our own rate limiting?
A: Yes. Internal rate limiting protects runtime before external 429 and gives controlled stop reasons, retry_after, and audit.
Q: What is better at limit hit: stop or queue?
A: For sync runs, stop + retry_after is usually better. For async pipelines, you can add queueing, but still with explicit limits and timeout.
Q: Does rate limiting replace budget controls?
A: No. Rate limiting controls call frequency; budget controls limit total run spend.
Q: Where to store counters?
A: In shared low-latency storage with atomic operations (often Redis). Without this, limits become inconsistent across instances.
Where Rate Limiting fits in the system
Rate limiting is one of Agent Governance layers. Together with RBAC, budgets, step limits, approval, and audit, it forms a unified execution-control system.
Related pages
Next on this topic:
- Agent Governance Overview – overall control model for agents in production.
- Budget Controls – how to limit total run spend.
- Step limits – how to stop loops at runtime level.
- Kill switch – how to emergency-stop actions without release.
- Audit Logs for agents – how to analyze stop reasons and load spikes.