Rate Limiting for AI Agents: how to contain request spikes and retry storms

Practical rate limiting in production: per-user/per-tenant/global limits, burst control, retry_after, backoff, audit logs, and alerting.
On this page
  1. Idea in 30 seconds
  2. Problem
  3. Solution
  4. Rate limiting ≠ step limits
  5. Rate-limiting control components
  6. How it looks in architecture
  7. Example
  8. In code it looks like this
  9. How it looks during execution
  10. Scenario 1: stopped by tenant limit
  11. Scenario 2: burst spike
  12. Scenario 3: normal execution
  13. Common mistakes
  14. Self-check
  15. FAQ
  16. Where Rate Limiting fits in the system
  17. Related pages

Idea in 30 seconds

Rate limiting is a runtime control that caps how often an agent makes external calls, so request spikes and retry storms do not escalate in production.

When you need it: when an agent frequently calls model/tool APIs, has retry logic, and runs under peak load.

Problem

Without rate limiting, one unstable service quickly creates a chain: retry → call → retry again. In demo environments this is barely visible. In production, that behavior creates waves of 429/5xx, queue growth, and latency.

The worst part is that the incident self-amplifies:

  • one overly active user consumes quotas
  • one tenant "chokes" others
  • a global spike breaks dependencies for everyone at once

And every minute without call-frequency control adds more retries, queues, and latency, until the system effectively DDoSes itself.

Analogy: it is like a metered ramp onto a highway. Without flow control, even a good road becomes a traffic jam within minutes.

Solution

The solution is to add a centralized rate-limit policy layer in the runtime and tool gateway. Every external agent call is checked against per_user, per_tenant, global, and burst_tokens limits.

The policy returns a technical decision: allow, or stop with an explicit reason:

  • rate_limited_user
  • rate_limited_tenant
  • rate_limited_global
  • burst_limited

When stop is returned, the runtime sends retry_after_ms to the client and does not execute the call. This is a separate system layer, not part of prompt or model logic.
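A minimal sketch of that decision object, assuming a Python runtime; the names (`Decision`, `outcome`, `retry_after_ms`) mirror the fields used later on this page and are illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    outcome: str                          # "allow" or "stop"
    reason: Optional[str] = None          # e.g. "rate_limited_tenant"
    scope: Optional[str] = None           # "user" | "tenant" | "global" | "burst"
    retry_after_ms: Optional[int] = None  # only meaningful on stop

    @classmethod
    def allow(cls, reason: Optional[str] = None) -> "Decision":
        return cls(outcome="allow", reason=reason)

    @classmethod
    def stop(cls, reason: str, scope: str, retry_after_ms: int) -> "Decision":
        return cls(outcome="stop", reason=reason, scope=scope,
                   retry_after_ms=retry_after_ms)
```

Keeping a single outcome/reason shape for both branches makes audit logging and client handling uniform.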

Rate limiting ≠ step limits

These are different control levels:

  • Rate limiting limits frequency of external calls.
  • Step limits limit runtime-loop length and behavior.

One without the other is insufficient:

  • without rate limiting, external APIs fail under spikes and retry storms
  • without step limits, a run may still spin too long even at moderate call frequency

Example:

  • rate limiting: no more than per_user=6 calls per 10 seconds
  • step limits: max_steps=18, max_repeat_action=3
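Sketched as a single loop guard (function and constant names hypothetical, values taken from the example above), the two levels stay independent checks:

```python
def should_continue(step: int, calls_in_window: int):
    """Returns (ok, reason). Either control can stop the run on its own."""
    MAX_STEPS = 18           # step limit: bounds runtime-loop length
    PER_USER_PER_10S = 6     # rate limit: bounds external-call frequency
    if step >= MAX_STEPS:
        return (False, "step_limit_exceeded")
    if calls_in_window >= PER_USER_PER_10S:
        return (False, "rate_limited_user")
    return (True, None)
```

A run can be stopped by either check alone, which is exactly why one control does not substitute for the other.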

Rate-limiting control components

These components work together on every external agent call.

Component breakdown (what each control does, its key mechanics, and why it matters):

  • Per-user limit: controls the behavior of one user. Mechanics: per_user quota with a sliding window. Why: prevents one user from consuming all capacity.
  • Per-tenant limit: controls the load of one tenant. Mechanics: per_tenant quota with tenant-scoped keys. Why: isolates spikes between customers.
  • Global limit: controls total system load. Mechanics: a global cap via a shared limiter. Why: protects external dependencies from mass spikes.
  • Burst control: controls short peak spikes. Mechanics: token bucket with a refill rate. Why: absorbs sudden jumps without a full system stop.
  • Rate-limit observability: provides visibility into policy decisions. Mechanics: audit logs and alerts on stop spikes. Why: does not limit calls directly, but helps identify spike sources fast.
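Burst control from the list above is usually a token bucket; a single-process sketch using the burst_tokens=5 / refill_per_second=2 values that appear later in the config (the explicit `now` parameter exists only to make the example deterministic):

```python
import time
from typing import Optional

class TokenBucket:
    """Burst control: absorbs a short spike, then throttles to the refill rate."""

    def __init__(self, capacity: int = 5, refill_per_second: float = 2.0,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full: the whole burst is available
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_second=2.0, now=100.0)
burst = [bucket.try_acquire(now=100.0) for _ in range(7)]
# the 5 stored tokens absorb the spike; calls 6 and 7 are rejected until refill
```

One second later (now=101.0) two tokens have been refilled, so traffic resumes at the sustained rate without a full stop.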

Example alert:

Slack: 🛑 Support-Agent hit rate_limited_tenant. retry_after=1200ms, tenant=t_42.

How it looks in architecture

The rate-limit policy layer sits between the runtime and external model/tool APIs. Every decision (allow or stop) is recorded in the audit log.

Every external agent call passes through this flow before execution: the runtime does not execute calls directly; it first asks the policy layer for a decision.

Flow summary:

  • Runtime forms an external agent call
  • Policy checks per_user, per_tenant, global, and burst_tokens
  • allow → call is executed
  • stop → retry_after_ms and partial response are returned
  • both decisions are written to audit log

Example

A support agent processes many requests at once and retries crm.search multiple times.

With rate limiting:

  • per_user = 6 / 10s
  • per_tenant = 120 / min
  • global = 50 / s
  • burst_tokens = 5

→ the spike is stopped at policy level before dependencies and queues fail.

Rate limiting stops the incident right before the external calls are made, not after a wave of 429s.

In code it looks like this

The simplified scheme below shows the main flow. Critical point: the rate-limit check must be O(1) and atomic (usually via Redis/Lua or equivalent); otherwise it becomes a bottleneck under spikes. After stop(...), runtime typically returns a partial response to the client with an explicit reason and retry_after_ms.

Example rate-limit config:

YAML
rate_limits:
  per_user_10s: 6
  per_tenant_min: 120
  global_rps: 50
  burst_tokens: 5
  refill_per_second: 2
PYTHON
action = planner.next(state)
action_key = make_action_key(action.name, action.args)

if not action.is_external_call():
    # execute_local: illustrative helper that runs local actions without an external API.
    # Decision.allow: illustrative helper that keeps a single outcome/reason model.
    local_result = execute_local(action)
    local_decision = Decision.allow(reason=None)
    audit.log(
        run_id,
        decision=local_decision.outcome,
        reason=local_decision.reason,
        scope="local",
        action=action.name,
        action_key=action_key,
        result=local_result.status,
    )
    return local_result

decision = rate_limit.check(
    user_id=state.user_id,
    tenant_id=state.tenant_id,
    action=action.name,
    now_ms=clock.now_ms(),
)

if decision.outcome == "stop":
    audit.log(
        run_id,
        decision=decision.outcome,
        reason=decision.reason,
        scope=decision.scope,
        retry_after_ms=decision.retry_after_ms,
        action=action.name,
        action_key=action_key,
    )
    alerts.notify_if_needed(run_id, decision.reason, scope=decision.scope)
    return stop(decision.reason, retry_after_ms=decision.retry_after_ms)

result = executor.execute(action)

audit.log(
    run_id,
    decision=decision.outcome,
    reason=decision.reason,
    scope=decision.scope,
    action=action.name,
    action_key=action_key,
    result=result.status,
)

return result
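The internals of rate_limit.check are not shown above; here is a single-process sketch of one scope (a per-user sliding window) that returns the same allow/stop shape. Production systems run this logic atomically in shared storage (for example Redis plus a Lua script), not in process memory; the class and field names are illustrative:

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sliding-window limiter for one scope (here: per-user)."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.events = defaultdict(deque)  # key -> timestamps of recent calls

    def check(self, key: str, now: float) -> dict:
        q = self.events[key]
        # Drop events that have fallen out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return {"outcome": "allow"}
        # Time until the oldest event leaves the window.
        retry_after_ms = int((self.window_s - (now - q[0])) * 1000)
        return {"outcome": "stop", "reason": "rate_limited_user",
                "retry_after_ms": retry_after_ms}

per_user = SlidingWindowLimiter(limit=6, window_s=10.0)
decisions = [per_user.check("u_1", now=t) for t in (0, 1, 2, 3, 4, 5, 6)]
# calls at t=0..5 fit the 6-per-10s quota; the call at t=6 is stopped
```

The retry_after_ms it returns is exactly what the runtime forwards to the client in the stop branch above.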

How it looks during execution

Scenario 1: stopped by tenant limit

  1. Runtime forms external call crm.search.
  2. Policy sees per_tenant quota exceeded.
  3. Decision: stop (reason=rate_limited_tenant).
  4. Runtime returns retry_after_ms.
  5. Call is not executed, event is written to audit log.

Scenario 2: burst spike

  1. Several runs create a short call spike at the same time.
  2. Policy exhausts burst_tokens.
  3. Decision: stop (reason=burst_limited).
  4. Some calls are rejected with retry_after_ms.
  5. System remains stable without cascade failure.

Scenario 3: normal execution

  1. Runtime forms external call.
  2. Policy checks limits: all within bounds.
  3. Decision: allow.
  4. Call is executed.
  5. Decision and result are written to audit log.

Common mistakes

  • setting only global limit without per_user/per_tenant isolation
  • not returning retry_after on stop
  • retrying without backoff and jitter
  • checking rate limits in only one layer (only runtime or only gateway)
  • not logging stop decisions (reason, scope, retry_after_ms)
  • no alerting on rate_limited_* spikes

Result: the system looks controlled but degrades quickly under real spikes.
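The backoff-and-jitter mistake is cheap to avoid on the client side; a sketch of exponential backoff with full jitter that also honors a server-supplied retry_after_ms (function names hypothetical, the `rng` parameter exists only to make the example testable):

```python
import random

def backoff_delays(attempts: int, base_ms: int = 200, cap_ms: int = 10_000,
                   rng=random.random):
    """Full jitter: delay_i is uniform in [0, min(cap, base * 2**i))."""
    return [rng() * min(cap_ms, base_ms * (2 ** i)) for i in range(attempts)]

def next_delay_ms(attempt: int, retry_after_ms=None, base_ms: int = 200,
                  cap_ms: int = 10_000):
    """A server-supplied retry_after_ms takes precedence over computed backoff."""
    if retry_after_ms is not None:
        return retry_after_ms
    return random.random() * min(cap_ms, base_ms * (2 ** attempt))
```

Jitter matters because synchronized clients retrying at identical intervals recreate the very spike the limiter just stopped.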

Self-check

Quick rate-limiting check before production launch:


Before production, you need at least access control, limits, audit logs, and an emergency stop.

FAQ

Q: Which limits should we start with?
A: At minimum: per_user, per_tenant, global + a small burst control. Then tune based on real stop events.

Q: If external API already returns 429, do we still need our own rate limiting?
A: Yes. Internal rate limiting protects runtime before external 429 and gives controlled stop reasons, retry_after, and audit.

Q: What is better at limit hit: stop or queue?
A: For sync runs, stop + retry_after is usually better. For async pipelines, you can add queueing, but still with explicit limits and timeout.
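For the async case, "queueing with explicit limits and timeout" can be as small as a semaphore plus a bounded wait; a sketch using asyncio (names hypothetical):

```python
import asyncio

async def call_with_queue(coro_fn, sem: asyncio.Semaphore, timeout_s: float):
    """Queue behind an explicit concurrency limit, but never wait forever."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"outcome": "stop", "reason": "queue_timeout"}
    try:
        return {"outcome": "allow", "result": await coro_fn()}
    finally:
        sem.release()

async def demo():
    sem = asyncio.Semaphore(2)        # at most 2 in-flight external calls
    async def slow_call():
        await asyncio.sleep(0.05)     # stands in for an external API call
        return "ok"
    return await asyncio.gather(
        *[call_with_queue(slow_call, sem, timeout_s=0.01) for _ in range(4)]
    )
```

Running demo() lets two calls through immediately, while the other two time out waiting for a slot instead of piling up an unbounded queue.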

Q: Does rate limiting replace budget controls?
A: No. Rate limiting controls call frequency; budget controls cap total run spend.

Q: Where to store counters?
A: In shared low-latency storage with atomic operations (often Redis). Without this, limits become inconsistent across instances.
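The Redis pattern behind that answer is typically an increment plus expiry on a window-scoped key, executed atomically; a pure-Python stand-in for the same fixed-window logic, with a dict playing the role of shared storage (no real atomicity, illustration only):

```python
def fixed_window_incr(store: dict, key: str, now_s: int, window_s: int = 10) -> int:
    """Stand-in for an atomic INCR on a key scoped to the current time window.
    In Redis this would be INCR + EXPIRE in one script; here a dict suffices
    to show why counters must live in one shared place across instances."""
    bucket_key = (key, now_s // window_s)  # key changes when the window rolls over
    store[bucket_key] = store.get(bucket_key, 0) + 1
    return store[bucket_key]

store = {}
counts = [fixed_window_incr(store, "u_1", now_s=t) for t in (0, 3, 9, 11)]
# t=0, 3, 9 share window 0 and count 1, 2, 3; t=11 starts window 1 and resets to 1
```

If each runtime instance kept its own `store`, the same user could hit every instance up to the limit, which is the inconsistency the answer warns about.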

Where Rate Limiting fits in the system

Rate limiting is one of Agent Governance layers. Together with RBAC, budgets, step limits, approval, and audit, it forms a unified execution-control system.

Next on this topic:

⏱️ 7 min read • Updated March 27, 2026 • Difficulty: ★★★
Implement in OnceOnly
Budgets + permissions you can enforce at the boundary.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
writes:
  require_approval: true
  idempotency: true
controls:
  kill_switch: { enabled: true }
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.