Rate Limiting for AI Agents: how to contain request spikes and retry storms

Practical rate limiting in production: per-user/per-tenant/global limits, burst control, retry_after, backoff, audit logs, and alerting.
On this page
  1. Idea in 30 seconds
  2. Problem
  3. Solution
  4. Rate limiting ≠ step limits
  5. Rate-limiting control components
  6. How it looks in architecture
  7. Example
  8. In code it looks like this
  9. How it looks during execution
  10. Scenario 1: stopped by tenant limit
  11. Scenario 2: burst spike
  12. Scenario 3: normal execution
  13. Common mistakes
  14. Self-check
  15. FAQ
  16. Where Rate Limiting fits in the system
  17. Related pages

Idea in 30 seconds

Rate limiting is a runtime control that caps how often an agent makes external calls, so request spikes and retry storms do not escalate in production.

When you need it: when an agent frequently calls model/tool APIs, has retry logic, and runs under peak load.

Problem

Without rate limiting, one unstable service quickly creates a chain: retry → call → retry again. In demo environments this is barely visible. In production, that behavior creates waves of 429/5xx, queue growth, and latency.

The worst part is that the incident self-amplifies:

  • one overly active user consumes quotas
  • one tenant "chokes" others
  • a global spike breaks dependencies for everyone at once

And every minute without call-frequency control adds more retries, queues, and latency, until the system effectively DDoSes itself.

Analogy: it is like a metered ramp onto a highway. Without flow control, even a good road becomes a traffic jam within minutes.

Solution

The solution is to add a centralized rate-limit policy layer in the runtime and tool gateway. Every external agent call is checked against per_user, per_tenant, global, and burst_tokens limits.

The policy returns a technical decision: allow, or stop with an explicit reason:

  • rate_limited_user
  • rate_limited_tenant
  • rate_limited_global
  • burst_limited

When stop is returned, the runtime sends retry_after_ms to the client and does not execute the call. This is a separate system layer, not part of prompt or model logic.
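A minimal sketch of that decision object, assuming a Python runtime; the names (`Decision`, `outcome`, `retry_after_ms`) mirror the fields used later on this page and are illustrative, not a fixed API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    outcome: str                          # "allow" or "stop"
    reason: Optional[str] = None          # e.g. "rate_limited_tenant"
    scope: Optional[str] = None           # "user" | "tenant" | "global" | "burst"
    retry_after_ms: Optional[int] = None  # only meaningful on stop

    @classmethod
    def allow(cls, reason: Optional[str] = None) -> "Decision":
        return cls(outcome="allow", reason=reason)

    @classmethod
    def stop(cls, reason: str, scope: str, retry_after_ms: int) -> "Decision":
        return cls(outcome="stop", reason=reason, scope=scope,
                   retry_after_ms=retry_after_ms)
```

Keeping a single outcome/reason shape for both branches makes audit logging and client handling uniform.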

Rate limiting ≠ step limits

These are different control levels:

  • Rate limiting limits frequency of external calls.
  • Step limits limit runtime-loop length and behavior.

One without the other is insufficient:

  • without rate limiting, external APIs fail under spikes and retry storms
  • without step limits, a run may still spin too long even at moderate call frequency

Example:

  • rate limiting: no more than per_user=6 calls per 10 seconds
  • step limits: max_steps=18, max_repeat_action=3
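Sketched as a single loop guard (function and constant names hypothetical, values taken from the example above), the two levels stay independent checks:

```python
def should_continue(step: int, calls_in_window: int):
    """Returns (ok, reason). Either control can stop the run on its own."""
    MAX_STEPS = 18           # step limit: bounds runtime-loop length
    PER_USER_PER_10S = 6     # rate limit: bounds external-call frequency
    if step >= MAX_STEPS:
        return (False, "step_limit_exceeded")
    if calls_in_window >= PER_USER_PER_10S:
        return (False, "rate_limited_user")
    return (True, None)
```

A run can be stopped by either check alone, which is exactly why one control does not substitute for the other.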

Rate-limiting control components

These components work together on every external agent call.

Component breakdown (what each control does, its key mechanics, and why it matters):

  • Per-user limit: controls the behavior of one user. Mechanics: per_user quota with a sliding window. Why: prevents one user from consuming all capacity.
  • Per-tenant limit: controls the load of one tenant. Mechanics: per_tenant quota with tenant-scoped keys. Why: isolates spikes between customers.
  • Global limit: controls total system load. Mechanics: a global cap via a shared limiter. Why: protects external dependencies from mass spikes.
  • Burst control: controls short peak spikes. Mechanics: token bucket with a refill rate. Why: absorbs sudden jumps without a full system stop.
  • Rate-limit observability: provides visibility into policy decisions. Mechanics: audit logs and alerts on stop spikes. Why: does not limit calls directly, but helps identify spike sources fast.
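Burst control from the list above is usually a token bucket; a single-process sketch using the burst_tokens=5 / refill_per_second=2 values that appear later in the config (the explicit `now` parameter exists only to make the example deterministic):

```python
import time
from typing import Optional

class TokenBucket:
    """Burst control: absorbs a short spike, then throttles to the refill rate."""

    def __init__(self, capacity: int = 5, refill_per_second: float = 2.0,
                 now: Optional[float] = None):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full: the whole burst is available
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_second=2.0, now=100.0)
burst = [bucket.try_acquire(now=100.0) for _ in range(7)]
# the 5 stored tokens absorb the spike; calls 6 and 7 are rejected until refill
```

One second later (now=101.0) two tokens have been refilled, so traffic resumes at the sustained rate without a full stop.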

Example alert:

Slack: 🛑 Support-Agent hit rate_limited_tenant. retry_after=1200ms, tenant=t_42.

How it looks in architecture

The rate-limit policy layer sits between the runtime and external model/tool APIs. Every decision (allow or stop) is recorded in the audit log.

Every external agent call passes through this flow before execution: the runtime does not execute calls directly; it first asks the policy layer for a decision.

Flow summary:

  • Runtime forms an external agent call
  • Policy checks per_user, per_tenant, global, and burst_tokens
  • allow → call is executed
  • stop → retry_after_ms and partial response are returned
  • both decisions are written to audit log

Example

A support agent processes many requests at once and retries crm.search multiple times.

With rate limiting:

  • per_user = 6 / 10s
  • per_tenant = 120 / min
  • global = 50 / s
  • burst_tokens = 5

→ the spike is stopped at policy level before dependencies and queues fail.

Rate limiting stops the incident right before the external calls are made, not after a wave of 429s.

In code it looks like this

The simplified scheme below shows the main flow. Critical point: the rate-limit check must be O(1) and atomic (usually via Redis/Lua or equivalent); otherwise it becomes a bottleneck under spikes. After stop(...), runtime typically returns a partial response to the client with an explicit reason and retry_after_ms.

Example rate-limit config:

YAML
rate_limits:
  per_user_10s: 6
  per_tenant_min: 120
  global_rps: 50
  burst_tokens: 5
  refill_per_second: 2
PYTHON
action = planner.next(state)
action_key = make_action_key(action.name, action.args)

if not action.is_external_call():
    # execute_local: illustrative helper that runs local actions without an external API.
    # Decision.allow: illustrative helper that keeps a single outcome/reason model.
    local_result = execute_local(action)
    local_decision = Decision.allow(reason=None)
    audit.log(
        run_id,
        decision=local_decision.outcome,
        reason=local_decision.reason,
        scope="local",
        action=action.name,
        action_key=action_key,
        result=local_result.status,
    )
    return local_result

decision = rate_limit.check(
    user_id=state.user_id,
    tenant_id=state.tenant_id,
    action=action.name,
    now_ms=clock.now_ms(),
)

if decision.outcome == "stop":
    audit.log(
        run_id,
        decision=decision.outcome,
        reason=decision.reason,
        scope=decision.scope,
        retry_after_ms=decision.retry_after_ms,
        action=action.name,
        action_key=action_key,
    )
    alerts.notify_if_needed(run_id, decision.reason, scope=decision.scope)
    return stop(decision.reason, retry_after_ms=decision.retry_after_ms)

result = executor.execute(action)

audit.log(
    run_id,
    decision=decision.outcome,
    reason=decision.reason,
    scope=decision.scope,
    action=action.name,
    action_key=action_key,
    result=result.status,
)

return result
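The internals of rate_limit.check are not shown above; here is a single-process sketch of one scope (a per-user sliding window) that returns the same allow/stop shape. Production systems run this logic atomically in shared storage (for example Redis plus a Lua script), not in process memory; the class and field names are illustrative:

```python
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Sliding-window limiter for one scope (here: per-user)."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.events = defaultdict(deque)  # key -> timestamps of recent calls

    def check(self, key: str, now: float) -> dict:
        q = self.events[key]
        # Drop events that have fallen out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) < self.limit:
            q.append(now)
            return {"outcome": "allow"}
        # Time until the oldest event leaves the window.
        retry_after_ms = int((self.window_s - (now - q[0])) * 1000)
        return {"outcome": "stop", "reason": "rate_limited_user",
                "retry_after_ms": retry_after_ms}

per_user = SlidingWindowLimiter(limit=6, window_s=10.0)
decisions = [per_user.check("u_1", now=t) for t in (0, 1, 2, 3, 4, 5, 6)]
# calls at t=0..5 fit the 6-per-10s quota; the call at t=6 is stopped
```

The retry_after_ms it returns is exactly what the runtime forwards to the client in the stop branch above.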

How it looks during execution

Scenario 1: stopped by tenant limit

  1. Runtime forms external call crm.search.
  2. Policy sees per_tenant quota exceeded.
  3. Decision: stop (reason=rate_limited_tenant).
  4. Runtime returns retry_after_ms.
  5. Call is not executed, event is written to audit log.

Scenario 2: burst spike

  1. Several runs create a short call spike at the same time.
  2. Policy exhausts burst_tokens.
  3. Decision: stop (reason=burst_limited).
  4. Some calls are rejected with retry_after_ms.
  5. System remains stable without cascade failure.

Scenario 3: normal execution

  1. Runtime forms external call.
  2. Policy checks limits: all within bounds.
  3. Decision: allow.
  4. Call is executed.
  5. Decision and result are written to audit log.

Common mistakes

  • setting only global limit without per_user/per_tenant isolation
  • not returning retry_after on stop
  • retrying without backoff and jitter
  • checking rate limits in only one layer (only runtime or only gateway)
  • not logging stop decisions (reason, scope, retry_after_ms)
  • no alerting on rate_limited_* spikes

Result: the system looks controlled but degrades quickly under real spikes.
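The backoff-and-jitter mistake is cheap to avoid on the client side; a sketch of exponential backoff with full jitter that also honors a server-supplied retry_after_ms (function names hypothetical, the `rng` parameter exists only to make the example testable):

```python
import random

def backoff_delays(attempts: int, base_ms: int = 200, cap_ms: int = 10_000,
                   rng=random.random):
    """Full jitter: delay_i is uniform in [0, min(cap, base * 2**i))."""
    return [rng() * min(cap_ms, base_ms * (2 ** i)) for i in range(attempts)]

def next_delay_ms(attempt: int, retry_after_ms=None, base_ms: int = 200,
                  cap_ms: int = 10_000):
    """A server-supplied retry_after_ms takes precedence over computed backoff."""
    if retry_after_ms is not None:
        return retry_after_ms
    return random.random() * min(cap_ms, base_ms * (2 ** attempt))
```

Jitter matters because synchronized clients retrying at identical intervals recreate the very spike the limiter just stopped.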

Self-check

Quick rate-limiting check before production launch:


Before production, you need at least access control, limits, audit logs, and an emergency stop.

FAQ

Q: Which limits should we start with?
A: At minimum: per_user, per_tenant, global + a small burst control. Then tune based on real stop events.

Q: If external API already returns 429, do we still need our own rate limiting?
A: Yes. Internal rate limiting protects runtime before external 429 and gives controlled stop reasons, retry_after, and audit.

Q: What is better at limit hit: stop or queue?
A: For sync runs, stop + retry_after is usually better. For async pipelines, you can add queueing, but still with explicit limits and timeout.
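For the async case, "queueing with explicit limits and timeout" can be as small as a semaphore plus a bounded wait; a sketch using asyncio (names hypothetical):

```python
import asyncio

async def call_with_queue(coro_fn, sem: asyncio.Semaphore, timeout_s: float):
    """Queue behind an explicit concurrency limit, but never wait forever."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"outcome": "stop", "reason": "queue_timeout"}
    try:
        return {"outcome": "allow", "result": await coro_fn()}
    finally:
        sem.release()

async def demo():
    sem = asyncio.Semaphore(2)        # at most 2 in-flight external calls
    async def slow_call():
        await asyncio.sleep(0.05)     # stands in for an external API call
        return "ok"
    return await asyncio.gather(
        *[call_with_queue(slow_call, sem, timeout_s=0.01) for _ in range(4)]
    )
```

Running demo() lets two calls through immediately, while the other two time out waiting for a slot instead of piling up an unbounded queue.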

Q: Does rate limiting replace budget controls?
A: No. Rate limiting controls call frequency; budget controls cap total run spend.

Q: Where to store counters?
A: In shared low-latency storage with atomic operations (often Redis). Without this, limits become inconsistent across instances.
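The Redis pattern behind that answer is typically an increment plus expiry on a window-scoped key, executed atomically; a pure-Python stand-in for the same fixed-window logic, with a dict playing the role of shared storage (no real atomicity, illustration only):

```python
def fixed_window_incr(store: dict, key: str, now_s: int, window_s: int = 10) -> int:
    """Stand-in for an atomic INCR on a key scoped to the current time window.
    In Redis this would be INCR + EXPIRE in one script; here a dict suffices
    to show why counters must live in one shared place across instances."""
    bucket_key = (key, now_s // window_s)  # key changes when the window rolls over
    store[bucket_key] = store.get(bucket_key, 0) + 1
    return store[bucket_key]

store = {}
counts = [fixed_window_incr(store, "u_1", now_s=t) for t in (0, 3, 9, 11)]
# t=0, 3, 9 share window 0 and count 1, 2, 3; t=11 starts window 1 and resets to 1
```

If each runtime instance kept its own `store`, the same user could hit every instance up to the limit, which is the inconsistency the answer warns about.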

Where Rate Limiting fits in the system

Rate limiting is one of Agent Governance layers. Together with RBAC, budgets, step limits, approval, and audit, it forms a unified execution-control system.

Next on this topic:

⏱️ 7 min read • Updated March 27, 2026 • Difficulty: ★★★
Implement in OnceOnly
Budgets + permissions you can enforce at the boundary.
# onceonly guardrails (concept)
version: 1
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
  max_usd: 1.00
policy:
  tool_allowlist:
    - search.read
    - http.get
writes:
  require_approval: true
  idempotency: true
controls:
  kill_switch: { enabled: true }
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.