AI Agent Production Stack (The Stuff Between Your Agent and Disaster)

Your agent isn’t a single prompt. It’s a stack: budgets, tools, state, logs, and controls. This is the glue that stops incidents.
On this page
  1. The problem
  2. Why this happens in real systems
  3. What breaks if you ignore it
  4. The stack (what we actually run)
  5. Diagram (where the control layer sits)
  6. Layer-by-layer: what we learned the hard way
  7. Entry point
  8. Orchestrator
  9. Model layer
  10. Tool layer
  11. State layer
  12. Observability
  13. Control layer
  14. Code: orchestration skeleton (TypeScript)
  15. What we measure (because “it seems fine” is not a metric)
  16. Stop reasons taxonomy (so you can debug without vibes)
  17. Multi-tenant reality (where most incidents hide)
  18. Rate limits & circuit breakers (because agents amplify outages)
  19. The rollout we use (because agents don’t deserve trust on day one)
  20. Where “memory” belongs (hint: not inside prompts)
  21. Incident response (what you want ready before the first page)
  22. Testing & replay (because “works on my prompt” isn’t a test)
  23. The “boring first” build order
  24. Real failure
  25. Trade-offs
  26. When NOT to build an agent stack
  27. Links

The problem

In dev, your agent “works”.

In prod:

  • it loops on a flaky API
  • it makes 200 tool calls because “just one more”
  • you can’t explain what happened because the only log is the final answer

That’s not an LLM problem. That’s a stack problem.

Why this happens in real systems

Agents are basically:

  • a planner (LLM)
  • a runtime (your code)
  • side effects (tools)
  • state (memory/artifacts)
  • constraints (budgets/policy)
  • observability (logs/audit)

If you only build the planner, you’ll get paged.

What breaks if you ignore it

  • No audit = no postmortem (or the postmortem is “the model did it”)
  • No budgets = unbounded cost
  • No policy boundary = accidental writes with prod creds
  • No state = repeated work, duplicate tool calls, prompt bloat

The stack (what we actually run)

  1. Entry point: UI/API, auth, request id
  2. Orchestrator: routing, retries, budgets, tracing
  3. Model layer: LLM calls (with spend tracking)
  4. Tool layer: APIs, browser, DB (with allowlists)
  5. State: memory, artifacts, caches, idempotency keys
  6. Observability: structured logs, traces, audit events
  7. Control layer: policy engine, kill switch, incident stop

Diagram (where the control layer sits)

This is the mental model we use:

  user -> entry point -> orchestrator <-> model layer
                             |
                             v
                        tool layer <-> state
  control layer: wraps the orchestrator and every tool call
  observability: taps all of it

If you put “control” inside a prompt, you don’t have a control layer. You have a suggestion.

Layer-by-layer: what we learned the hard way

Entry point

The entry point is where you decide the blast radius.

Good defaults:

  • authenticate before the agent runs
  • generate a request id
  • bind tenant/environment to that request id
  • set a budget up front (don’t let the model negotiate budgets)

If you let the model pick the tenant or environment, you will eventually write to the wrong one.
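Those defaults fit in a small context object the entry point builds before the agent ever runs. A sketch (the `RunContext` shape and the specific budget numbers are illustrative, not a real API):

```ts
type Budget = { maxSteps: number; maxSeconds: number; maxUsd: number };

type RunContext = {
  requestId: string;
  tenant: string;
  env: "staging" | "prod";
  budget: Budget;
};

// The entry point fixes tenant, environment, and budget up front.
// The agent only ever reads this context; it never writes it.
function createRunContext(tenant: string, env: "staging" | "prod"): RunContext {
  return {
    requestId: `req_${Math.random().toString(36).slice(2, 10)}`,
    tenant,
    env,
    // Budgets are set here, not negotiated by the model.
    budget: { maxSteps: 25, maxSeconds: 120, maxUsd: 2.0 },
  };
}
```

Everything downstream (orchestrator, tools, logs) takes the context as a parameter, which is what makes "the model picked the wrong tenant" structurally impossible.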

Orchestrator

This is the runtime that keeps your agent honest:

  • step loop
  • timeouts
  • retry policy
  • tool allowlists
  • trace collection
  • stop reasons

If you don’t build this, every agent becomes a custom snowflake that fails differently. Snowflakes are cute until you operate them.

Model layer

Your model layer is mostly about:

  • provider fallbacks (if you have them)
  • spend tracking
  • predictable output formats (tool actions)

The model is not the only “unreliable” part. But it’s the only part everyone blames, because that’s easier than admitting the runtime is missing.

Tool layer

Tools are where the side effects live. This is where you enforce:

  • allowlists (what can be called)
  • permissions (what can be written)
  • idempotency keys (what can be repeated safely)
  • timeouts (what can’t hang)
  • rate limits (what can’t DDoS your dependencies)

The tool layer should not accept “do the thing” as input. It should accept structured args with validation.
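A sketch of that boundary (the `defineTool` helper and its validator are assumptions, not a real library; a production validator would likely use a schema library):

```ts
// A tool only accepts structured, validated args.
type ToolSpec<A> = {
  validate: (raw: unknown) => A; // throws on bad input
  run: (args: A) => Promise<string>;
};

function defineTool<A>(spec: ToolSpec<A>) {
  return async (raw: unknown) => {
    const args = spec.validate(raw); // rejects "do the thing" style input
    return spec.run(args);
  };
}

const httpGet = defineTool({
  validate: (raw) => {
    const r = raw as { url?: unknown };
    if (typeof r?.url !== "string" || !r.url.startsWith("https://")) {
      throw new Error("http.get: args must be { url } with an https URL");
    }
    return { url: r.url };
  },
  // Real implementation would fetch with a timeout and an idempotency key.
  run: async ({ url }) => `fetched ${url}`,
});
```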

State layer

State is not one bucket.

We split it into:

  • scratch: short-lived per-run notes (small, structured)
  • artifacts: outputs you need later (drafts, extracts, plans)
  • memory: what you want to carry across runs (carefully)
  • cache: dedupe expensive reads (URLs, KB lookups)

If you dump everything into “memory”, you get prompt bloat and worse answers. If you carry nothing, you get repeated work and duplicate tool calls.
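One way to encode that split in types (field names are illustrative; the point is that only scratch notes and artifact references flow back into the prompt):

```ts
type RunState = {
  scratch: Record<string, string>;          // per-run notes, wiped at end
  artifacts: { id: string; uri: string }[]; // stored externally, referenced by id
  memoryKeys: string[];                     // scoped pointers, not raw dumps
  cache: Map<string, string>;               // dedupe for expensive reads
};

function newRunState(): RunState {
  return { scratch: {}, artifacts: [], memoryKeys: [], cache: new Map() };
}

// Only scratch + artifact references go back into the prompt, which is
// what keeps context small instead of ballooning.
function promptView(state: RunState): string {
  const notes = Object.entries(state.scratch)
    .map(([k, v]) => `${k}: ${v}`)
    .join("\n");
  const refs = state.artifacts.map((a) => `artifact:${a.id}`).join(", ");
  return `${notes}\n${refs}`.trim();
}
```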

Observability

If you can’t answer “what did it do?” you can’t run it in production.

Minimum observability:

  • action trace (steps, tool calls, stop reason)
  • structured tool logs (args hash, duration, status)
  • spend/cost estimation
  • per-tenant usage metrics

If you’re serious, add tracing (model call spans, tool spans). But even plain structured logs beat “the model said so”.
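A minimal structured tool log line, assuming Node and hashing args so logs stay greppable without leaking payloads (field names are assumptions):

```ts
import { createHash } from "node:crypto";

type ToolLog = {
  tool: string;
  argsHash: string; // hash, not raw args: safe to log and still joinable
  ms: number;
  status: "ok" | "error";
};

function toolLog(
  tool: string,
  args: unknown,
  ms: number,
  status: "ok" | "error"
): ToolLog {
  const argsHash = createHash("sha256")
    .update(JSON.stringify(args))
    .digest("hex")
    .slice(0, 12); // short prefix is enough to correlate retries
  return { tool, argsHash, ms, status };
}
```

The same args always hash to the same value, so "it retried the same call 40 times" becomes a one-line grep instead of an argument.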

Control layer

The control layer is the piece you want to exist when you’re asleep:

  • budgets (hard limits)
  • tool permissions (least privilege)
  • approvals (for writes)
  • kill switch (operator stop)
  • incident stop (circuit breakers)

It’s not “security theater”. It’s what turns an LLM demo into a system you can leave running.

Code: orchestration skeleton (TypeScript)

You don’t need a massive framework. You do need explicit control points.

TS
type Budget = { maxSteps: number; maxSeconds: number; maxUsd: number };
type ToolName = "web.search" | "http.get" | "ticket.create";

type Policy = {
  allowTools: ToolName[];
  budget: Budget;
  requireApprovalFor: ToolName[];
};

type Action =
  | { type: "finish"; text: string }
  | { type: "tool"; tool: ToolName; args: unknown };

type AuditEvent =
  | { type: "tool.call"; tool: ToolName; args: unknown; ms: number }
  | { type: "budget.stop"; reason: string }
  | { type: "kill"; reason: string };

// Provided elsewhere: kill switch check, planner call, approval gate,
// tool runner, and state reducer.
declare function killSwitchIsOn(): Promise<boolean>;
declare function llmDecideNext(input: string): Promise<Action>;
declare function waitForHumanApproval(action: Action): Promise<void>;
declare function callTool(tool: ToolName, args: unknown): Promise<string>;
declare function updateState(input: string, action: Action, obs: string): string;

export async function runAgent(input: string, policy: Policy) {
  const started = Date.now();
  const events: AuditEvent[] = [];

  for (let step = 0; step < policy.budget.maxSteps; step++) {
    if (Date.now() - started > policy.budget.maxSeconds * 1000) {
      events.push({ type: "budget.stop", reason: "time" });
      return { output: "stopped", events };
    }
    if (await killSwitchIsOn()) {
      events.push({ type: "kill", reason: "operator" });
      return { output: "stopped", events };
    }
    // maxUsd is enforced the same way once you track spend per model/tool call.

    const action = await llmDecideNext(input);
    if (action.type === "finish") return { output: action.text, events };

    if (!policy.allowTools.includes(action.tool)) {
      throw new Error(`tool not allowed: ${action.tool}`);
    }
    if (policy.requireApprovalFor.includes(action.tool)) {
      await waitForHumanApproval(action); // blocks until an operator approves
    }

    const t0 = Date.now();
    const obs = await callTool(action.tool, action.args); // must enforce timeouts + idempotency
    // In prod, log an args hash instead of raw args.
    events.push({ type: "tool.call", tool: action.tool, args: action.args, ms: Date.now() - t0 });

    input = updateState(input, action, obs); // keep state small, structured
  }

  events.push({ type: "budget.stop", reason: "steps" });
  return { output: "stopped", events };
}

What we measure (because “it seems fine” is not a metric)

If you want to run agents in production, measure the boring stuff:

  • completion rate (did it finish vs hit budget?)
  • p50/p95 runtime
  • p50/p95 tool calls per run
  • cost per run (tokens + tool credits)
  • loop rate (runs stopped by loop guard)
  • policy deny rate (how often your allowlist blocks it)

If you don’t measure policy denies, you’ll “fix” the agent by widening permissions instead of fixing the task.
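A rollup over per-run records is enough to get started. A sketch (the `RunRecord` shape is an assumption; it pairs with the stop reasons below):

```ts
type RunRecord = { stopReason: string };

// Boring rates computed from stop reasons. No dashboards required
// to get a first signal.
function rates(runs: RunRecord[]) {
  const total = runs.length || 1; // avoid divide-by-zero on empty windows
  const count = (pred: (r: RunRecord) => boolean) =>
    runs.filter(pred).length / total;
  return {
    completionRate: count((r) => r.stopReason === "finish"),
    loopRate: count((r) => r.stopReason === "loop_detected"),
    policyDenyRate: count((r) => r.stopReason.startsWith("policy_deny:")),
  };
}
```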

Stop reasons taxonomy (so you can debug without vibes)

If you ship agents without explicit stop reasons, your dashboards will be 100% vibes: “it didn’t work” → “it timed out” → “maybe the model was bad”.

We log a single stop_reason per run and we treat it like a contract. It’s the difference between:

  • “agent feels flaky”
  • “60% of runs stop on tool_timeout:http.get because the upstream is dying”

Common stop reasons we actually see:

  • finish
  • max_steps, max_seconds, max_usd
  • policy_deny:<tool>
  • approval_timeout
  • tool_timeout:<tool>
  • tool_error_exhausted:<tool>
  • tool_unhealthy:<tool> (circuit breaker open)
  • loop_detected
  • operator_kill
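Encoding the taxonomy as a type is one way to keep stop reasons a contract instead of free text (template literal types are a sketch, not a requirement):

```ts
type ToolName = "web.search" | "http.get" | "ticket.create";

type StopReason =
  | "finish"
  | "max_steps" | "max_seconds" | "max_usd"
  | `policy_deny:${ToolName}`
  | "approval_timeout"
  | `tool_timeout:${ToolName}`
  | `tool_error_exhausted:${ToolName}`
  | "loop_detected"
  | "operator_kill";

// Tool-scoped reasons carry the tool name, so dashboards can split
// "agent is flaky" into "http.get is dying".
function isToolScoped(r: StopReason): boolean {
  return r.includes(":");
}

const stop: StopReason = "tool_timeout:http.get"; // compiles; a typo would not
```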

Example event (this is the kind of boring line that saves a day later):

JSON
{
  "request_id": "req_9f2c",
  "tenant": "acme-prod",
  "steps": 25,
  "tool_calls": 17,
  "usd_estimate": 1.03,
  "stop_reason": "max_usd"
}

Yes, you can get fancy with “partial success” and “degraded mode”. Start with one stop reason. Make it consistent. Your on-call self will thank you.

Multi-tenant reality (where most incidents hide)

Multi-tenant agent systems fail in predictable ways:

  • wrong tenant context
  • cross-tenant caches
  • shared credentials
  • “global” tools that quietly access everything

Guardrails:

  • tenant id is set by the entry point, never by the model
  • caches are keyed by tenant + environment
  • credentials are scoped by tenant + environment
  • audit logs always include tenant id

If any of those are missing, you will eventually leak data.

Rate limits & circuit breakers (because agents amplify outages)

If a dependency is flaky, an agent is basically a failure amplifier: it retries, it searches for alternatives, it tries again, it “verifies”, it tries again.

This is how you turn:

  • “upstream API returns 500 for 2 minutes” into
  • “we sent 80k requests and got rate-limited for an hour”

We do three boring things:

  1. Per-tool concurrency caps (per tenant). Example: browser tool max 2 concurrent runs. Anything more is a self-DDoS.
  2. Rate limiting at the tool boundary. Not inside the model.
  3. Circuit breakers that fail fast when error rate spikes.

Pseudo code:

TS
const httpGet = rateLimit({ perTenantRps: 5 }, async (url: string) => {
  return fetch(url, { signal: AbortSignal.timeout(8000) });
});

const breaker = new CircuitBreaker({
  windowMs: 30_000,
  failureRate: 0.5,
  cooldownMs: 60_000,
});

const res = await breaker.exec(() => httpGet("https://api.example.com/health"));

When the breaker is open, we stop the run with a clear reason (tool_unhealthy:http.get), and we don’t pretend the model can “reason” its way through an outage. It can’t. It’ll just burn budget.

The rollout we use (because agents don’t deserve trust on day one)

Shipping to prod is not a binary switch.

We ship like this:

  1. internal users only
  2. read-only tools only
  3. small canary percentage
  4. gradually expand permissions (with approvals for writes)
  5. only then consider “autonomous” behavior

And yes: we keep the kill switch handy the entire time.
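The stages above can be encoded as policy presets, so "expand permissions" is a config change rather than a code change. The stage names, tool lists, and percentages here are assumptions for illustration:

```ts
type Stage = "internal" | "read_only" | "canary" | "expanded" | "autonomous";

type StagePolicy = {
  tools: string[];
  writesNeedApproval: boolean;
  trafficPct: number; // share of eligible traffic routed to the agent
};

const stagePolicy: Record<Stage, StagePolicy> = {
  internal:   { tools: ["web.search"],                            writesNeedApproval: true,  trafficPct: 0 },
  read_only:  { tools: ["web.search", "http.get"],                writesNeedApproval: true,  trafficPct: 100 },
  canary:     { tools: ["web.search", "http.get"],                writesNeedApproval: true,  trafficPct: 5 },
  expanded:   { tools: ["web.search", "http.get", "ticket.create"], writesNeedApproval: true,  trafficPct: 100 },
  autonomous: { tools: ["web.search", "http.get", "ticket.create"], writesNeedApproval: false, trafficPct: 100 },
};
```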

Where “memory” belongs (hint: not inside prompts)

If you store everything in the prompt, you get:

  • ballooning context windows
  • worse answers (the model drowns in noise)
  • higher cost

We prefer:

  • small structured scratchpad per run
  • artifacts stored externally (drafts, notes, citations)
  • optional long-term memory with strict scoping + TTL

Memory is a product feature. Treat it like one. Test it. Audit it. Scope it.
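A sketch of scoped long-term memory with TTL (the class and its API are assumptions; the `now` parameter exists so expiry is testable):

```ts
type MemoryEntry = { value: string; expiresAt: number };

class ScopedMemory {
  private store = new Map<string, MemoryEntry>();

  set(tenant: string, key: string, value: string, ttlMs: number, now = Date.now()) {
    // Keys are tenant-prefixed: memory scoping is structural, not polite.
    this.store.set(`${tenant}:${key}`, { value, expiresAt: now + ttlMs });
  }

  get(tenant: string, key: string, now = Date.now()): string | undefined {
    const e = this.store.get(`${tenant}:${key}`);
    if (!e || e.expiresAt <= now) return undefined; // expired memory never reaches prompts
    return e.value;
  }
}
```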

Incident response (what you want ready before the first page)

Agents will fail. The question is whether you can stop the damage fast.

Before you ship, make sure you can:

  • disable a tool (browser, email, payments) without deploying code
  • disable a tenant without taking down everyone else
  • find a single run by request id
  • replay a run in a safe environment
  • answer “what tool calls happened?” in under a minute

If you can’t do those, the first incident will be slow and painful.

Testing & replay (because “works on my prompt” isn’t a test)

The annoying truth: agent behavior changes when you change anything. Model version. Prompt. Tool schema. Upstream API responses. Even timeouts.

So we test the stack, not just the prompt:

  • record/replay tool responses in a sandbox (same inputs, stable outputs)
  • run a small suite of “golden” tasks on every deploy
  • assert on traces, not just the final text (steps, tools, stop_reason)

This caught real regressions for us:

  • a tool schema rename caused the agent to loop on validation errors
  • a retry tweak doubled tool calls (cost went up ~2× overnight)

If you can’t replay a run deterministically, debugging becomes archaeology.
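The record/replay piece can be sketched as a "tape" keyed by (tool, args): in record mode it captures live responses, in replay mode the same inputs return the same outputs without touching the network. Names here are illustrative, not a real library:

```ts
type Mode = "record" | "replay";

class ToolTape {
  private tape = new Map<string, string>();
  constructor(private mode: Mode) {}

  async call(tool: string, args: unknown, live: () => Promise<string>): Promise<string> {
    const key = `${tool}:${JSON.stringify(args)}`;
    if (this.mode === "replay") {
      const hit = this.tape.get(key);
      // A missing recording is a test failure, not a silent live call.
      if (hit === undefined) throw new Error(`no recording for ${key}`);
      return hit;
    }
    const out = await live();
    this.tape.set(key, out); // persist this map to disk in a real setup
    return out;
  }

  switchTo(mode: Mode) { this.mode = mode; }
}
```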

The “boring first” build order

If you’re starting from scratch, build in this order:

  1. tool wrapper (allowlist + timeouts + idempotency)
  2. budgets (steps/time) + stop reasons
  3. audit events (tool calls with args hash)
  4. kill switch
  5. only then: fancy planning, memory, multi-agent routing

Most teams do it backwards because demos reward “smart”. Production rewards “stops when things are weird”.

Real failure

We once shipped a “working” agent without structured audit events. Then it did something weird in production.

Postmortem timeline:

  • “it called the tool a lot”
  • “we think it retried”
  • “we can’t tell what arguments it used”

That cost ~half a day of engineering time, mostly arguing about what happened.

Fix:

  • every tool call emits a structured event (tool, args hash, duration, status)
  • a request id is threaded through everything
  • the kill switch is one click, not a code deploy

Trade-offs

  • More instrumentation = more code.
  • More policy = more “agent refused” cases.
  • It’s still cheaper than debugging blind.

When NOT to build an agent stack

If this is a one-off internal script that runs once a week, don’t over-engineer it. But if it touches production systems or real money, you need the stack, period.

Not sure this is your use case?

Design your agent ->
⏱️ 10 min read · Updated Mar 2026 · Difficulty: ★★★
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.