Prompt Optimization for AI Agents (Without Breaking Production)

How to optimize agent prompts safely: versioning, regression tests, cost/latency budgets, and rollouts. Includes Python + JS examples.
On this page
  1. Problem (the “tiny change” that cost you a week)
  2. Why this fails in production
  3. Diagram: the safe prompt pipeline
  4. Real code: version your prompts like you version code (Python + JS)
  5. Real failure (incident-style, with numbers)
  6. Trade-offs
  7. When NOT to optimize prompts
  8. Copy-paste checklist
  9. Safe default config snippet (YAML)
  10. Implement in OnceOnly (optional)
  11. FAQ

Problem (the “tiny change” that cost you a week)

You tweak a prompt because the agent sounded a little off.

It looks fine in dev.

Then production happens:

  • tool calls per run creep up (nobody notices for a day)
  • latency doubles under load
  • the agent starts “helpfully” ignoring stop conditions
  • a rare edge case becomes a daily incident

Prompt optimization is real engineering work, not copywriting. If you treat it like “edit text until it feels better”, you’ll ship regressions with a smile.

Why this fails in production

Prompts fail differently in prod because they’re entangled with:

  • tool schemas (one renamed field and the model “doesn’t understand”)
  • budgets (token inflation makes loops more expensive)
  • stop reasons (a prompt can accidentally encourage “one more try” forever)
  • external variance (search results drift, APIs get flaky, rate limits hit)

The uncomfortable truth: you can’t safely optimize a prompt without a harness. That harness doesn’t need to be fancy. It needs to be consistent.

Diagram: the safe prompt pipeline

In short: edit the prompt → run golden tasks → check invariants (stop reasons, tool calls, tokens) → canary rollout → monitor spend and latency → promote, or roll back by flipping one config value.

Real code: version your prompts like you version code (Python + JS)

This is the minimum that’s worth doing:

  • every prompt has a stable prompt_id (content hash or version string)
  • every run logs the prompt_id
  • you can roll back by switching one config value
PYTHON
import hashlib
from dataclasses import dataclass
from typing import Any, Dict


def prompt_id(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    text: str

    @property
    def id(self) -> str:
        return f"{self.name}:{self.version}:{prompt_id(self.text)}"


class Logger:
    def event(self, name: str, fields: Dict[str, Any]) -> None: ...


def build_system_prompt(p: Prompt) -> str:
    # Keep the prompt boring and structured. Production loves boring.
    return (
        "You are a production agent. You must follow tool policies and budgets.\n"
        "Always stop with a stop_reason.\n\n"
        f"[prompt_id={p.id}]\n"
        + p.text.strip()
    )


def run_agent(task: str, *, prompt: Prompt, logger: Logger, budgets: Dict[str, Any]) -> Dict[str, Any]:
    sys = build_system_prompt(prompt)
    logger.event("agent_start", {"prompt_id": prompt.id, "budget": budgets})

    # ... call LLM + tools (not shown) ...
    # Make sure your trace includes prompt_id. Otherwise you can’t compare runs.
    return {"output": "ok", "prompt_id": prompt.id, "stop_reason": "finish"}


# --- usage ---
PROMPTS = {
    "support:v12": Prompt("support", "v12", "Answer using KB. If unsure, ask a clarifying question."),
    "support:v13": Prompt("support", "v13", "Answer using KB. Cite tool results. If unsure, ask a clarifying question."),
}

ACTIVE_PROMPT = PROMPTS["support:v13"]  # roll back by changing this one line (or env var)
JAVASCRIPT
import crypto from "node:crypto";

export function promptId(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex").slice(0, 12);
}

export function buildSystemPrompt(prompt) {
  return [
    "You are a production agent. You must follow tool policies and budgets.",
    "Always stop with a stop_reason.",
    "",
    "[prompt_id=" + prompt.id + "]",
    prompt.text.trim(),
  ].join("\n");
}

export function makePrompt({ name, version, text }) {
  const id = `${name}:${version}:${promptId(text)}`;
  return { name, version, text, id };
}

export function runAgent(task, { prompt, logger, budgets }) {
  const sys = buildSystemPrompt(prompt);
  logger.event("agent_start", { prompt_id: prompt.id, budget: budgets });

  // ... call LLM + tools (not shown) ...
  return { output: "ok", prompt_id: prompt.id, stop_reason: "finish" };
}

// --- usage ---
const PROMPTS = {
  "support:v12": makePrompt({ name: "support", version: "v12", text: "Answer using KB. If unsure, ask a clarifying question." }),
  "support:v13": makePrompt({ name: "support", version: "v13", text: "Answer using KB. Cite tool results. If unsure, ask a clarifying question." }),
};

const ACTIVE_PROMPT = PROMPTS["support:v13"]; // rollback by switching one config value
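The "roll back by switching one config value" comment above can be made concrete with an environment variable. A minimal sketch, assuming a registry keyed by `name:version` (the `ACTIVE_PROMPT` variable name is an illustrative choice, not part of the snippets above):

```python
import os

# Hypothetical set of known prompt versions, mirroring the PROMPTS maps above.
KNOWN_PROMPTS = {"support:v12", "support:v13"}


def select_active_prompt(env=None, default="support:v13"):
    """Pick the active prompt key from an env var, so rollback is a config change."""
    env = os.environ if env is None else env
    key = env.get("ACTIVE_PROMPT", default)
    if key not in KNOWN_PROMPTS:
        # Fail loudly: a silent fallback would hide a bad rollout.
        raise KeyError(f"unknown prompt version: {key!r}")
    return key
```

Failing on an unknown key is deliberate: a typo in the rollback value should break deploy, not quietly serve the wrong prompt.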

Now combine this with a small golden-task set and a couple of invariants:

  • did stop_reason change?
  • did tool_calls/run spike?
  • did tokens per request inflate?

That’s how you stop “one innocent prompt edit” from turning into a cost leak.
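Those three invariants can be expressed as a small gate over aggregated run metrics. A sketch, assuming each side is summarized into a dict with `stop_reasons`, `avg_tool_calls`, and `avg_tokens` (the field names and threshold ratios are illustrative, not recommendations):

```python
def check_invariants(baseline, candidate, max_tool_ratio=1.5, max_token_ratio=1.3):
    """Compare candidate-prompt metrics against the baseline prompt.

    Returns a list of violation strings; an empty list means the change passes.
    """
    violations = []
    # Invariant 1: the distribution of stop reasons should not change.
    if baseline["stop_reasons"] != candidate["stop_reasons"]:
        violations.append("stop_reason distribution changed")
    # Invariant 2: tool calls per run must stay within a bounded ratio.
    if candidate["avg_tool_calls"] > baseline["avg_tool_calls"] * max_tool_ratio:
        violations.append("tool_calls/run spiked")
    # Invariant 3: tokens per run must not inflate past the bound.
    if candidate["avg_tokens"] > baseline["avg_tokens"] * max_token_ratio:
        violations.append("tokens/run inflated")
    return violations
```

Run this in CI over the golden-task set; a non-empty list blocks the prompt change before it reaches a canary.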

Real failure (incident-style, with numbers)

We “improved” a support agent prompt by adding a long “be thorough” instruction.

Nothing crashed. But the model started doing what we asked: it became thorough.

Impact over 36 hours:

  • p95 tokens/run: 7.5k → 14.2k
  • avg tool calls/run: 4 → 11
  • spend: +$620 (tokens + tool credits)
  • on-call time: ~2 hours to prove it was prompt-driven, not traffic

Fix:

  1. cap budgets (steps/tool calls/tokens)
  2. add a golden-task invariant: max tool calls and max tokens per run
  3. move “thoroughness” into a conditional rule (only after a tool result is missing)

Prompt optimization works. It just works both ways.
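Fix step 1, capping budgets, can live as a tiny guard inside the agent loop. A sketch under assumed counter and budget key names (mirroring the budget fields used elsewhere on this page):

```python
def budget_stop_reason(counters, budgets):
    """Return a stop_reason string when any budget is exhausted, else None.

    Call this at the top of every agent-loop iteration; a non-None result
    means the loop must stop and log that reason.
    """
    if counters["steps"] >= budgets["max_steps"]:
        return "budget_steps"
    if counters["tool_calls"] >= budgets["max_tool_calls"]:
        return "budget_tool_calls"
    if counters["tokens"] >= budgets["max_tokens"]:
        return "budget_tokens"
    return None
```

The point is that "thorough" prompts can no longer buy unbounded work: the hard cap produces an explicit stop_reason that shows up in traces and invariant checks.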

Trade-offs

  • Aggressive prompts can improve quality and also increase spend. You don’t get both for free.
  • Short prompts are fast but often under-specify safety rules.
  • “Smart” instructions tend to hide failure modes. Explicit contracts are uglier and safer.

When NOT to optimize prompts

Don’t prompt-optimize your way out of:

  • missing budgets (fix /governance/budget-controls)
  • missing tool validation (/tools/input-validation)
  • missing logging (/observability-monitoring/agent-logging)
  • missing tests (/testing-evaluation/unit-testing-agents)

If your agent is unstable, prompt changes will just random-walk the instability.

Copy-paste checklist

  • [ ] Stable prompt_id logged on every run
  • [ ] Golden tasks (10–50) that mirror real user traffic
  • [ ] Invariants: stop_reason present, tool_calls/run bound, token bound
  • [ ] Canary rollout + rollback switch
  • [ ] Monitor: spend/run, tool_calls/run, latency/run
  • [ ] One post-incident golden task per incident (yes, it’s annoying; do it anyway)

Safe default config snippet (YAML)

YAML
prompts:
  active: "support:v13"
  rollback: "support:v12"
  require_prompt_id: true
testing:
  golden_tasks:
    - id: "kb_lookup"
      expect_stop_reason: "finish"
      max_tool_calls: 6
      max_tokens: 9000
budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60
observability:
  log_prompt_id: true
  alert_on_token_spike: true
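It is cheap to validate that config at startup rather than discover a gap mid-incident. A sketch operating on the parsed YAML as a dict (field names follow the snippet above; the checks themselves are illustrative):

```python
def validate_config(cfg):
    """Fail fast on config that would make a prompt change un-debuggable."""
    errors = []
    prompts = cfg["prompts"]
    # A rollback target identical to the active prompt is not a rollback.
    if prompts.get("active") == prompts.get("rollback"):
        errors.append("rollback prompt must differ from the active prompt")
    if not prompts.get("require_prompt_id"):
        errors.append("require_prompt_id should be true in production")
    # Every golden task needs hard bounds, or the invariants have no teeth.
    for task in cfg.get("testing", {}).get("golden_tasks", []):
        if "max_tool_calls" not in task or "max_tokens" not in task:
            errors.append(f"golden task {task.get('id')!r} is missing bounds")
    return errors
```

Wire it into deploy so a non-empty error list blocks the rollout, the same way the golden-task invariants block a bad prompt.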

Implement in OnceOnly (optional)

Gate prompt changes behind budgets, stop reasons, and logging:
# onceonly-python: budgets + safe rollout guardrails
import os
from onceonly import OnceOnly

client = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"])
agent_id = "support-bot"

# Set budgets/limits before you ship prompt changes
client.gov.upsert_policy({
    "agent_id": agent_id,
    "max_actions_per_hour": 200,
    "max_spend_usd_per_day": 50.0,
    "max_calls_per_tool": {"kb.search": 6},
    "allowed_tools": ["kb.search", "send_email"],
})

# After rollout, watch for spend/tool spikes
m = client.gov.agent_metrics(agent_id, period="day")
print("actions=", m.total_actions, "spend_usd=", m.total_spend_usd)

FAQ

Should I A/B test prompts in production?
Only if you can roll back fast and you’re monitoring spend/run and tool_calls/run. Otherwise it’s just a controlled incident.
What’s a good ‘golden task’ set size?
Start with 10–20 tasks that reflect real traffic. Add one task per incident. You’ll get to 50 naturally.
What do I assert on for agent prompts?
Stop reason taxonomy, tool call counts, budget stops, and basic output constraints. Don’t assert on exact prose.
Can prompts replace guardrails?
No. Prompts reduce probability. Guardrails change what’s possible.

⏱ 5 min read ‱ Updated Mar 2026 ‱ Difficulty: ★★☆
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.