Problem (the "tiny change" that cost you a week)
You tweak a prompt because the agent sounded a little off.
It looks fine in dev.
Then production happens:
- tool calls per run creep up (nobody notices for a day)
- latency doubles under load
- the agent starts "helpfully" ignoring stop conditions
- a rare edge case becomes a daily incident
Prompt optimization is real engineering work, not copywriting. If you treat it like "edit text until it feels better", you'll ship regressions with a smile.
Why this fails in production
Prompts fail differently in prod because they're entangled with:
- tool schemas (one renamed field and the model "doesn't understand")
- budgets (token inflation makes loops more expensive)
- stop reasons (a prompt can accidentally encourage "one more try" forever)
- external variance (search results drift, APIs get flaky, rate limits hit)
The uncomfortable truth: you can't safely optimize a prompt without a harness. That harness doesn't need to be fancy. It needs to be consistent.
Diagram: the safe prompt pipeline
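The diagram itself isn't reproduced here; as a rough text sketch, the pipeline the rest of this page builds looks like:

```
edit prompt → bump version / prompt_id → run golden tasks
    → check invariants (stop_reason, tool_calls/run, tokens/run)
    → canary rollout → monitor spend/latency
         │
         └─ spike? → flip rollback switch (one config value)
```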
Real code: version your prompts like you version code (Python + JS)
This is the minimum that's worth doing:
- every prompt has a stable `prompt_id` (content hash or version string)
- every run logs the `prompt_id`
- you can roll back by switching one config value
```python
import hashlib
from dataclasses import dataclass
from typing import Any, Dict


def prompt_id(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    text: str

    @property
    def id(self) -> str:
        return f"{self.name}:{self.version}:{prompt_id(self.text)}"


class Logger:
    def event(self, name: str, fields: Dict[str, Any]) -> None: ...


def build_system_prompt(p: Prompt) -> str:
    # Keep the prompt boring and structured. Production loves boring.
    return (
        "You are a production agent. You must follow tool policies and budgets.\n"
        "Always stop with a stop_reason.\n\n"
        f"[prompt_id={p.id}]\n"
        + p.text.strip()
    )


def run_agent(task: str, *, prompt: Prompt, logger: Logger, budgets: Dict[str, Any]) -> Dict[str, Any]:
    sys = build_system_prompt(prompt)
    logger.event("agent_start", {"prompt_id": prompt.id, "budget": budgets})
    # ... call LLM + tools (not shown) ...
    # Make sure your trace includes prompt_id. Otherwise you can't compare runs.
    return {"output": "ok", "prompt_id": prompt.id, "stop_reason": "finish"}


# --- usage ---
PROMPTS = {
    "support:v12": Prompt("support", "v12", "Answer using KB. If unsure, ask a clarifying question."),
    "support:v13": Prompt("support", "v13", "Answer using KB. Cite tool results. If unsure, ask a clarifying question."),
}
ACTIVE_PROMPT = PROMPTS["support:v13"]  # roll back by changing this one line (or env var)
```

```js
import crypto from "node:crypto";

export function promptId(text) {
  return crypto.createHash("sha256").update(text, "utf8").digest("hex").slice(0, 12);
}

export function buildSystemPrompt(prompt) {
  return [
    "You are a production agent. You must follow tool policies and budgets.",
    "Always stop with a stop_reason.",
    "",
    "[prompt_id=" + prompt.id + "]",
    prompt.text.trim(),
  ].join("\n");
}

export function makePrompt({ name, version, text }) {
  const id = `${name}:${version}:${promptId(text)}`;
  return { name, version, text, id };
}

export function runAgent(task, { prompt, logger, budgets }) {
  const sys = buildSystemPrompt(prompt);
  logger.event("agent_start", { prompt_id: prompt.id, budget: budgets });
  // ... call LLM + tools (not shown) ...
  return { output: "ok", prompt_id: prompt.id, stop_reason: "finish" };
}

// --- usage ---
const PROMPTS = {
  "support:v12": makePrompt({ name: "support", version: "v12", text: "Answer using KB. If unsure, ask a clarifying question." }),
  "support:v13": makePrompt({ name: "support", version: "v13", text: "Answer using KB. Cite tool results. If unsure, ask a clarifying question." }),
};
const ACTIVE_PROMPT = PROMPTS["support:v13"]; // rollback by switching one config value
```

Now combine this with a small golden-task set and a couple of invariants:
- did `stop_reason` change?
- did `tool_calls/run` spike?
- did tokens per request inflate?
That's how you stop "one innocent prompt edit" from turning into a cost leak.
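As a minimal sketch of those three invariants (the `Run` record and thresholds here are illustrative, not from a specific library):

```python
# Golden-task invariant check: compare a candidate prompt's runs against a
# baseline. Thresholds mirror the golden-task config used later in this page.
from dataclasses import dataclass


@dataclass
class Run:
    prompt_id: str
    stop_reason: str
    tool_calls: int
    tokens: int


def check_invariants(baseline: Run, candidate: Run, *,
                     max_tool_calls: int = 6, max_tokens: int = 9000) -> list:
    """Return a list of violations; an empty list means the edit looks safe."""
    violations = []
    if candidate.stop_reason != baseline.stop_reason:
        violations.append(f"stop_reason changed: {baseline.stop_reason} -> {candidate.stop_reason}")
    if candidate.tool_calls > max_tool_calls:
        violations.append(f"tool_calls spiked: {candidate.tool_calls} > {max_tool_calls}")
    if candidate.tokens > max_tokens:
        violations.append(f"tokens inflated: {candidate.tokens} > {max_tokens}")
    return violations


# Numbers from the incident below: a "thorough" prompt doubles tokens and tools.
old = Run("support:v12:abc123", "finish", tool_calls=4, tokens=7500)
new = Run("support:v13:def456", "finish", tool_calls=11, tokens=14200)
print(check_invariants(old, new))  # flags the tool-call and token spikes
```

Run this in CI against every golden task before the canary ever sees traffic.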
Real failure (incident-style, with numbers)
We "improved" a support agent prompt by adding a long "be thorough" instruction.
Nothing crashed. But the model started doing what we asked: it became thorough.
Impact over 36 hours:
- p95 tokens/run: 7.5k → 14.2k
- avg tool calls/run: 4 â 11
- spend: +$620 (tokens + tool credits)
- on-call time: ~2 hours to prove it was prompt-driven, not traffic
Fix:
- cap budgets (steps/tool calls/tokens)
- add a golden-task invariant: max tool calls and max tokens per run
- move "thoroughness" into a conditional rule (only after a tool result is missing)
Prompt optimization works. It just works both ways.
Trade-offs
- Aggressive prompts can improve quality and also increase spend. You don't get both for free.
- Short prompts are fast but often under-specify safety rules.
- "Smart" instructions tend to hide failure modes. Explicit contracts are uglier and safer.
When NOT to optimize prompts
Don't prompt-optimize your way out of:
- missing budgets (fix `/governance/budget-controls`)
- missing tool validation (`/tools/input-validation`)
- missing logging (`/observability-monitoring/agent-logging`)
- missing tests (`/testing-evaluation/unit-testing-agents`)
If your agent is unstable, prompt changes will just random-walk the instability.
Copy-paste checklist
- [ ] Stable `prompt_id` logged on every run
- [ ] Golden tasks (10–50) that mirror real user traffic
- [ ] Invariants: stop_reason present, tool_calls/run bound, token bound
- [ ] Canary rollout + rollback switch
- [ ] Monitor: spend/run, tool_calls/run, latency/run
- [ ] One post-incident golden task per incident (yes, it's annoying; do it anyway)
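The canary rollout + rollback switch from the checklist can be as small as a deterministic router; this is a hypothetical sketch (the function name and percentages are illustrative):

```python
# Canary router: hash the user id into a stable bucket so a given user always
# sees the same prompt version, and keep one switch that sends everyone back.
import hashlib


def pick_prompt(user_id: str, *, active: str, canary: str,
                canary_pct: int = 5, rollback: bool = False) -> str:
    if rollback:
        return active  # one config flip routes 100% of traffic to the old prompt
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return canary if bucket < canary_pct else active


# 5% of users get v13; flipping rollback=True is the instant escape hatch.
choice = pick_prompt("user-42", active="support:v12", canary="support:v13")
print(choice)
```

Deterministic bucketing matters: random per-request routing would mix prompt versions inside one conversation and muddy your per-`prompt_id` metrics.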
Safe default config snippet (YAML)
```yaml
prompts:
  active: "support:v13"
  rollback: "support:v12"
  require_prompt_id: true

testing:
  golden_tasks:
    - id: "kb_lookup"
      expect_stop_reason: "finish"
      max_tool_calls: 6
      max_tokens: 9000

budgets:
  max_steps: 25
  max_tool_calls: 12
  max_seconds: 60

observability:
  log_prompt_id: true
  alert_on_token_spike: true
```
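Wiring that config to the rollback switch is a few lines; this sketch assumes the YAML has already been loaded into a plain dict (the loader and the prompt registry shown are illustrative):

```python
# Resolve the active prompt from config, with rollback as a one-value change.
# `config` mirrors the YAML's prompts section; loading it is left out.
config = {
    "prompts": {"active": "support:v13", "rollback": "support:v12", "require_prompt_id": True},
}

PROMPTS = {
    "support:v12": "Answer using KB. If unsure, ask a clarifying question.",
    "support:v13": "Answer using KB. Cite tool results. If unsure, ask a clarifying question.",
}


def active_prompt(config: dict, *, rolled_back: bool = False) -> str:
    key = config["prompts"]["rollback" if rolled_back else "active"]
    if key not in PROMPTS:
        # Fail at startup, not mid-incident: a typo'd version should never ship.
        raise KeyError(f"unknown prompt version: {key}")
    return PROMPTS[key]


print(active_prompt(config))                    # v13 text
print(active_prompt(config, rolled_back=True))  # v12 text
```

Failing fast on an unknown version is the point: the rollback target must be validated before you need it.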
Implement in OnceOnly (optional)
```python
# onceonly-python: budgets + safe rollout guardrails
import os

from onceonly import OnceOnly

client = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"])
agent_id = "support-bot"

# Set budgets/limits before you ship prompt changes
client.gov.upsert_policy({
    "agent_id": agent_id,
    "max_actions_per_hour": 200,
    "max_spend_usd_per_day": 50.0,
    "max_calls_per_tool": {"kb.search": 6},
    "allowed_tools": ["kb.search", "send_email"],
})

# After rollout, watch for spend/tool spikes
m = client.gov.agent_metrics(agent_id, period="day")
print("actions=", m.total_actions, "spend_usd=", m.total_spend_usd)
```
FAQ
Q: Should I A/B test prompts in production?
A: Only if rollback is instant and you're watching spend/run and tool_calls/run. Otherwise it's just a controlled incident.
Related pages
- Foundations: How agents use tools · Planning vs reactive agents
- Failures: Hallucinated sources · Budget explosion
- Governance: Budget controls · Tool permissions
- Observability: AI agent logging
- Testing: Unit testing agents