Problem (why you're here)
In dev, your agent "works".
In prod, it does something weird once every 200 runs:
- a customer report says "it emailed the wrong thing"
- costs spike for 15 minutes
- the agent loops on a flaky API and times out
And you've got… basically nothing:
- one "final answer"
- a few console logs
- maybe a tool error string without context
So you do the worst kind of debugging: guesswork with a credit card attached.
This page is about logging that makes incidents boring again.
Why this fails in production
Agents fail like distributed systems because they are distributed systems:
- the model is an unreliable planner
- tools are side effects (HTTP/DB/ticketing/email)
- retries and timeouts create emergent behavior
If you don't log the loop, you can't answer basic incident questions:
- What tool calls happened? In what order?
- What arguments were used (or at least which args_hash)?
- What did the tool return (or what did we redact)?
- Why did the run stop (stop_reason)?
- Which user/request triggered it?
If you're not logging stop_reason, you're not "observing" anything. You're collecting vibes.
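Concretely, an event that can answer those questions is one flat JSON object that carries the shared IDs every time. A minimal sketch (field names are illustrative, not a standard):

```python
import json
import time

def make_event(name, run_id, trace_id, **fields):
    # One flat JSON object per event; the same IDs appear on every event,
    # so a single grep on run_id reconstructs the whole run.
    evt = {"event": name, "ts": time.time(), "run_id": run_id, "trace_id": trace_id}
    evt.update(fields)
    return json.dumps(evt, sort_keys=True)

# Example: the run-end event that records why the run stopped.
line = make_event("run_end", run_id="r-123", trace_id="t-456",
                  stop_reason="max_steps", tool_calls=7)
```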
Diagram: the minimum event pipeline
Real code: instrument the tool gateway (Python + JS)
Start with the boundary. Tools are where the money and damage live.
We log:
- run_id, trace_id, tool_name
- args_hash (not raw args by default)
- latency + status
- error_class (normalized)
And we make it hard to "forget" to log by forcing everything through a gateway.
```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


def stable_hash(obj: Any) -> str:
    raw = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass(frozen=True)
class RunCtx:
    run_id: str
    trace_id: str
    user_id: Optional[str] = None
    request_id: Optional[str] = None


class Logger:
    def event(self, name: str, fields: Dict[str, Any]) -> None: ...


class ToolGateway:
    def __init__(self, *, impls: Dict[str, Callable[..., Any]], logger: Logger):
        self.impls = impls
        self.logger = logger

    def call(self, ctx: RunCtx, name: str, args: Dict[str, Any]) -> Any:
        fn = self.impls.get(name)
        if not fn:
            self.logger.event("tool_call", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "args_hash": stable_hash(args),
                "ok": False,
                "error_class": "unknown_tool",
            })
            raise RuntimeError(f"unknown tool: {name}")

        t0 = time.time()
        self.logger.event("tool_call", {
            "run_id": ctx.run_id,
            "trace_id": ctx.trace_id,
            "tool": name,
            "args_hash": stable_hash(args),
        })
        try:
            out = fn(**args)
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": True,
            })
            return out
        except TimeoutError:
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": False,
                "error_class": "timeout",
            })
            raise
        except Exception as e:
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": False,
                "error_class": type(e).__name__,
            })
            raise
```

```js
import crypto from "node:crypto";

export function stableHash(obj) {
  // Note: JSON.stringify is not key-order stable; sort keys first if the
  // same logical args can arrive with different property order.
  const raw = JSON.stringify(obj);
  return crypto.createHash("sha256").update(raw).digest("hex");
}

export class ToolGateway {
  constructor({ impls = {}, logger }) {
    this.impls = impls;
    this.logger = logger;
  }

  call(ctx, name, args) {
    const fn = this.impls[name];
    const argsHash = stableHash(args);
    if (!fn) {
      this.logger.event("tool_call", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        args_hash: argsHash,
        ok: false,
        error_class: "unknown_tool",
      });
      throw new Error("unknown tool: " + name);
    }
    const t0 = Date.now();
    this.logger.event("tool_call", {
      run_id: ctx.run_id,
      trace_id: ctx.trace_id,
      tool: name,
      args_hash: argsHash,
    });
    try {
      const out = fn(args);
      this.logger.event("tool_result", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        latency_ms: Date.now() - t0,
        ok: true,
      });
      return out;
    } catch (e) {
      this.logger.event("tool_result", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        latency_ms: Date.now() - t0,
        ok: false,
        error_class: e?.name || "Error",
      });
      throw e;
    }
  }
}
```

If you're not already doing it, pair this with:
- budgets (/en/governance/budget-controls)
- tool dedupe to reduce spam (/en/failures/tool-spam)
- and unit tests that assert your stop reasons don't drift (/en/testing-evaluation/unit-testing-agents)
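The Logger above is only an interface. A minimal concrete sketch (the JSON-lines format and file path are assumptions, not part of any library) appends one JSON object per line:

```python
import json
import time

class JsonlLogger:
    """Appends one JSON object per line; trivial to grep during an incident."""

    def __init__(self, path):
        self.path = path

    def event(self, name, fields):
        # Same shape as the gateway emits: event name + timestamp + fields.
        record = {"event": name, "ts": time.time(), **fields}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
```

Wire it in as ToolGateway(impls=..., logger=JsonlLogger("events.jsonl")); swapping in a real log pipeline later only touches this one class.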
Real failure (incident-style, with numbers)
We shipped a "read-only" research agent that called http.get.
Everything looked fine until an upstream partner API started returning 200s with error payloads (yep). Our tool wrapper treated "200 == ok" and logged only "success".
Impact:
- ~18% of runs returned confidently wrong summaries for ~2 hours
- users filed ~30 tickets
- on-call time: ~4 hours to confirm it wasn't "the model hallucinating"
The fix was boring and effective:
- log normalized error_class and response validation failures
- store args_hash + latency so we could find hot spots
- add an alert: validation_fail_rate > 2% for 5 minutes
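The validation piece doesn't need a framework. A sketch of the normalization we mean (the expected_keys check is a stand-in for whatever schema your payloads actually need):

```python
def classify_response(status_code, payload, expected_keys=("data",)):
    # Normalize every outcome into an error_class; "200 == ok" is not enough,
    # because some upstreams return 200 with an error body.
    if status_code >= 500:
        return "upstream_5xx"
    if status_code >= 400:
        return "upstream_4xx"
    if not isinstance(payload, dict) or payload.get("error"):
        return "validation_fail"
    if not all(k in payload for k in expected_keys):
        return "validation_fail"
    return None  # genuinely ok

classify_response(200, {"error": "rate limited"})  # -> "validation_fail"
```

Log the returned error_class on the tool_result event and the validation_fail_rate alert above falls out for free.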
You don't need perfect logs. You need logs that answer "what happened?" in under 10 minutes.
Trade-offs
- Logging raw tool args is useful and also how you leak PII. Default to args_hash.
- Storing full tool results makes debugging easy and compliance painful. Prefer sampling + redaction.
- Too much logging is its own outage. Start with events you alert on.
When NOT to do this
- If the agent runs only in a trusted local environment, you can be lazier (for a while).
- If you're still prototyping the loop shape daily, keep logs lightweight but consistent (IDs + stop reasons).
- Don't build a custom tracing system if you can't keep it running. Use something boring.
Copy-paste checklist
- [ ] run_id, trace_id, request_id, user_id on every event
- [ ] tool_call + tool_result events (name, args_hash, latency, ok, error_class)
- [ ] stop_reason + budgets at end of run
- [ ] Redaction policy (PII, secrets) + default to storing hashes
- [ ] Alerts: spikes in tool calls/run, timeouts, validation fails
- [ ] One "incident query" per top failure (saved search / dashboard)
Safe default config snippet (YAML)
```yaml
logging:
  ids:
    run_id: required
    trace_id: required
    request_id: required
  tool_calls:
    enabled: true
    store_args: false
    store_args_hash: true
    store_results: "sampled"  # none|sampled|full
    result_sample_rate: 0.01
  pii:
    redact_fields: ["email", "phone", "token", "authorization", "cookie"]
  stop_reasons:
    enabled: true
  alerts:
    tool_calls_per_run_p95: { warn: 10, critical: 20 }
    timeout_rate: { warn: 0.02, critical: 0.05 }
    validation_fail_rate: { warn: 0.02, critical: 0.05 }
```
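For the rare cases where you do store args or results, a redaction pass matching the pii.redact_fields list above can be a small recursive function (a sketch; real redaction also needs to catch secrets embedded inside string values):

```python
REDACT_FIELDS = {"email", "phone", "token", "authorization", "cookie"}

def redact(obj):
    # Recursively replace values of sensitive keys before anything is logged.
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if k.lower() in REDACT_FIELDS else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj
```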
Implement in OnceOnly (optional)
```python
# onceonly-python: governed audit logs + metrics
import os
from onceonly import OnceOnly

client = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"])
agent_id = "support-bot"

# Pull last 50 actions (includes args_hash + decisions)
for e in client.gov.agent_logs(agent_id, limit=50):
    print(e.ts, e.tool, e.decision, e.args_hash, e.spend_usd, e.reason)

# Rollups for dashboards/alerts
m = client.gov.agent_metrics(agent_id, period="day")
print("spend_usd=", m.total_spend_usd, "blocked=", m.blocked_actions)
```
FAQ
Q: Should I log raw tool args?
A: Default to no. Log args_hash + safe fields. Flip raw args on only for short incident windows (with redaction), then turn it off again.
Q: What's the single most useful field?
A: A stable run_id/trace_id on every event.
Q: How do I detect loops quickly?
A: Alert on tool_calls/run and repeated (tool, args_hash) within a run. If you haven't read /failures/tool-spam, do that next.
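That repeated-pair check is a few lines over the events you already emit (the threshold of 3 is illustrative):

```python
from collections import Counter

def loop_suspects(events, threshold=3):
    # Count repeated (tool, args_hash) pairs among a single run's tool_call
    # events; the same tool with the same args N times is a likely loop.
    counts = Counter((e["tool"], e["args_hash"])
                     for e in events if e.get("event") == "tool_call")
    return [pair for pair, n in counts.items() if n >= threshold]
```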
Q: Do I need distributed tracing?
A: If your tools hit other services, yes. Start with trace IDs + spans around tool calls before going fancy.
Related pages
- Foundations: Tool calling · What makes an agent production-ready
- Failures: Tool spam · Budget explosion
- Governance: Budget controls · Kill switch design
- Testing: Unit testing agents
- Production stack: AI Agent Production Stack