Quick take: Without observability, every agent failure becomes “the model was weird” — an untestable, unfixable diagnosis. You need: tool call traces, stop reasons, cost tracking, and replay capability. This isn’t optional infrastructure.
You'll learn: Minimum monitoring requirements • One unified event schema • Stop reason taxonomy • Replay basics • A concrete incident you can recognize
Without monitoring: users report issues first • debugging by vibes • no replay
With minimal monitoring: detect drift early • debug via traces + stop reasons • replay last runs
Impact: faster incident response + fewer repeated failures (because you can fix root cause)
Problem-first intro
An agent run goes wrong.
A user reports: “it sent the wrong email.”
You open logs and you have:
- The final answer text (maybe)
- A stack trace (maybe)
- Vibes (definitely)
If you can’t answer these five questions, the system isn’t operable:
- Which tools were called (and in what order)?
- With what arguments (or at least args hashes)?
- What came back (or at least snapshot hashes)?
- Which version of model/prompt/tools was running?
- Why did it stop?
That’s not “missing dashboards”. That’s “this system is not operable”.
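As a sketch, here is one structured record that answers all five questions at once. Field names are illustrative (including the `versions` block, which the schema later on this page does not yet carry):

```json
{
  "run_id": "run_8f3a",
  "kind": "stop",
  "stop_reason": "budget_exceeded",
  "steps": [
    {"step_id": 1, "tool": "db.query", "args_sha": "9b1c24aa…", "result_sha": "77e0b1c2…", "status": "ok"},
    {"step_id": 2, "tool": "http.get", "args_sha": "04fd98e1…", "result_sha": "c2a77d10…", "status": "error"}
  ],
  "versions": {"model": "m-2025-01", "prompt": "v14", "tools": "v3"}
}
```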
The 03:00 moment
This is what “no monitoring” feels like:
03:12 — Support: "Agent emailed the wrong customer. Please stop it."
You grep logs and find… nothing you can join.
```
2026-02-07T03:11:58Z INFO sent email to customer@example.com
2026-02-07T03:11:59Z INFO sent email to customer@example.com
2026-02-07T03:12:01Z WARN http.get 429
2026-02-07T03:12:03Z INFO Agent completed task
```
No run_id. No step trace. No stop reason. No tool args hash. No model/tool version.
So you do the worst kind of debugging: grep by an email address and pray it’s unique.
Why this fails in production
1) Agents are distributed systems with extra steps
Once an agent calls tools, you’ve built:
- multiple dependencies (HTTP, DB, APIs)
- multiple failure modes (timeouts, 502s, rate limits)
- multiple retries (and retry storms)
If you don’t log each step, you’re debugging by storytelling.
2) “Success rate” hides the interesting failures
Drift shows up as:
- higher tool calls per run (looping, not failing)
- higher tokens per request (the model “explaining” errors)
- longer latency (retries, slow tools)
- different stop reasons (budgets, denials, timeouts)
3) You can’t fix what you can’t replay
If you can’t replay or even reconstruct a run from logs, you can’t trust a “fix”. You’ll just be guessing.
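A minimal sketch of what "reconstruct a run from logs" means, assuming events are JSON lines in the unified schema shown later on this page:

```python
import json

def reconstruct_run(log_lines: list[str], run_id: str) -> dict:
    """Group structured events by run_id and order tool calls by step_id."""
    events = [json.loads(line) for line in log_lines]
    mine = [e for e in events if e.get("run_id") == run_id]
    steps = sorted(
        (e for e in mine if e["kind"] == "tool_result"),
        key=lambda e: e["step_id"],
    )
    stop = next((e for e in mine if e["kind"] == "stop"), None)
    return {"steps": steps, "stop_reason": stop["stop_reason"] if stop else None}

log = [
    '{"run_id": "r1", "kind": "tool_result", "step_id": 2, "tool": "http.get"}',
    '{"run_id": "r1", "kind": "tool_result", "step_id": 1, "tool": "db.query"}',
    '{"run_id": "r1", "kind": "stop", "stop_reason": "budget_exceeded"}',
]
run = reconstruct_run(log, "r1")
# tool names in call order, plus why the run ended
print([s["tool"] for s in run["steps"]], run["stop_reason"])
```

If this function can't be written against your logs, neither can a postmortem.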
4) Monitoring is part of governance
Budgets, allowlists, and kill switches are useless if you can’t see when they triggered.
Hard invariants (non-negotiables)
- Every run has a `run_id`.
- Every step has a `step_id`.
- Every tool call logs: tool name, args hash, duration, status, error class.
- Every run ends with a stop event carrying a `stop_reason`.
- If you can’t replay (even partially), you can’t trust a fix.
Implementation example (real code)
The common failure here is having two different log formats:
- tool events are structured
- stop events are “special”
That kills joinability.
This sample uses one unified event schema for tool calls and stop events.
```python
from __future__ import annotations

from dataclasses import dataclass, asdict
import hashlib
import json
import time
from typing import Any, Literal

EventKind = Literal["tool_result", "stop"]


def sha(obj: Any) -> str:
    """Stable short content hash: same args -> same hash, across runs."""
    raw = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:24]


@dataclass(frozen=True)
class Event:
    run_id: str
    kind: EventKind
    ts_ms: int
    # optional fields
    step_id: int | None = None
    tool: str | None = None
    args_sha: str | None = None
    duration_ms: int | None = None
    status: Literal["ok", "error"] | None = None
    error: str | None = None
    stop_reason: str | None = None
    usage: dict[str, Any] | None = None


def log_event(ev: Event) -> None:
    print(json.dumps(asdict(ev), ensure_ascii=False))


def call_tool(run_id: str, step_id: int, tool: str, args: dict[str, Any]) -> Any:
    started = time.time()
    try:
        out = tool_impl(tool, args=args)  # (pseudo)
        dur = int((time.time() - started) * 1000)
        log_event(
            Event(
                run_id=run_id,
                kind="tool_result",
                ts_ms=int(time.time() * 1000),
                step_id=step_id,
                tool=tool,
                args_sha=sha(args),
                duration_ms=dur,
                status="ok",
                error=None,
            )
        )
        return out
    except Exception as e:
        dur = int((time.time() - started) * 1000)
        log_event(
            Event(
                run_id=run_id,
                kind="tool_result",
                ts_ms=int(time.time() * 1000),
                step_id=step_id,
                tool=tool,
                args_sha=sha(args),
                duration_ms=dur,
                status="error",
                error=type(e).__name__,
            )
        )
        raise


def stop(run_id: str, *, reason: str, usage: dict[str, Any]) -> dict[str, Any]:
    log_event(
        Event(
            run_id=run_id,
            kind="stop",
            ts_ms=int(time.time() * 1000),
            stop_reason=reason,
            usage=usage,
        )
    )
    return {"status": "stopped", "stop_reason": reason, "usage": usage}
```

The same schema in JavaScript (Node):

```js
import crypto from "node:crypto";

// Sort keys at every nesting level so the hash is stable regardless of
// insertion order (matches json.dumps(..., sort_keys=True) in the Python
// version; a JSON.stringify replacer array would drop nested keys).
function canonical(obj) {
  if (Array.isArray(obj)) return obj.map(canonical);
  if (obj && typeof obj === "object") {
    return Object.fromEntries(
      Object.keys(obj).sort().map((k) => [k, canonical(obj[k])])
    );
  }
  return obj;
}

export function sha(obj) {
  const raw = JSON.stringify(canonical(obj));
  return crypto.createHash("sha256").update(raw, "utf8").digest("hex").slice(0, 24);
}

export function logEvent(ev) {
  console.log(JSON.stringify(ev));
}

export async function callTool(runId, stepId, tool, args) {
  const started = Date.now();
  try {
    const out = await toolImpl(tool, { args }); // (pseudo)
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "ok",
      error: null,
    });
    return out;
  } catch (e) {
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "error",
      error: e?.name || "Error",
    });
    throw e;
  }
}

export function stop(runId, { reason, usage }) {
  logEvent({
    run_id: runId,
    kind: "stop",
    ts_ms: Date.now(),
    stop_reason: reason,
    usage,
  });
  return { status: "stopped", stop_reason: reason, usage };
}
```

Example failure case (concrete)
🚨 Incident: “Everything is slow” (and we didn’t know why)
Date: 2024-10-08
Duration: 3 days unnoticed, ~2 hours to debug once we added visibility
System: Customer support agent
What actually happened
The http.get tool started returning intermittent 429s/503s.
Our tool layer retried up to 8× per call (previously 2×) without jitter. The agent interpreted those failures as “try a different query” and ended up doing more tool calls per run.
Over 3 days (illustrative numbers, but this pattern is common):
- avg tool calls/run: 4.3 → 11.7
- p95 latency: 2.1s → 8.4s
- spend/run: ~2×
Nothing “crashed”. Success rate stayed ~91%, so the drift looked like “users are impatient” until support escalated.
Root cause (the boring version)
- retries + no jitter → thundering herd
- no stop reasons in logs → “success” masked drift
- no tool-call trace → we couldn’t prove where time/spend went
Fix
- Structured event logs (run_id, step_id, tool, args hash, duration, status)
- Stop reasons surfaced to the caller/UI
- Dashboards + alerts on drift signals (tool calls/run, latency P95, stop reasons)
Dashboards + alerts (examples you can steal)
You don’t need perfect observability. You need useful observability.
PromQL examples (Grafana)
```promql
# Tool calls per run (p95)
histogram_quantile(0.95, sum(rate(agent_tool_calls_bucket[5m])) by (le))

# Stop reasons over time
sum(rate(agent_stop_total[10m])) by (stop_reason)

# Latency p95
histogram_quantile(0.95, sum(rate(agent_run_latency_ms_bucket[5m])) by (le))
```
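The metric names above (`agent_tool_calls_bucket`, `agent_stop_total`) assume you already export Prometheus metrics. If you don't yet, the same drift signals fall straight out of the unified event log; a sketch:

```python
from collections import Counter, defaultdict

def drift_signals(events: list[dict]) -> tuple[dict, dict]:
    """Per-run tool-call counts and the stop_reason distribution."""
    calls: dict[str, int] = defaultdict(int)
    reasons: Counter = Counter()
    for e in events:
        if e["kind"] == "tool_result":
            calls[e["run_id"]] += 1
        elif e["kind"] == "stop":
            reasons[e["stop_reason"]] += 1
    return dict(calls), dict(reasons)

events = [
    {"run_id": "r1", "kind": "tool_result"},
    {"run_id": "r1", "kind": "tool_result"},
    {"run_id": "r1", "kind": "stop", "stop_reason": "done"},
    {"run_id": "r2", "kind": "tool_result"},
    {"run_id": "r2", "kind": "stop", "stop_reason": "tool_timeout"},
]
calls, reasons = drift_signals(events)
print(calls)    # tool calls per run
print(reasons)  # stop_reason distribution
```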
SQL example (Postgres/BigQuery-style)
```sql
-- Alert: tool_calls/run spike vs baseline
SELECT
  date_trunc('hour', created_at) AS hour,
  avg(tool_calls) AS avg_tool_calls
FROM agent_runs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
HAVING avg(tool_calls) > 2 * (
  SELECT avg(tool_calls)
  FROM agent_runs
  WHERE created_at BETWEEN now() - interval '14 days' AND now() - interval '7 days'
);
```
Alert rules (plain English)
- If `tool_calls_per_run_p95` is 2× baseline for 10 minutes → investigate (and consider killing writes).
- If `stop_reason=loop_detected` appears above baseline → investigate (tool spam / bad prompt / outage).
- If `stop_reason=tool_timeout` spikes → you have upstream issues, not “model weirdness”.
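The first rule above, as code. This is a sketch with illustrative thresholds; in practice the baseline comes from your metrics store, not a hard-coded number:

```python
def should_alert(
    current_p95: float,
    baseline_p95: float,
    sustained_minutes: float,
    window_minutes: float = 10,
    factor: float = 2.0,
) -> bool:
    """Alert when the metric holds at `factor`x baseline for the whole window."""
    return current_p95 >= factor * baseline_p95 and sustained_minutes >= window_minutes

# The incident above: tool calls/run drifted 4.3 -> 11.7
print(should_alert(current_p95=11.7, baseline_p95=4.3, sustained_minutes=12))
# A mild bump that never crosses 2x baseline stays quiet
print(should_alert(current_p95=5.0, baseline_p95=4.3, sustained_minutes=30))
```

The "sustained for N minutes" clause matters: without it, one slow tool call pages someone at 03:00.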
Trade-offs
- Logging costs money (storage, indexing). Still cheaper than blind incidents.
- You must avoid logging raw PII/secrets. Hash args and redact aggressively.
- Replay requires retention policy + access controls.
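"Hash args and redact aggressively" can be as small as this sketch (the `SENSITIVE` key list is illustrative; extend it for your domain):

```python
import hashlib
import json

SENSITIVE = {"email", "password", "api_key", "token", "ssn"}

def redact(args: dict) -> dict:
    """Drop known-sensitive keys before anything reaches a log line."""
    return {k: v for k, v in args.items() if k not in SENSITIVE}

def args_sha(args: dict) -> str:
    """Hash the full args so runs stay joinable without storing raw values."""
    raw = json.dumps(args, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:24]

args = {"query": "refund status", "email": "customer@example.com"}
print(redact(args))         # safe to log
print(len(args_sha(args)))  # fixed-length hash; same input -> same hash
```

Note the split: the hash covers the raw args (so identical calls are identifiable), while only the redacted view is ever stored as plaintext.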
When NOT to use
- Don’t build a heavy tracing platform before you have structured logs. Start small.
- Don’t log raw tool args if they contain PII/secrets. Ever.
- Don’t ship agents without stop reasons: callers that can’t tell why a run ended will just retry it.
Copy-paste checklist
- [ ] `run_id`/`step_id` for every run
- [ ] Unified event schema (tool results + stop events)
- [ ] Tool-call logs: tool, args_hash, duration, status, error class
- [ ] Stop reason returned to user + logged
- [ ] Tokens/tool calls/spend per run metrics
- [ ] Dashboards: latency P95, tool_calls/run, stop_reason distribution
- [ ] Replay data: snapshot hashes (with retention + access control)
Safe default config
```yaml
logging:
  events:
    enabled: true
    schema: "unified"
    store_args: false
    store_args_hash: true
    include: ["run_id", "step_id", "tool", "duration_ms", "status", "error", "stop_reason"]
metrics:
  track: ["tokens_per_request", "tool_calls_per_run", "latency_p95", "spend_per_run", "stop_reason"]
retention:
  tool_snapshot_days: 14
  logs_days: 30
```
Production takeaway
What breaks without this
- ❌ You can’t explain incidents
- ❌ Drift looks like “model weirdness”
- ❌ Cost overruns show up after the fact
What works with this
- ✅ You can join, replay, and debug runs
- ✅ Drift becomes a graph, not a debate
- ✅ Kill switches trigger based on real signals
Minimum to ship
- Unified structured logs
- Stop reasons
- Basic metrics + dashboards
- Alerts on drift