Idea In 30 Seconds
Debugging agent runs moves you from symptom to cause: what broke, at which step, and why.
To do this, you need to correlate tracing, logs, and metrics for one problematic run.
Without that, teams often see only the final error and miss the full path that led to it.
Core Problem
In agent systems, an incident rarely has a single obvious cause.
The final error may be only a consequence: the real issue may have started earlier, for example with a slow tool call, a bad retry, or a post-release regression. Without systematic debugging, it is hard to localize quickly.
Below, we break down how to read these signals and consistently find the root cause.
In production this often looks like:
- logs contain many events, but no clear sequence;
- the cause is mixed with secondary errors;
- the incident was "fixed" but returns after the next release;
- MTTR grows because the team rebuilds incident context from scratch each time.
That is why debugging a run should be a separate operational process, not a manual search for the "first error".
How It Works
A practical debugging run usually has three levels:
- run context (run_id, trace_id, release, workflow);
- evidence -> analysis (spans, logs, metrics, stop_reason);
- decision (hypothesis -> fix -> verification via replay and tests).
These levels answer: where the issue is, why it happened, and whether the fix actually removes it. Tracing shows the path, logs show the events, and metrics show scale and trend.
Many logs != fast debugging. Speed comes not from data volume but from correlating data around one run.
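The correlation step above can be sketched as a small helper that pulls everything for one run into a single view. This is a minimal sketch: the event shapes (dicts with `run_id` and `started_at_ms` fields) are assumptions, not a fixed schema.

```python
def correlate_run(run_id, spans, logs, metrics_snapshot):
    """Group all telemetry for one run so it reads as a single story."""
    return {
        "run_id": run_id,
        # Spans ordered by start time reconstruct the run path.
        "spans": sorted(
            (s for s in spans if s.get("run_id") == run_id),
            key=lambda s: s.get("started_at_ms", 0),
        ),
        # Logs carry event-level detail for the same run.
        "logs": [e for e in logs if e.get("run_id") == run_id],
        # Metrics are aggregated, so attach the relevant snapshot as-is.
        "metrics": metrics_snapshot,
    }
```

With this view in hand, the table of signals below can be read against one run instead of the whole event stream.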
Typical Production Signals For Debugging Runs
| Signal | Where to inspect | Why it matters |
|---|---|---|
| first_error_span | tracing | find where the error appeared first |
| slowest_span | tracing + metrics | bottleneck candidate (must be verified) |
| stop_reason | run_finished log | understand how the run ended |
| error_class | tool_result / llm_result logs | separate timeout from logic failures |
| repeated_tool_calls | tool_call logs + tool metrics | detect repeated calls (loops, retries, tool spam) |
| run_latency_p95 | metrics | check whether incident is already systemic |
| release_diff | release comparison dashboard | detect post-change regression |
| synthetic_run_status | health checks | verify impact on critical workflow |
To keep debugging stable, these signals are usually segmented by release, workflow, model, and tool.
Important: do not add high-cardinality labels (run_id, request_id, user_id) to metrics. Use logs and tracing for those.
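The split above can be sketched in a few lines. This is a hedged illustration, not a real metrics client: the `Counter`-backed metric and the list-based log sink are stand-ins for whatever backend you use.

```python
from collections import Counter

metric_counter = Counter()  # stand-in for a real metrics client

def record_tool_error(release, workflow, tool, run_id, log_sink):
    # Low-cardinality labels only: safe as metric dimensions.
    metric_counter[("tool_errors_total", release, workflow, tool)] += 1
    # High-cardinality identifiers belong in logs, where they stay searchable.
    log_sink.append({"event": "tool_error", "run_id": run_id,
                     "release": release, "workflow": workflow, "tool": tool})
```

The metric stays aggregable across runs, while the log entry keeps the `run_id` needed to jump back into tracing.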
How To Read The Debugging Layer
Which run failed -> at which step -> why exactly it happened. These are the three levels you should always inspect together.
Focus on trends and release-to-release deltas, not only one emergency event.
Common signal combinations:
- first_error_span = tool_call + tool_error_rate up -> issue is in a specific tool layer;
- run_latency_p95 up + tool_latency_p95 stable -> likely issue in LLM or runtime logic;
- repeated_tool_calls up + stop_reason = max_steps -> agent is stuck in a loop;
- error_rate up after release + positive release_diff -> change regression, not a one-off incident;
- synthetic_run_status = fail + health_score down -> issue already impacts critical workflow.
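These combinations can be encoded as simple rules evaluated in order. A minimal sketch, assuming boolean flags and string values already extracted from the signals above; the key names are illustrative, not a standard schema.

```python
def classify_incident(signals):
    """Map common signal combinations to a baseline hypothesis."""
    if signals.get("first_error_span") == "tool_call" and signals.get("tool_error_rate_up"):
        return "tool-layer failure"
    if signals.get("run_latency_p95_up") and not signals.get("tool_latency_p95_up"):
        return "LLM or runtime logic"
    if signals.get("repeated_tool_calls_up") and signals.get("stop_reason") == "max_steps":
        return "agent loop"
    if signals.get("error_rate_up_after_release") and signals.get("release_diff_positive"):
        return "release regression"
    if signals.get("synthetic_run_status") == "fail" and signals.get("health_score_down"):
        return "critical workflow impact"
    return "needs manual triage"
```

Such a classifier only produces a starting hypothesis; each result still has to be verified against the trace and logs of the specific run.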
When To Use
A formal debugging flow is not always required.
For a simple single-shot scenario without tools, basic logging and manual error inspection may be enough.
But a systematic debugging approach becomes critical when:
- runs contain multiple reasoning steps and tool calls;
- incidents impact latency, cost, or SLO;
- releases are frequent and regressions must be caught fast;
- the team has an on-call process and predictable MTTR is required.
Implementation Example
Below is a simplified function that collects evidence for one run and builds a baseline hypothesis. It does not replace full incident tooling, but it shows a practical debugging process.
```python
from collections import Counter

def debug_run(run_id, trace_events, log_events, debug_metrics_snapshot):
    run_spans = sorted(
        [s for s in trace_events if s.get("run_id") == run_id],
        key=lambda s: s.get("started_at_ms", 0),
    )
    run_logs = [e for e in log_events if e.get("run_id") == run_id]

    first_error_span = next((s for s in run_spans if s.get("status") == "error"), None)
    # slowest_span may be None if the run has no spans
    slowest_span = max(run_spans, key=lambda s: s.get("latency_ms", 0), default=None)

    stop_reason = "unknown"
    for event in reversed(run_logs):
        if event.get("event") == "run_finished":
            stop_reason = event.get("stop_reason", "unknown")
            break

    seen_signatures = set()
    repeated_tools = Counter()
    for event in run_logs:
        if event.get("event") != "tool_call":
            continue
        signature = (event.get("tool"), event.get("args_hash"))
        if signature in seen_signatures:
            repeated_tools[event.get("tool")] += 1
        else:
            seen_signatures.add(signature)

    hypotheses = []
    if first_error_span and first_error_span.get("step_type") == "tool_call":
        hypotheses.append("Likely tool-layer failure: check tool availability and timeout policy.")
    if repeated_tools:
        hypotheses.append("Repeated tool calls detected: check dedupe/cache and stop conditions.")
    if slowest_span and debug_metrics_snapshot.get("run_latency_p95_ms", 0) > debug_metrics_snapshot.get("slo_latency_ms", 2500):
        hypotheses.append("p95 latency is above SLO: localize bottleneck via slowest_span.")
    if debug_metrics_snapshot.get("release_error_rate_delta", 0) > 0:
        hypotheses.append("error_rate increased after release: check prompt/runtime/tool routing changes.")

    return {
        "run_id": run_id,
        "first_error_span": first_error_span,
        "slowest_span": slowest_span,
        "stop_reason": stop_reason,
        "repeated_tools": dict(repeated_tools),
        "hypotheses": hypotheses,
    }
```
Debugging is not complete until the issue is reproducible (replay) and you confirm the fix removes it consistently. If the issue cannot be reproduced, debugging moves to hypothesis mode, not evidence mode.
Replay != optional.
Without replay, this is an assumption.
With replay, this is evidence.
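The verification step can be sketched as a small harness that replays the failing scenario several times and treats the fix as confirmed only if every replay succeeds. `replay_fn` is an assumed callable that re-executes one run input and returns a dict with a `status` field; the interface is illustrative.

```python
def verify_fix_with_replay(replay_fn, failing_inputs, runs=3):
    """Replay each failing input several times; a single failure means
    the fix is not verified."""
    results = [replay_fn(inp) for inp in failing_inputs for _ in range(runs)]
    failures = [r for r in results if r.get("status") != "ok"]
    return {
        "verified": not failures,
        "failures": len(failures),
        "total": len(results),
    }
```

Replaying more than once matters: intermittent failures (timeouts, retry races) can pass a single replay and still return in production.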
This is how a short debugging snapshot can look:
| Run | first_error_span | slowest_span | stop_reason | Conclusion |
|---|---|---|---|---|
| run_9fd2 | tool_call: search_docs | tool_call: search_docs (1.8s) | tool_error | tool degraded + retries |
| run_a113 | llm_generate | llm_generate (2.4s) | step_error | model failure after release |
| run_d77c | (none) | reasoning (3.1s) | max_steps | loop without explicit error |
Investigation
When an incident signal fires:
- capture run_id, trace_id, release, and affected workflow;
- find first_error_span and slowest_span in tracing;
- check stop_reason, error_class, and repeated_tool_calls in logs;
- confirm issue scale in metrics (spike or trend) and compare release deltas.
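The steps above can be turned into a mechanical completeness check: before forming hypotheses, confirm that every piece of evidence has actually been captured. A sketch, assuming the evidence is a flat dict keyed by the signal names used in this page.

```python
def triage_checklist(evidence):
    """Return, per investigation step, which evidence keys are still missing."""
    required = {
        "capture context": ["run_id", "trace_id", "release", "workflow"],
        "tracing": ["first_error_span", "slowest_span"],
        "logs": ["stop_reason", "error_class", "repeated_tool_calls"],
        "metrics": ["run_latency_p95_ms", "release_error_rate_delta"],
    }
    missing = {
        step: [k for k in keys if k not in evidence]
        for step, keys in required.items()
    }
    # Keep only steps that still lack data; an empty dict means triage can start.
    return {step: keys for step, keys in missing.items() if keys}
```

An empty result means the team can move to hypothesis-building; a non-empty one points at which layer (tracing, logs, or metrics) still needs collection.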
Common Mistakes
Even with observability configured, debugging often breaks because of typical mistakes below.
Starting from an arbitrary log error
Without binding analysis to a specific run_id, teams mix symptoms from different incidents.
In this mode, it is hard to separate a local issue from a cascading failure.
No trace + logs + metrics correlation
If tracing, logs, and metrics are inspected separately, hypotheses often contradict each other. Because of this, MTTR grows even for simple tool failures.
Ignoring repeated calls and stop_reason
Without these signals, loops and retry storms are easy to miss. This often hides the early phase of tool spam.
No comparison with previous release
Without release_diff, the team cannot see whether the issue appeared after a change.
As a result, regression stays in production longer.
Closing an incident without replay and verification
A fix can remove the symptom, not the cause. This raises the risk of repeated partial outage.
Self-Check
Below is a short baseline debugging-flow checklist before release.
If baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.
FAQ
Q: Where should debugging start for one problematic run?
A: Start with run_id and trace_id: find first_error_span, check stop_reason, then confirm scope in metrics. first_error_span is the fastest way to find the failure point.
Q: What matters more for debugging: tracing or logs?
A: They work together: tracing shows step path, logs provide event details (error_class, args_hash, policy decision).
Q: How do I know this is release regression and not one-off failure?
A: Compare error_rate, latency_p95, repeated_tool_calls between releases. If signal is consistently worse after release, this is regression.
Q: What is the minimum data needed to debug in 10-15 minutes?
A: Minimum: run_id, trace_id, first_error_span, stop_reason, error_class, latency_p95, and release context.
Related Pages
Next on this topic:
- Agent Tracing: how to see one run path step by step.
- Agent Logging: which events are needed for incident analysis.
- Agent Metrics: how to separate one-off failure from trend.
- Agent Health Checks: early degradation signals before an incident.
- Agent Failure Alerting: how to trigger investigation on time.