Debugging Agent Runs

How to debug agent runs in production with replayable traces, step history, tool evidence, and reproducible failure triage.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Signals For Debugging Runs
  5. How To Read The Debugging Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
     • Starting from an arbitrary log error
     • No trace + logs + metrics correlation
     • Ignoring repeated calls and stop_reason
     • No comparison with previous release
     • Closing an incident without replay and verification
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Debugging agent runs helps move from symptom to cause: what broke, at which step, and why.

To do this, you need to correlate tracing, logs, and metrics of one problematic run.

Without that, teams often see only the final error and miss the full path that led to it.

Core Problem

In agent systems, an incident rarely has a single obvious cause.

The final error can be only a consequence: the real issue may have started earlier, for example from a slow tool call, a bad retry, or a post-release regression. Without systematic debugging, it is hard to localize this quickly.

Next, we break down how to read these signals and consistently find root cause.

In production this often looks like:

  • logs contain many events, but no clear sequence;
  • the cause is mixed with secondary errors;
  • the incident was "fixed", but returns after a release;
  • MTTR grows because the team rebuilds incident context from scratch each time.

That is why debugging a run should be a separate operational process, not a manual search for the "first error".

How It Works

A practical debugging run usually has three levels:

  • run context (run_id, trace_id, release, workflow);
  • evidence -> analysis (spans, logs, metrics, stop_reason);
  • decision (hypothesis -> fix -> verification via replay and tests).

These levels answer: where the issue is, why it happened, and whether the fix actually removes it. Tracing shows path, logs show events, and metrics show scale and trend.

Many logs != fast debugging. Speed does not come from data volume, but from correlating data around one run.
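Correlation starts at write time: every log record and span should carry the same run_id and trace_id. A minimal sketch of a structured log emitter, assuming illustrative field names (the `log_event` helper and its fields are not from any specific library):

```python
import json
import time


def log_event(run_id, trace_id, event, **fields):
    # Every record carries the same correlation ids, so a single query
    # over run_id reconstructs the full path of one run.
    record = {
        "ts_ms": int(time.time() * 1000),
        "run_id": run_id,
        "trace_id": trace_id,
        "event": event,
        **fields,
    }
    print(json.dumps(record))
    return record


rec = log_event("run_9fd2", "trace_51c0", "tool_call",
                tool="search_docs", args_hash="h1")
```

With this shape, "correlate around one run" becomes one filter on run_id instead of a manual hunt across disconnected streams.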

Typical Production Signals For Debugging Runs

Signal               | Where to inspect               | Why it matters
---------------------|--------------------------------|--------------------------------------------------
first_error_span     | tracing                        | find where the error appeared first
slowest_span         | tracing + metrics              | bottleneck candidate (must be verified)
stop_reason          | run_finished log               | understand how the run ended
error_class          | tool_result / llm_result logs  | separate timeout from logic failures
repeated_tool_calls  | tool_call logs + tool metrics  | detect repeated calls (loops, retries, tool spam)
run_latency_p95      | metrics                        | check whether incident is already systemic
release_diff         | release comparison dashboard   | detect post-change regression
synthetic_run_status | health checks                  | verify impact on critical workflow

To keep debugging stable, these signals are usually segmented by release, workflow, model, and tool.

Important: do not add high-cardinality labels (run_id, request_id, user_id) to metrics. Use logs and tracing for those.
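One way to enforce this rule is a label allowlist at the metrics boundary. A sketch under assumptions (the allowlist contents and the `inc_counter` helper are illustrative, not a real metrics client):

```python
from collections import Counter

# Assumed low-cardinality label set; run_id / request_id / user_id must
# never become metric labels -- they belong in logs and traces.
ALLOWED_METRIC_LABELS = {"release", "workflow", "model", "tool", "error_class"}

metrics = Counter()


def inc_counter(name, labels):
    # Reject any label outside the allowlist before it reaches the backend.
    high_cardinality = set(labels) - ALLOWED_METRIC_LABELS
    if high_cardinality:
        raise ValueError(f"high-cardinality labels rejected: {sorted(high_cardinality)}")
    metrics[(name, tuple(sorted(labels.items())))] += 1


inc_counter("tool_error_total", {"release": "v42", "tool": "search_docs"})
```

A guard like this fails fast in review or CI instead of silently exploding the time-series count in production.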

How To Read The Debugging Layer

Which run failed -> at which step -> why exactly it happened. These are the three levels you should always inspect together.

Focus on trends and release-to-release deltas, not only one emergency event.

Common signal combinations:

  • first_error_span=tool_call + tool_error_rate up -> issue is in a specific tool layer;
  • run_latency_p95 up + tool_latency_p95 stable -> likely issue in LLM or runtime logic;
  • repeated_tool_calls up + stop_reason=max_steps -> agent is stuck in a loop;
  • error_rate up after release + positive release_diff -> change regression, not one-off incident;
  • synthetic_run_status=fail + health_score down -> issue already impacts critical workflow.
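The combinations above can be encoded as an ordered rule list that produces a first hypothesis. A sketch, assuming simplified boolean signal names (real checks would read dashboards, not a dict):

```python
def triage(signals):
    # Rules mirror the signal combinations listed above, checked in order.
    if signals.get("first_error_span") == "tool_call" and signals.get("tool_error_rate_up"):
        return "tool-layer issue"
    if signals.get("run_latency_p95_up") and not signals.get("tool_latency_p95_up"):
        return "LLM or runtime logic"
    if signals.get("repeated_tool_calls_up") and signals.get("stop_reason") == "max_steps":
        return "agent loop"
    if signals.get("error_rate_up_after_release") and signals.get("release_diff_positive"):
        return "release regression"
    if signals.get("synthetic_run_fail") and signals.get("health_score_down"):
        return "critical workflow impacted"
    return "needs manual investigation"


verdict = triage({"repeated_tool_calls_up": True, "stop_reason": "max_steps"})
```

The point is not the exact rules but that each hypothesis requires at least two corroborating signals before it is worth acting on.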

When To Use

A formal debugging flow is not always required.

For a simple single-shot scenario without tools, basic logging and manual error inspection may be enough.

But a systematic debugging approach becomes critical when:

  • runs contain multiple reasoning steps and tool calls;
  • incidents impact latency, cost, or SLO;
  • releases are frequent and regressions must be caught fast;
  • the team has an on-call process and predictable MTTR is required.

Implementation Example

Below is a simplified function that collects evidence for one run and builds baseline hypotheses. It does not replace full incident tooling, but it shows a practical debugging process.

PYTHON
from collections import Counter


def debug_run(run_id, trace_events, log_events, debug_metrics_snapshot):
    run_spans = sorted(
        [s for s in trace_events if s.get("run_id") == run_id],
        key=lambda s: s.get("started_at_ms", 0),
    )
    run_logs = [e for e in log_events if e.get("run_id") == run_id]

    first_error_span = next((s for s in run_spans if s.get("status") == "error"), None)
    # slowest_span may be None if the run has no spans
    slowest_span = max(run_spans, key=lambda s: s.get("latency_ms", 0), default=None)

    # The last run_finished event records how the run actually ended.
    stop_reason = "unknown"
    for event in reversed(run_logs):
        if event.get("event") == "run_finished":
            stop_reason = event.get("stop_reason", "unknown")
            break

    # Count tool calls repeated with identical arguments: a loop / retry-storm signal.
    seen_signatures = set()
    repeated_tools = Counter()
    for event in run_logs:
        if event.get("event") != "tool_call":
            continue
        signature = (event.get("tool"), event.get("args_hash"))
        if signature in seen_signatures:
            repeated_tools[event.get("tool")] += 1
        else:
            seen_signatures.add(signature)

    hypotheses = []
    if first_error_span and first_error_span.get("step_type") == "tool_call":
        hypotheses.append("Likely tool-layer failure: check tool availability and timeout policy.")

    if repeated_tools:
        hypotheses.append("Repeated tool calls detected: check dedupe/cache and stop conditions.")

    if slowest_span and debug_metrics_snapshot.get("run_latency_p95_ms", 0) > debug_metrics_snapshot.get("slo_latency_ms", 2500):
        hypotheses.append("p95 latency is above SLO: localize bottleneck via slowest_span.")

    if debug_metrics_snapshot.get("release_error_rate_delta", 0) > 0:
        hypotheses.append("error_rate increased after release: check prompt/runtime/tool routing changes.")

    return {
        "run_id": run_id,
        "first_error_span": first_error_span,
        "slowest_span": slowest_span,
        "stop_reason": stop_reason,
        "repeated_tools": dict(repeated_tools),
        "hypotheses": hypotheses,
    }

Debugging is not complete until the issue is reproducible (replay) and you confirm the fix removes it consistently. If the issue cannot be reproduced, debugging moves to hypothesis mode, not evidence mode.

Insight

Replay != optional.

Without replay, this is an assumption.
With replay, this is evidence.
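A replay can be as simple as re-executing a recorded step list against captured tool outputs. A minimal sketch, assuming illustrative step and record shapes (not a real replay harness):

```python
def replay_run(steps, recorded_tool_results):
    # Re-execute the run's steps, substituting recorded tool outputs so
    # the failure path is deterministic and a fix can be verified.
    outcomes = []
    for step in steps:
        if step["type"] != "tool_call":
            outcomes.append({"step": step["type"], "status": "replayed"})
            continue
        key = (step["tool"], step["args_hash"])
        if key not in recorded_tool_results:
            # Missing evidence: we are back in hypothesis mode, not evidence mode.
            return {"status": "not_reproducible", "missing_tool_result": key}
        outcomes.append({"step": "tool_call", "result": recorded_tool_results[key]})
    return {"status": "replayed", "outcomes": outcomes}


report = replay_run(
    steps=[
        {"type": "llm_generate"},
        {"type": "tool_call", "tool": "search_docs", "args_hash": "h1"},
    ],
    recorded_tool_results={
        ("search_docs", "h1"): {"status": "error", "error_class": "timeout"},
    },
)
```

If any recorded tool result is missing, the function says so explicitly, which matches the rule above: no replay, no evidence.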

This is how a short debugging snapshot can look:

Run      | first_error_span       | slowest_span                  | stop_reason | Conclusion
---------|------------------------|-------------------------------|-------------|------------------------------
run_9fd2 | tool_call: search_docs | tool_call: search_docs (1.8s) | tool_error  | tool degraded + retries
run_a113 | llm_generate           | llm_generate (2.4s)           | step_error  | model failure after release
run_d77c | (none)                 | reasoning (3.1s)              | max_steps   | loop without explicit error

Investigation

When an incident signal fires:

  1. capture run_id, trace_id, release, and affected workflow;
  2. find first_error_span and slowest_span in tracing;
  3. check stop_reason, error_class, repeated_tool_calls in logs;
  4. confirm issue scale in metrics (spike or trend) and compare release deltas.
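Step 4's release comparison can be sketched as a delta check over release-level metrics. The threshold and metric names are illustrative assumptions:

```python
def release_diff(before, after, rel_threshold=0.10):
    # Flag metrics that worsened by more than rel_threshold after the release.
    worse = {}
    for name, prev_value in before.items():
        curr_value = after.get(name, prev_value)
        if prev_value > 0 and (curr_value - prev_value) / prev_value > rel_threshold:
            worse[name] = {"before": prev_value, "after": curr_value}
    return worse


diff = release_diff(
    {"error_rate": 0.02, "latency_p95_ms": 1800},
    {"error_rate": 0.05, "latency_p95_ms": 1850},
)
```

Here error_rate more than doubled and is flagged, while the small p95 shift stays below the threshold, which separates a regression from normal release noise.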

Common Mistakes

Even with observability configured, debugging often breaks because of typical mistakes below.

Starting from an arbitrary log error

Without binding analysis to a specific run_id, teams mix symptoms from different incidents. In this mode, it is hard to separate a local issue from cascading failure.

No trace + logs + metrics correlation

If tracing, logs, and metrics are inspected separately, hypotheses often contradict each other. Because of this, MTTR grows even for simple tool failures.

Ignoring repeated calls and stop_reason

Without these signals, loops and retry storms are easy to miss. This often hides the early phase of tool spam.

No comparison with previous release

Without release_diff, the team does not see whether the issue appeared after changes. As a result, a regression stays in production longer.

Closing an incident without replay and verification

A fix can remove the symptom, not the cause. This raises the risk of repeated partial outage.

Self-Check

Run a short baseline debugging-flow check before release.


If baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: Where should debugging start for one problematic run?
A: Start with run_id and trace_id: find first_error_span, check stop_reason, then confirm scope in metrics. first_error_span is the fastest way to find the failure point.

Q: What matters more for debugging: tracing or logs?
A: They work together: tracing shows step path, logs provide event details (error_class, args_hash, policy decision).

Q: How do I know this is release regression and not one-off failure?
A: Compare error_rate, latency_p95, and repeated_tool_calls between releases. If a signal is consistently worse after the release, it is a regression.

Q: What is minimum data to debug in 10-15 minutes?
A: Minimum: run_id, trace_id, first_error_span, stop_reason, error_class, latency_p95, and release context.
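That minimum set can be captured as one record handed to on-call before triage starts. A sketch, assuming this illustrative field layout (the class name and types are not from any specific tool):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinimumDebugContext:
    # The minimum evidence for a 10-15 minute triage, per the answer above.
    run_id: str
    trace_id: str
    first_error_span: Optional[str]
    stop_reason: str
    error_class: Optional[str]
    latency_p95_ms: float
    release: str


ctx = MinimumDebugContext(
    run_id="run_9fd2",
    trace_id="trace_51c0",
    first_error_span="tool_call",
    stop_reason="tool_error",
    error_class="timeout",
    latency_p95_ms=2100.0,
    release="v42",
)
```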

Next on this topic:

⏱️ 7 min read • Updated March 23, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.