Debugging Agent Runs

How to debug agent runs in production with replayable traces, step history, tool evidence, and reproducible failure triage.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Signals For Debugging Runs
  5. How To Read The Debugging Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
     • Starting from an arbitrary log error
     • No trace + logs + metrics correlation
     • Ignoring repeated calls and stop_reason
     • No comparison with previous release
     • Closing an incident without replay and verification
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Debugging agent runs helps move from symptom to cause: what broke, at which step, and why.

To do this, you need to correlate tracing, logs, and metrics of one problematic run.

Without that, teams often see only the final error and miss the full path that led to it.

Core Problem

In agent systems, an incident rarely has a single obvious cause.

The final error can be only a consequence: the real issue may have started earlier, for example from a slow tool call, a bad retry, or a post-release regression. Without systematic debugging, it is hard to localize this quickly.

Next, we break down how to read these signals and consistently find root cause.

In production this often looks like:

  • logs contain many events, but no clear sequence;
  • the cause is mixed with secondary errors;
  • the incident was "fixed", but returns after a release;
  • MTTR grows because the team rebuilds incident context from scratch each time.

That is why debugging a run should be a separate operational process, not a manual search for the "first error".

How It Works

A practical debugging run usually has three levels:

  • run context (run_id, trace_id, release, workflow);
  • evidence -> analysis (spans, logs, metrics, stop_reason);
  • decision (hypothesis -> fix -> verification via replay and tests).

These levels answer: where the issue is, why it happened, and whether the fix actually removes it. Tracing shows path, logs show events, and metrics show scale and trend.

Many logs != fast debugging. Speed does not come from data volume, but from correlating data around one run.
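Correlation starts at write time: every log record and span should carry the same run_id and trace_id. A minimal sketch of a structured log emitter, assuming illustrative field names (the `log_event` helper and its fields are not from any specific library):

```python
import json
import time


def log_event(run_id, trace_id, event, **fields):
    # Every record carries the same correlation ids, so a single query
    # over run_id reconstructs the full path of one run.
    record = {
        "ts_ms": int(time.time() * 1000),
        "run_id": run_id,
        "trace_id": trace_id,
        "event": event,
        **fields,
    }
    print(json.dumps(record))
    return record


rec = log_event("run_9fd2", "trace_51c0", "tool_call",
                tool="search_docs", args_hash="h1")
```

With this shape, "correlate around one run" becomes one filter on run_id instead of a manual hunt across disconnected streams.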

Typical Production Signals For Debugging Runs

Signal               | Where to inspect               | Why it matters
---------------------|--------------------------------|--------------------------------------------------
first_error_span     | tracing                        | find where the error appeared first
slowest_span         | tracing + metrics              | bottleneck candidate (must be verified)
stop_reason          | run_finished log               | understand how the run ended
error_class          | tool_result / llm_result logs  | separate timeout from logic failures
repeated_tool_calls  | tool_call logs + tool metrics  | detect repeated calls (loops, retries, tool spam)
run_latency_p95      | metrics                        | check whether incident is already systemic
release_diff         | release comparison dashboard   | detect post-change regression
synthetic_run_status | health checks                  | verify impact on critical workflow

To keep debugging stable, these signals are usually segmented by release, workflow, model, and tool.

Important: do not add high-cardinality labels (run_id, request_id, user_id) to metrics. Use logs and tracing for those.
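One way to enforce this rule is a label allowlist at the metrics boundary. A sketch under assumptions (the allowlist contents and the `inc_counter` helper are illustrative, not a real metrics client):

```python
from collections import Counter

# Assumed low-cardinality label set; run_id / request_id / user_id must
# never become metric labels -- they belong in logs and traces.
ALLOWED_METRIC_LABELS = {"release", "workflow", "model", "tool", "error_class"}

metrics = Counter()


def inc_counter(name, labels):
    # Reject any label outside the allowlist before it reaches the backend.
    high_cardinality = set(labels) - ALLOWED_METRIC_LABELS
    if high_cardinality:
        raise ValueError(f"high-cardinality labels rejected: {sorted(high_cardinality)}")
    metrics[(name, tuple(sorted(labels.items())))] += 1


inc_counter("tool_error_total", {"release": "v42", "tool": "search_docs"})
```

A guard like this fails fast in review or CI instead of silently exploding the time-series count in production.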

How To Read The Debugging Layer

Which run failed -> at which step -> why exactly it happened. These are the three levels you should always inspect together.

Focus on trends and release-to-release deltas, not only one emergency event.

Common signal combinations:

  • first_error_span=tool_call + tool_error_rate up -> issue is in a specific tool layer;
  • run_latency_p95 up + tool_latency_p95 stable -> likely issue in LLM or runtime logic;
  • repeated_tool_calls up + stop_reason=max_steps -> agent is stuck in a loop;
  • error_rate up after release + positive release_diff -> change regression, not one-off incident;
  • synthetic_run_status=fail + health_score down -> issue already impacts critical workflow.
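The combinations above can be encoded as an ordered rule list that produces a first hypothesis. A sketch, assuming simplified boolean signal names (real checks would read dashboards, not a dict):

```python
def triage(signals):
    # Rules mirror the signal combinations listed above, checked in order.
    if signals.get("first_error_span") == "tool_call" and signals.get("tool_error_rate_up"):
        return "tool-layer issue"
    if signals.get("run_latency_p95_up") and not signals.get("tool_latency_p95_up"):
        return "LLM or runtime logic"
    if signals.get("repeated_tool_calls_up") and signals.get("stop_reason") == "max_steps":
        return "agent loop"
    if signals.get("error_rate_up_after_release") and signals.get("release_diff_positive"):
        return "release regression"
    if signals.get("synthetic_run_fail") and signals.get("health_score_down"):
        return "critical workflow impacted"
    return "needs manual investigation"


verdict = triage({"repeated_tool_calls_up": True, "stop_reason": "max_steps"})
```

The point is not the exact rules but that each hypothesis requires at least two corroborating signals before it is worth acting on.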

When To Use

A formal debugging flow is not always required.

For a simple single-shot scenario without tools, basic logging and manual error inspection may be enough.

But a systematic debugging approach becomes critical when:

  • runs contain multiple reasoning steps and tool calls;
  • incidents impact latency, cost, or SLO;
  • releases are frequent and regressions must be caught fast;
  • the team has an on-call process and predictable MTTR is required.

Implementation Example

Below is a simplified function that collects evidence for one run and builds baseline hypotheses. It does not replace full incident tooling, but it shows a practical debugging process.

PYTHON
from collections import Counter


def debug_run(run_id, trace_events, log_events, debug_metrics_snapshot):
    run_spans = sorted(
        [s for s in trace_events if s.get("run_id") == run_id],
        key=lambda s: s.get("started_at_ms", 0),
    )
    run_logs = [e for e in log_events if e.get("run_id") == run_id]

    first_error_span = next((s for s in run_spans if s.get("status") == "error"), None)
    # slowest_span may be None if the run has no spans
    slowest_span = max(run_spans, key=lambda s: s.get("latency_ms", 0), default=None)

    # The last run_finished event records how the run actually ended.
    stop_reason = "unknown"
    for event in reversed(run_logs):
        if event.get("event") == "run_finished":
            stop_reason = event.get("stop_reason", "unknown")
            break

    # Count tool calls repeated with identical arguments: a loop / retry-storm signal.
    seen_signatures = set()
    repeated_tools = Counter()
    for event in run_logs:
        if event.get("event") != "tool_call":
            continue
        signature = (event.get("tool"), event.get("args_hash"))
        if signature in seen_signatures:
            repeated_tools[event.get("tool")] += 1
        else:
            seen_signatures.add(signature)

    hypotheses = []
    if first_error_span and first_error_span.get("step_type") == "tool_call":
        hypotheses.append("Likely tool-layer failure: check tool availability and timeout policy.")

    if repeated_tools:
        hypotheses.append("Repeated tool calls detected: check dedupe/cache and stop conditions.")

    if slowest_span and debug_metrics_snapshot.get("run_latency_p95_ms", 0) > debug_metrics_snapshot.get("slo_latency_ms", 2500):
        hypotheses.append("p95 latency is above SLO: localize bottleneck via slowest_span.")

    if debug_metrics_snapshot.get("release_error_rate_delta", 0) > 0:
        hypotheses.append("error_rate increased after release: check prompt/runtime/tool routing changes.")

    return {
        "run_id": run_id,
        "first_error_span": first_error_span,
        "slowest_span": slowest_span,
        "stop_reason": stop_reason,
        "repeated_tools": dict(repeated_tools),
        "hypotheses": hypotheses,
    }

Debugging is not complete until the issue is reproducible (replay) and you confirm the fix removes it consistently. If the issue cannot be reproduced, debugging moves to hypothesis mode, not evidence mode.

Insight

Replay != optional.

Without replay, this is an assumption.
With replay, this is evidence.
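A replay can be as simple as re-executing a recorded step list against captured tool outputs. A minimal sketch, assuming illustrative step and record shapes (not a real replay harness):

```python
def replay_run(steps, recorded_tool_results):
    # Re-execute the run's steps, substituting recorded tool outputs so
    # the failure path is deterministic and a fix can be verified.
    outcomes = []
    for step in steps:
        if step["type"] != "tool_call":
            outcomes.append({"step": step["type"], "status": "replayed"})
            continue
        key = (step["tool"], step["args_hash"])
        if key not in recorded_tool_results:
            # Missing evidence: we are back in hypothesis mode, not evidence mode.
            return {"status": "not_reproducible", "missing_tool_result": key}
        outcomes.append({"step": "tool_call", "result": recorded_tool_results[key]})
    return {"status": "replayed", "outcomes": outcomes}


report = replay_run(
    steps=[
        {"type": "llm_generate"},
        {"type": "tool_call", "tool": "search_docs", "args_hash": "h1"},
    ],
    recorded_tool_results={
        ("search_docs", "h1"): {"status": "error", "error_class": "timeout"},
    },
)
```

If any recorded tool result is missing, the function says so explicitly, which matches the rule above: no replay, no evidence.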

This is how a short debugging snapshot can look:

Run      | first_error_span       | slowest_span                  | stop_reason | Conclusion
---------|------------------------|-------------------------------|-------------|------------------------------
run_9fd2 | tool_call: search_docs | tool_call: search_docs (1.8s) | tool_error  | tool degraded + retries
run_a113 | llm_generate           | llm_generate (2.4s)           | step_error  | model failure after release
run_d77c | (none)                 | reasoning (3.1s)              | max_steps   | loop without explicit error

Investigation

When an incident signal fires:

  1. capture run_id, trace_id, release, and affected workflow;
  2. find first_error_span and slowest_span in tracing;
  3. check stop_reason, error_class, repeated_tool_calls in logs;
  4. confirm issue scale in metrics (spike or trend) and compare release deltas.
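Step 4's release comparison can be sketched as a delta check over release-level metrics. The threshold and metric names are illustrative assumptions:

```python
def release_diff(before, after, rel_threshold=0.10):
    # Flag metrics that worsened by more than rel_threshold after the release.
    worse = {}
    for name, prev_value in before.items():
        curr_value = after.get(name, prev_value)
        if prev_value > 0 and (curr_value - prev_value) / prev_value > rel_threshold:
            worse[name] = {"before": prev_value, "after": curr_value}
    return worse


diff = release_diff(
    {"error_rate": 0.02, "latency_p95_ms": 1800},
    {"error_rate": 0.05, "latency_p95_ms": 1850},
)
```

Here error_rate more than doubled and is flagged, while the small p95 shift stays below the threshold, which separates a regression from normal release noise.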

Common Mistakes

Even with observability configured, debugging often breaks because of typical mistakes below.

Starting from an arbitrary log error

Without binding analysis to a specific run_id, teams mix symptoms from different incidents. In this mode, it is hard to separate a local issue from cascading failure.

No trace + logs + metrics correlation

If tracing, logs, and metrics are inspected separately, hypotheses often contradict each other. Because of this, MTTR grows even for simple tool failures.

Ignoring repeated calls and stop_reason

Without these signals, loops and retry storms are easy to miss. This often hides the early phase of tool spam.

No comparison with previous release

Without release_diff, the team does not see whether the issue appeared after changes. As a result, a regression stays in production longer.

Closing an incident without replay and verification

A fix can remove the symptom, not the cause. This raises the risk of repeated partial outage.

Self-Check

Run a short baseline debugging-flow check before release.


If baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: Where should debugging start for one problematic run?
A: Start with run_id and trace_id: find first_error_span, check stop_reason, then confirm scope in metrics. first_error_span is the fastest way to find the failure point.

Q: What matters more for debugging: tracing or logs?
A: They work together: tracing shows step path, logs provide event details (error_class, args_hash, policy decision).

Q: How do I know this is release regression and not one-off failure?
A: Compare error_rate, latency_p95, and repeated_tool_calls between releases. If a signal is consistently worse after the release, it is a regression.

Q: What is minimum data to debug in 10-15 minutes?
A: Minimum: run_id, trace_id, first_error_span, stop_reason, error_class, latency_p95, and release context.
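That minimum set can be captured as one record handed to on-call before triage starts. A sketch, assuming this illustrative field layout (the class name and types are not from any specific tool):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinimumDebugContext:
    # The minimum evidence for a 10-15 minute triage, per the answer above.
    run_id: str
    trace_id: str
    first_error_span: Optional[str]
    stop_reason: str
    error_class: Optional[str]
    latency_p95_ms: float
    release: str


ctx = MinimumDebugContext(
    run_id="run_9fd2",
    trace_id="trace_51c0",
    first_error_span="tool_call",
    stop_reason="tool_error",
    error_class="timeout",
    latency_p95_ms=2100.0,
    release="v42",
)
```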

Next on this topic:

⏱️ 7 min read • Updated March 23, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.