Agent tracing: how to track agent decisions

Agent tracing records each agent step as a trace made of spans, shows reasoning and tool calls, and helps debug problematic runs in production.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. What a single-run trace looks like
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
       • Trace only at run level, without spans
       • Missing trace_id in part of events
       • Tool calls are not traced
       • No stop_reason and span status
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Agent tracing shows the full execution path of one run.

A trace is made of spans: each span is one step, for example reasoning, a tool call, or LLM generation.

This gives step-level visibility and makes debugging in production much easier.

Core Problem

In many systems, teams log only run start and run finish.

For agents, this is not enough: between start and final answer there can be dozens of steps. Without tracing, it is hard to understand what exactly the agent did and at which step the issue appeared.

The same request can execute differently: different step count, different tools, different latency.

Without tracing, even basic questions are hard to answer:

  • Which step was the slowest?
  • Why did the agent call a tool again?
  • Where exactly did the error happen?
  • Why did token usage grow in this specific run?

That is why tracing matters: it shows the full run execution path, not only the final result.

How It Works

Tracing has two core entities:

  • trace β€” the full path of one run
  • span β€” one step inside that trace

In practice, a runtime step often maps to one span, but not always. A complex step can contain nested spans, for example a tool call that internally makes multiple HTTP requests or database calls.

Each span usually has baseline fields:

  • trace_id and run_id for correlation
  • span_id (and parent_span_id when needed)
  • step_type (reasoning, tool_call, llm_generate)
  • latency_ms and status (ok / error)

This structure (trace_id, span_id) follows the OpenTelemetry (OTel) standard used by most modern monitoring systems. The parent_span_id field comes from OTel's hierarchical span model and is used to build an execution tree (trace tree). Specialized tracing tools exist for agents (for example LangSmith, Langfuse, Arize Phoenix), but the principles are the same regardless of platform.
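To illustrate how parent_span_id turns a flat list of span events into an execution tree, here is a small sketch (the span records are hypothetical and trimmed to the fields needed for tree building):

```python
def build_trace_tree(spans):
    """Group spans by parent_span_id so a trace can be rendered as a
    tree: root spans at the top, children nested under their parents."""
    children = {}
    for span in spans:
        children.setdefault(span.get("parent_span_id"), []).append(span)

    def attach(parent_id):
        return [
            {**span, "children": attach(span["span_id"])}
            for span in children.get(parent_id, [])
        ]

    return attach(None)  # root spans have no parent_span_id


spans = [
    {"span_id": "sp_root", "parent_span_id": None, "step_type": "run"},
    {"span_id": "sp_1", "parent_span_id": "sp_root", "step_type": "tool_call"},
    {"span_id": "sp_2", "parent_span_id": "sp_1", "step_type": "llm_generate"},
]
tree = build_trace_tree(spans)
```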

What a single-run trace looks like

The easiest way to understand tracing is to walk through a single request.

In real systems, each span event includes trace_id, span_id, and often parent_span_id. In the example below, these fields are shortened for readability.

TEXT
trace_id: tr_9fd2
run_id: run_9fd2
user_query: "Find recent research about battery recycling"

span 1  llm_reasoning          320ms   status=ok
span 2  tool_call: search      410ms   status=ok
span 3  llm_reasoning          180ms   status=ok
span 4  tool_call: fetch       260ms   status=error

stop_reason: tool_error

This trace immediately shows:

  • what steps the agent executed;
  • which tools were called;
  • how long each step took;
  • where delay or failure happened.

Traces are useful not only for debugging. They are also important for evaluations and automatic validation of intermediate steps: without traces it is hard to verify whether the agent acted correctly, not only whether it produced the right final answer.
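Given span records like the ones in the example above, the questions from the Core Problem section can be answered mechanically. A sketch (span fields follow the earlier list; the values mirror the example trace):

```python
def summarize_trace(spans):
    """Answer two basic debugging questions from one run's spans:
    which step was slowest, and where the first error happened."""
    slowest = max(spans, key=lambda s: s["latency_ms"])
    first_error = next((s for s in spans if s["status"] == "error"), None)
    return {
        "slowest_step": slowest["step_type"],
        "slowest_latency_ms": slowest["latency_ms"],
        "first_error_step": first_error["step_type"] if first_error else None,
    }


spans = [
    {"step_type": "llm_reasoning", "latency_ms": 320, "status": "ok"},
    {"step_type": "tool_call:search", "latency_ms": 410, "status": "ok"},
    {"step_type": "llm_reasoning", "latency_ms": 180, "status": "ok"},
    {"step_type": "tool_call:fetch", "latency_ms": 260, "status": "error"},
]
summary = summarize_trace(spans)
# slowest step: tool_call:search (410ms); first error: tool_call:fetch
```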

When To Use

Tracing is not always required.

For a simple scenario β€” one LLM call without tools and without an execution loop β€” basic logging is often enough.

But if a run has multiple steps, tool calls, or repeated iterations, without tracing it becomes hard to:

  • debug agent behavior;
  • control latency and cost;
  • explain why the system made a specific decision.

Implementation Example

Below is a simplified runtime instrumentation example for a trace and its spans. A similar approach is used in frameworks such as LangGraph and CrewAI, as well as in custom agent runtimes. In this example, the full run is also represented as a root span, and agent steps are logged as nested spans.

PYTHON
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("agent")
trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


def start_span(run_id, step_type, tool=None, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
        },
    )
    return span_id, started_at


def finish_span(
    run_id,
    span_id,
    step_type,
    started_at,
    status,
    tool=None,
    parent_span_id=None,
    error=None,
):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )


def run_agent(agent, task):
    trace_id = str(uuid.uuid4())
    run_id = str(uuid.uuid4())  # in multi-agent systems one trace_id can include multiple run_id values
    token = trace_id_ctx.set(trace_id)

    logger.info("trace_started", extra={"trace_id": trace_id, "run_id": run_id, "task": task})

    stop_reason = "max_steps"
    step_count = 0
    root_span_id, root_started_at = start_span(run_id, "run", parent_span_id=None)

    try:
        # in this example all steps are children of root span (no deep nesting)
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_count += 1
            step_type = step.type  # reasoning | tool_call | llm_generate
            tool_name = getattr(step, "tool_name", None)

            span_id, started_at = start_span(
                run_id,
                step_type,
                tool=tool_name,
                parent_span_id=root_span_id,
            )

            try:
                result = step.execute()
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="ok",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                )
            except Exception as error:
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="error",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                    error=str(error),
                )
                stop_reason = "tool_error"
                raise

            if result.is_final:
                stop_reason = "completed"
                break

    finally:
        root_status = "ok" if stop_reason == "completed" else "error"  # simplified: limit stops also map to error
        finish_span(
            run_id,
            root_span_id,
            "run",
            root_started_at,
            status=root_status,
            error=None if root_status == "ok" else stop_reason,
        )

        logger.info(
            "trace_finished",
            extra={
                "trace_id": trace_id,
                "run_id": run_id,
                "steps": step_count,
                "stop_reason": stop_reason,
            },
        )
        trace_id_ctx.reset(token)

In real systems, trace_id and run_id should be propagated through the full call chain. In Python, teams often use contextvars so identifiers do not need to be passed manually through every function.
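A minimal sketch of that pattern: the handler sets the contextvar once, and any function deeper in the call chain can read the identifier without it being passed as a parameter (function names here are illustrative):

```python
import contextvars
import uuid

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


def call_tool():
    # Deep inside the call chain: no trace_id parameter needed,
    # the contextvar carries it implicitly.
    return {"event": "tool_called", "trace_id": trace_id_ctx.get()}


def handle_request():
    token = trace_id_ctx.set(str(uuid.uuid4()))
    try:
        return call_tool()
    finally:
        trace_id_ctx.reset(token)  # avoid leaking into the next request


event = handle_request()
```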

For example, one structured log span can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "span_finished",
  "trace_id": "tr_9fd2",
  "run_id": "run_9fd2",
  "span_id": "sp_21ab",
  "parent_span_id": "sp_root_01",
  "step_type": "tool_call",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}
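One way to produce such records is a custom logging.Formatter that serializes selected span attributes to JSON, one object per line (a sketch; the field list is illustrative and mirrors the span fields above):

```python
import json
import logging


class SpanJsonFormatter(logging.Formatter):
    """Serialize span events emitted via logger.info(..., extra={...})
    into one JSON object per line."""

    FIELDS = ("trace_id", "run_id", "span_id", "parent_span_id",
              "step_type", "tool", "latency_ms", "status")

    def format(self, record):
        payload = {"event": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


formatter = SpanJsonFormatter()
record = logging.LogRecord("agent", logging.INFO, __file__, 0,
                           "span_finished", None, None)
record.trace_id = "tr_9fd2"
record.status = "ok"
line = formatter.format(record)
```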

Common Mistakes

Even with tracing added, systems often remain hard to diagnose because of the typical mistakes below.

Trace only at run level, without spans

If only run start and run finish are logged, tracing loses most of its value: intermediate steps are invisible, and delay or failure is almost impossible to localize.

Missing trace_id in part of events

When some logs have no trace_id or run_id, events cannot be stitched into one timeline. Because of this, debugging takes much longer even for simple incidents.
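One way to prevent dropped identifiers is a logging filter that injects the current trace_id into every record automatically, using the same contextvars pattern as the implementation example above (a sketch; the contextvar is redefined here to keep the block self-contained):

```python
import contextvars
import logging

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to every log record, so no event
    can be emitted without it."""

    def filter(self, record):
        record.trace_id = trace_id_ctx.get()
        return True


logger = logging.getLogger("agent.tools")
logger.addFilter(TraceIdFilter())
```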

Tool calls are not traced

Tools are often the slowest part of a run. If tool calls are missing from trace data, it is hard to find the cause of delays and repeats. In production, this can mask tool failure or tool spam.
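A decorator around tool functions guarantees every call produces a span, even when the tool raises (a sketch; span recording is simplified to appending to a list instead of sending to a trace backend):

```python
import functools
import time

tool_spans = []  # in a real runtime this would go to the trace backend


def traced_tool(func):
    """Wrap a tool so every call is recorded as a span,
    including failures."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started_at = time.time()
        span = {"step_type": "tool_call", "tool": func.__name__}
        try:
            result = func(*args, **kwargs)
            span["status"] = "ok"
            return result
        except Exception as error:
            span["status"] = "error"
            span["error"] = str(error)
            raise
        finally:
            span["latency_ms"] = int((time.time() - started_at) * 1000)
            tool_spans.append(span)
    return wrapper


@traced_tool
def search(query):
    return ["doc_1", "doc_2"]


results = search("battery recycling")
```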

No stop_reason and span status

Without stop_reason and status, it is hard to tell whether a run completed successfully or stopped because of limits or errors. As a result, incident reconstruction and alert tuning become harder.
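Recording stop_reason makes run status derivable instead of guessed. A minimal mapping sketch (completed, max_steps, and tool_error follow the examples earlier on this page; budget_exceeded and timeout are assumed reason names for illustration):

```python
def run_status(stop_reason):
    """Derive a coarse run status from stop_reason, e.g. for alerting."""
    if stop_reason == "completed":
        return "ok"
    if stop_reason in ("max_steps", "budget_exceeded", "timeout"):
        return "stopped_by_limit"
    return "error"  # tool_error and anything unexpected
```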

Self-Check

Below is a short checklist for baseline tracing before release.

  • Every event carries trace_id and run_id.
  • Each step is logged as its own span with a span_id.
  • Nested steps record parent_span_id.
  • Spans record step_type (reasoning, tool_call, llm_generate).
  • Spans record latency_ms.
  • Spans record status (ok / error).
  • Tool calls are traced, including the tool name.
  • Every run records a stop_reason.
  • Logs are structured (for example JSON) so events can be stitched into one timeline.

If most of these are missing, the system will be hard to debug in production: start with run_id, structured logs, and traced tool calls.

FAQ

Q: How is a trace different from regular logs?
A: Logs answer "what happened". A trace shows the sequence of one run and helps explain "how exactly it happened".

Q: What should be implemented first for agent tracing?
A: Minimum: trace_id, run_id, span_id, step type, latency, status, and stop_reason. This is already enough for baseline debugging.

Q: Is it required to connect an external tracing tool immediately?
A: No. You can start with your own instrumentation and JSON logs. External platforms become especially useful when run count and team count grow.

Q: When can full tracing be overkill?
A: For simple single-shot scenarios without tools and without execution loops, baseline logging is often enough. Full tracing becomes especially useful when runs include multiple steps, external tools, or repeated iterations.

Next pages on this topic:

Add guardrails to tool-calling agents
⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.