Agent tracing: how to track agent decisions

Agent tracing records each agent step as a trace made of spans, shows reasoning and tool calls, and helps debug problematic runs in production.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. What a single-run trace looks like
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
       • Trace only at run level, without spans
       • Missing trace_id in part of events
       • Tool calls are not traced
       • No stop_reason and span status
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Agent tracing shows the full execution path of one run.

A trace is made of spans: each span is one step, for example reasoning, a tool call, or LLM generation.

This gives step-level visibility and makes debugging in production much easier.

Core Problem

In many systems, teams log only run start and run finish.

For agents, this is not enough: between start and final answer there can be dozens of steps. Without tracing, it is hard to understand what exactly the agent did and at which step the issue appeared.

The same request can execute differently: different step count, different tools, different latency.

Without tracing, even basic questions are hard to answer:

  • Which step was the slowest?
  • Why did the agent call a tool again?
  • Where exactly did the error happen?
  • Why did token usage grow in this specific run?

That is why tracing matters: it shows the full run execution path, not only the final result.

How It Works

Tracing has two core entities:

  • trace β€” the full path of one run
  • span β€” one step inside that trace

In practice, a runtime step often maps to one span, but not always. A complex step can contain nested spans, for example a tool call that internally makes multiple HTTP requests or database calls.

Each span usually has baseline fields:

  • trace_id and run_id for correlation
  • span_id (and parent_span_id when needed)
  • step_type (reasoning, tool_call, llm_generate)
  • latency_ms and status (ok / error)

This structure (trace_id, span_id) follows the OpenTelemetry (OTel) standard used by most modern monitoring systems. The parent_span_id field comes from OTel's hierarchical span model and is used to build an execution tree (trace tree). Specialized tracing tools exist for agents (for example LangSmith, Langfuse, Arize Phoenix), but the principles are the same regardless of platform.
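To illustrate how parent_span_id turns a flat list of span events into an execution tree, here is a small sketch (the span records are hypothetical and trimmed to the fields needed for tree building):

```python
def build_trace_tree(spans):
    """Group spans by parent_span_id so a trace can be rendered as a
    tree: root spans at the top, children nested under their parents."""
    children = {}
    for span in spans:
        children.setdefault(span.get("parent_span_id"), []).append(span)

    def attach(parent_id):
        return [
            {**span, "children": attach(span["span_id"])}
            for span in children.get(parent_id, [])
        ]

    return attach(None)  # root spans have no parent_span_id


spans = [
    {"span_id": "sp_root", "parent_span_id": None, "step_type": "run"},
    {"span_id": "sp_1", "parent_span_id": "sp_root", "step_type": "tool_call"},
    {"span_id": "sp_2", "parent_span_id": "sp_1", "step_type": "llm_generate"},
]
tree = build_trace_tree(spans)
```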

What a single-run trace looks like

The easiest way to understand tracing is to walk through a single request.

In real systems, each span event includes trace_id, span_id, and often parent_span_id. In the example below, these fields are shortened for readability.

TEXT
trace_id: tr_9fd2
run_id: run_9fd2
user_query: "Find recent research about battery recycling"

span 1  llm_reasoning          320ms   status=ok
span 2  tool_call: search      410ms   status=ok
span 3  llm_reasoning          180ms   status=ok
span 4  tool_call: fetch       260ms   status=error

stop_reason: tool_error

This trace immediately shows:

  • what steps the agent executed;
  • which tools were called;
  • how long each step took;
  • where delay or failure happened.

Traces are useful not only for debugging. They are also important for evaluations and automatic validation of intermediate steps: without traces it is hard to verify whether the agent acted correctly, not only whether it produced the right final answer.
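Given span records like the ones in the example above, the questions from the Core Problem section can be answered mechanically. A sketch (span fields follow the earlier list; the values mirror the example trace):

```python
def summarize_trace(spans):
    """Answer two basic debugging questions from one run's spans:
    which step was slowest, and where the first error happened."""
    slowest = max(spans, key=lambda s: s["latency_ms"])
    first_error = next((s for s in spans if s["status"] == "error"), None)
    return {
        "slowest_step": slowest["step_type"],
        "slowest_latency_ms": slowest["latency_ms"],
        "first_error_step": first_error["step_type"] if first_error else None,
    }


spans = [
    {"step_type": "llm_reasoning", "latency_ms": 320, "status": "ok"},
    {"step_type": "tool_call:search", "latency_ms": 410, "status": "ok"},
    {"step_type": "llm_reasoning", "latency_ms": 180, "status": "ok"},
    {"step_type": "tool_call:fetch", "latency_ms": 260, "status": "error"},
]
summary = summarize_trace(spans)
# slowest step: tool_call:search (410ms); first error: tool_call:fetch
```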

When To Use

Tracing is not always required.

For a simple scenario β€” one LLM call without tools and without an execution loop β€” basic logging is often enough.

But if a run has multiple steps, tool calls, or repeated iterations, without tracing it becomes hard to:

  • debug agent behavior;
  • control latency and cost;
  • explain why the system made a specific decision.

Implementation Example

Below is a simplified runtime instrumentation example for a trace and its spans. A similar approach is used in frameworks such as LangGraph and CrewAI, as well as in custom agent runtimes. In this example, the full run is also represented as a root span, and agent steps are logged as nested spans.

PYTHON
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("agent")
trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


def start_span(run_id, step_type, tool=None, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
        },
    )
    return span_id, started_at


def finish_span(
    run_id,
    span_id,
    step_type,
    started_at,
    status,
    tool=None,
    parent_span_id=None,
    error=None,
):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )


def run_agent(agent, task):
    trace_id = str(uuid.uuid4())
    run_id = str(uuid.uuid4())  # in multi-agent systems one trace_id can include multiple run_id values
    token = trace_id_ctx.set(trace_id)

    logger.info("trace_started", extra={"trace_id": trace_id, "run_id": run_id, "task": task})

    stop_reason = "max_steps"
    step_count = 0
    root_span_id, root_started_at = start_span(run_id, "run", parent_span_id=None)

    try:
        # in this example all steps are children of root span (no deep nesting)
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_count += 1
            step_type = step.type  # reasoning | tool_call | llm_generate
            tool_name = getattr(step, "tool_name", None)

            span_id, started_at = start_span(
                run_id,
                step_type,
                tool=tool_name,
                parent_span_id=root_span_id,
            )

            try:
                result = step.execute()
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="ok",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                )
            except Exception as error:
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="error",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                    error=str(error),
                )
                stop_reason = "tool_error"
                raise

            if result.is_final:
                stop_reason = "completed"
                break

    finally:
        root_status = "ok" if stop_reason == "completed" else "error"  # simplified: limit stops also map to error
        finish_span(
            run_id,
            root_span_id,
            "run",
            root_started_at,
            status=root_status,
            error=None if root_status == "ok" else stop_reason,
        )

        logger.info(
            "trace_finished",
            extra={
                "trace_id": trace_id,
                "run_id": run_id,
                "steps": step_count,
                "stop_reason": stop_reason,
            },
        )
        trace_id_ctx.reset(token)

In real systems, trace_id and run_id should be propagated through the full call chain. In Python, teams often use contextvars so identifiers do not need to be passed manually through every function.
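A minimal sketch of that pattern: the handler sets the contextvar once, and any function deeper in the call chain can read the identifier without it being passed as a parameter (function names here are illustrative):

```python
import contextvars
import uuid

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


def call_tool():
    # Deep inside the call chain: no trace_id parameter needed,
    # the contextvar carries it implicitly.
    return {"event": "tool_called", "trace_id": trace_id_ctx.get()}


def handle_request():
    token = trace_id_ctx.set(str(uuid.uuid4()))
    try:
        return call_tool()
    finally:
        trace_id_ctx.reset(token)  # avoid leaking into the next request


event = handle_request()
```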

For example, one structured log span can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "span_finished",
  "trace_id": "tr_9fd2",
  "run_id": "run_9fd2",
  "span_id": "sp_21ab",
  "parent_span_id": "sp_root_01",
  "step_type": "tool_call",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}
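One way to produce such records is a custom logging.Formatter that serializes selected span attributes to JSON, one object per line (a sketch; the field list is illustrative and mirrors the span fields above):

```python
import json
import logging


class SpanJsonFormatter(logging.Formatter):
    """Serialize span events emitted via logger.info(..., extra={...})
    into one JSON object per line."""

    FIELDS = ("trace_id", "run_id", "span_id", "parent_span_id",
              "step_type", "tool", "latency_ms", "status")

    def format(self, record):
        payload = {"event": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


formatter = SpanJsonFormatter()
record = logging.LogRecord("agent", logging.INFO, __file__, 0,
                           "span_finished", None, None)
record.trace_id = "tr_9fd2"
record.status = "ok"
line = formatter.format(record)
```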

Common Mistakes

Even with tracing added, systems often remain hard to diagnose because of the typical mistakes below.

Trace only at run level, without spans

If only run start and run finish are logged, tracing loses most of its value: intermediate steps are invisible, and delay or failure is almost impossible to localize.

Missing trace_id in part of events

When some logs have no trace_id or run_id, events cannot be stitched into one timeline. Because of this, debugging takes much longer even for simple incidents.
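One way to prevent dropped identifiers is a logging filter that injects the current trace_id into every record automatically, using the same contextvars pattern as the implementation example above (a sketch; the contextvar is redefined here to keep the block self-contained):

```python
import contextvars
import logging

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Attach the current trace_id to every log record, so no event
    can be emitted without it."""

    def filter(self, record):
        record.trace_id = trace_id_ctx.get()
        return True


logger = logging.getLogger("agent.tools")
logger.addFilter(TraceIdFilter())
```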

Tool calls are not traced

Tools are often the slowest part of a run. If tool calls are missing from trace data, it is hard to find the cause of delays and repeats. In production, this can mask tool failure or tool spam.
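A decorator around tool functions guarantees every call produces a span, even when the tool raises (a sketch; span recording is simplified to appending to a list instead of sending to a trace backend):

```python
import functools
import time

tool_spans = []  # in a real runtime this would go to the trace backend


def traced_tool(func):
    """Wrap a tool so every call is recorded as a span,
    including failures."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started_at = time.time()
        span = {"step_type": "tool_call", "tool": func.__name__}
        try:
            result = func(*args, **kwargs)
            span["status"] = "ok"
            return result
        except Exception as error:
            span["status"] = "error"
            span["error"] = str(error)
            raise
        finally:
            span["latency_ms"] = int((time.time() - started_at) * 1000)
            tool_spans.append(span)
    return wrapper


@traced_tool
def search(query):
    return ["doc_1", "doc_2"]


results = search("battery recycling")
```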

No stop_reason and span status

Without stop_reason and status, it is hard to tell whether a run completed successfully or stopped because of limits or errors. As a result, incident reconstruction and alert tuning become harder.
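Recording stop_reason makes run status derivable instead of guessed. A minimal mapping sketch (completed, max_steps, and tool_error follow the examples earlier on this page; budget_exceeded and timeout are assumed reason names for illustration):

```python
def run_status(stop_reason):
    """Derive a coarse run status from stop_reason, e.g. for alerting."""
    if stop_reason == "completed":
        return "ok"
    if stop_reason in ("max_steps", "budget_exceeded", "timeout"):
        return "stopped_by_limit"
    return "error"  # tool_error and anything unexpected
```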

Self-Check

Below is a short checklist for baseline tracing before release.

  • Every event carries trace_id and run_id.
  • Each step is logged as its own span with a span_id.
  • Nested steps record parent_span_id.
  • Spans record step_type (reasoning, tool_call, llm_generate).
  • Spans record latency_ms.
  • Spans record status (ok / error).
  • Tool calls are traced, including the tool name.
  • Every run records a stop_reason.
  • Logs are structured (for example JSON) so events can be stitched into one timeline.

If most of these are missing, the system will be hard to debug in production: start with run_id, structured logs, and traced tool calls.

FAQ

Q: How is a trace different from regular logs?
A: Logs answer "what happened". A trace shows the sequence of one run and helps explain "how exactly it happened".

Q: What should be implemented first for agent tracing?
A: Minimum: trace_id, run_id, span_id, step type, latency, status, and stop_reason. This is already enough for baseline debugging.

Q: Is it required to connect an external tracing tool immediately?
A: No. You can start with your own instrumentation and JSON logs. External platforms become especially useful when run count and team count grow.

Q: When can full tracing be overkill?
A: For simple single-shot scenarios without tools and without execution loops, baseline logging is often enough. Full tracing becomes especially useful when runs include multiple steps, external tools, or repeated iterations.

Next pages on this topic:

Add guardrails to tool-calling agents
⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.