Idea In 30 Seconds
Agent tracing shows the full execution path of one run.
A trace is made of spans: each span is one step, for example reasoning, a tool call, or an LLM generation.
This gives step-level visibility and makes debugging in production much easier.
Core Problem
In many systems, teams log only run start and run finish.
For agents, this is not enough: between start and final answer there can be dozens of steps. Without tracing, it is hard to understand what exactly the agent did and at which step the issue appeared.
The same request can execute differently: different step count, different tools, different latency.
Without tracing, even basic questions are hard to answer:
- Which step was the slowest?
- Why did the agent call a tool again?
- Where exactly did the error happen?
- Why did token usage grow in this specific run?
That is why tracing matters: it shows the full run execution path, not only the final result.
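For illustration, once steps are recorded as span records, questions like "which step was slowest?" and "where did the error happen?" become one-liners. The field names below are assumptions that mirror the baseline span fields described later in this article:

```python
# Hypothetical span records; field names (step_type, latency_ms, status)
# follow the baseline fields described in this article.
spans = [
    {"span_id": "sp_01", "step_type": "llm_reasoning", "latency_ms": 320, "status": "ok"},
    {"span_id": "sp_02", "step_type": "tool_call", "latency_ms": 410, "status": "ok"},
    {"span_id": "sp_03", "step_type": "llm_reasoning", "latency_ms": 180, "status": "ok"},
    {"span_id": "sp_04", "step_type": "tool_call", "latency_ms": 260, "status": "error"},
]

# slowest step: the span with the highest latency
slowest = max(spans, key=lambda s: s["latency_ms"])

# failing steps: spans that finished with status=error
failed = [s for s in spans if s["status"] == "error"]

print(slowest["span_id"])               # sp_02
print([s["span_id"] for s in failed])   # ['sp_04']
```

Without spans, answering either question means grepping unstructured logs and guessing at boundaries between steps.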
How It Works
Tracing has two core entities:
- trace: the full path of one run;
- span: one step inside that trace.
In practice, a runtime step often maps to one span, but not always.
A complex step can contain nested spans, for example a tool call that internally makes multiple HTTP requests or database calls.
Each span usually has baseline fields:
- trace_id and run_id for correlation;
- span_id (and parent_span_id when needed);
- step_type (reasoning, tool_call, llm_generate);
- latency_ms and status (ok/error).
This structure (trace_id, span_id) is based on the OpenTelemetry (OTel) standard used by most modern monitoring systems.
The parent_span_id field is part of OTel hierarchical spans and is used to build an execution tree (trace tree).
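As a sketch, a flat list of span records can be folded into such a tree by grouping on parent_span_id (field names are assumptions matching the baseline fields above):

```python
from collections import defaultdict

def build_trace_tree(spans):
    """Group spans into {parent_span_id: [child spans]}; roots have parent None."""
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_span_id")].append(span)
    return children

def print_tree(children, parent=None, depth=0):
    """Print the execution tree with indentation showing nesting."""
    for span in children.get(parent, []):
        print("  " * depth + span["span_id"] + " " + span["step_type"])
        print_tree(children, span["span_id"], depth + 1)

spans = [
    {"span_id": "sp_root", "parent_span_id": None, "step_type": "run"},
    {"span_id": "sp_01", "parent_span_id": "sp_root", "step_type": "tool_call"},
    {"span_id": "sp_02", "parent_span_id": "sp_01", "step_type": "http_request"},
]
print_tree(build_trace_tree(spans))
```

This is exactly how tracing UIs render the waterfall view: roots are spans with no parent, and everything else hangs off its parent_span_id.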
There are specialized tracing tools for agents (for example LangSmith, Langfuse, Arize Phoenix), but these principles are the same regardless of platform.
What a single-run trace looks like
The easiest way to understand tracing is to walk through a single request.
In real systems, each span event includes trace_id, span_id, and often parent_span_id.
In the example below, these fields are shortened for readability.
trace_id: tr_9fd2
run_id: run_9fd2
user_query: "Find recent research about battery recycling"
span 1 llm_reasoning 320ms status=ok
span 2 tool_call: search 410ms status=ok
span 3 llm_reasoning 180ms status=ok
span 4 tool_call: fetch 260ms status=error
stop_reason: tool_error
This trace immediately shows:
- what steps the agent executed;
- which tools were called;
- how long each step took;
- where delay or failure happened.
Traces are useful not only for debugging. They are also important for evaluations and automatic validation of intermediate steps: without traces it is hard to verify whether the agent acted correctly, not only whether it produced the right final answer.
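For example, an intermediate-step check might assert that tools were called in a sensible order. The sketch below (tool names and the rule itself are illustrative) validates that a search tool ran before any fetch tool, using the span fields from the example above:

```python
def tool_sequence(spans):
    """Return the ordered list of tool names from tool_call spans."""
    return [s["tool"] for s in spans if s["step_type"] == "tool_call"]

def validate_search_before_fetch(spans):
    """Example intermediate-step check: 'fetch' must not run before 'search'."""
    tools = tool_sequence(spans)
    if "fetch" in tools and "search" in tools:
        return tools.index("search") < tools.index("fetch")
    return True

spans = [
    {"step_type": "llm_reasoning"},
    {"step_type": "tool_call", "tool": "search"},
    {"step_type": "tool_call", "tool": "fetch"},
]
print(validate_search_before_fetch(spans))  # True
```

Checks like this can run over every production trace, turning "did the agent behave correctly?" from a manual review into an automatic evaluation.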
When To Use
Tracing is not always required.
For a simple scenario (one LLM call, no tools, no execution loop), basic logging is often enough.
But if a run has multiple steps, tool calls, or repeated iterations, without tracing it becomes hard to:
- debug agent behavior;
- control latency and cost;
- explain why the system made a specific decision.
Implementation Example
Below is a simplified example of runtime instrumentation for traces and spans. This approach is used in LangGraph, CrewAI, and custom agent runtimes. In this example, the full run is also represented as a root span, and agent steps are logged as nested spans.
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("agent")
trace_id_ctx = contextvars.ContextVar("trace_id", default=None)

def start_span(run_id, step_type, tool=None, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
        },
    )
    return span_id, started_at

def finish_span(
    run_id,
    span_id,
    step_type,
    started_at,
    status,
    tool=None,
    parent_span_id=None,
    error=None,
):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "run_id": run_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "step_type": step_type,
            "tool": tool,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )

def run_agent(agent, task):
    trace_id = str(uuid.uuid4())
    run_id = str(uuid.uuid4())  # in multi-agent systems one trace_id can include multiple run_id values
    token = trace_id_ctx.set(trace_id)
    logger.info("trace_started", extra={"trace_id": trace_id, "run_id": run_id, "task": task})
    stop_reason = "max_steps"
    step_count = 0
    root_span_id, root_started_at = start_span(run_id, "run", parent_span_id=None)
    try:
        # in this example all steps are children of the root span (no deep nesting)
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_count += 1
            step_type = step.type  # reasoning | tool_call | llm_generate
            tool_name = getattr(step, "tool_name", None)
            span_id, started_at = start_span(
                run_id,
                step_type,
                tool=tool_name,
                parent_span_id=root_span_id,
            )
            try:
                result = step.execute()
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="ok",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                )
            except Exception as error:
                finish_span(
                    run_id,
                    span_id,
                    step_type,
                    started_at,
                    status="error",
                    tool=tool_name,
                    parent_span_id=root_span_id,
                    error=str(error),
                )
                stop_reason = "tool_error"
                raise
            if result.is_final:
                stop_reason = "completed"
                break
    finally:
        if stop_reason == "completed":
            root_status = "ok"
        elif stop_reason == "max_steps":
            root_status = "error"  # simplified for this example
        else:
            root_status = "error"
        finish_span(
            run_id,
            root_span_id,
            "run",
            root_started_at,
            status=root_status,
            error=None if root_status == "ok" else stop_reason,
        )
        logger.info(
            "trace_finished",
            extra={
                "trace_id": trace_id,
                "run_id": run_id,
                "steps": step_count,
                "stop_reason": stop_reason,
            },
        )
        trace_id_ctx.reset(token)
In real systems, trace_id and run_id should be propagated through the full call chain.
In Python, teams often use contextvars so identifiers do not need to be passed manually through every function.
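A minimal illustration of why contextvars helps: once the identifier is set, any function running in the same execution context can read it without an extra parameter. Names here are illustrative, not part of any framework API:

```python
import contextvars

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)

def inner_helper():
    # no trace_id argument: the value is read from the ambient context
    return trace_id_ctx.get()

def handle_request(trace_id):
    token = trace_id_ctx.set(trace_id)
    try:
        return inner_helper()
    finally:
        trace_id_ctx.reset(token)  # restore the previous value for the next request

print(handle_request("tr_9fd2"))  # tr_9fd2
```

The same pattern works with asyncio, because each task gets its own copy of the context, so concurrent runs do not overwrite each other's trace_id.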
For example, one structured log span can look like this:
{
"timestamp": "2026-03-21T15:17:00Z",
"event": "span_finished",
"trace_id": "tr_9fd2",
"run_id": "run_9fd2",
"span_id": "sp_21ab",
"parent_span_id": "sp_root_01",
"step_type": "tool_call",
"tool": "search_docs",
"latency_ms": 410,
"status": "ok"
}
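Given a stream of such structured events, a per-trace timeline can be rebuilt by grouping on trace_id and sorting by timestamp. A minimal sketch (the event dicts are made-up samples in the same shape as above):

```python
from collections import defaultdict

events = [
    {"timestamp": "2026-03-21T15:17:00Z", "trace_id": "tr_9fd2", "event": "span_finished"},
    {"timestamp": "2026-03-21T15:16:58Z", "trace_id": "tr_9fd2", "event": "span_started"},
    {"timestamp": "2026-03-21T15:16:55Z", "trace_id": "tr_7777", "event": "trace_started"},
]

# group events by trace_id
timelines = defaultdict(list)
for event in events:
    timelines[event["trace_id"]].append(event)

# ISO-8601 UTC timestamps sort correctly as plain strings
for trace_id in timelines:
    timelines[trace_id].sort(key=lambda e: e["timestamp"])

print([e["event"] for e in timelines["tr_9fd2"]])  # ['span_started', 'span_finished']
```

This is essentially what tracing backends do at scale; with JSON logs you can do the same ad hoc with a few lines of Python or a log-query language.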
Common Mistakes
Even with tracing added, systems often remain hard to diagnose because of the typical mistakes below.
Trace only at run level, without spans
If only run start and run finish are logged, tracing loses most of its value: intermediate steps are invisible, and delay or failure is almost impossible to localize.
Missing trace_id in part of events
When some logs have no trace_id or run_id, events cannot be stitched into one timeline.
Because of this, debugging takes much longer even for simple incidents.
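A cheap guard against this mistake is a pipeline check that flags events missing correlation fields. A sketch over dict-shaped events (field names follow this article's conventions):

```python
REQUIRED_FIELDS = ("trace_id", "run_id")

def missing_correlation(events):
    """Return events that cannot be stitched into a single timeline."""
    return [e for e in events if any(not e.get(f) for f in REQUIRED_FIELDS)]

events = [
    {"event": "span_started", "trace_id": "tr_9fd2", "run_id": "run_9fd2"},
    {"event": "span_finished", "trace_id": None, "run_id": "run_9fd2"},
    {"event": "tool_result"},  # no correlation fields at all
]
print(len(missing_correlation(events)))  # 2
```

Running a check like this in CI or on a log sample catches uninstrumented code paths before an incident does.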
Tool calls are not traced
Tools are often the slowest part of a run. If tool calls are missing from trace data, it is hard to find the cause of delays and repeats. In production, this can mask tool failure or tool spam.
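A quick heuristic for tool spam once tool calls are traced: count identical tool calls within one trace and flag repeats above a threshold (tool names and the threshold are illustrative):

```python
from collections import Counter

def repeated_tools(spans, threshold=3):
    """Return tools called at least `threshold` times in one trace."""
    counts = Counter(s["tool"] for s in spans if s.get("step_type") == "tool_call")
    return {tool: n for tool, n in counts.items() if n >= threshold}

spans = [{"step_type": "tool_call", "tool": "search"}] * 4 + [
    {"step_type": "tool_call", "tool": "fetch"},
    {"step_type": "llm_reasoning"},
]
print(repeated_tools(spans))  # {'search': 4}
```

Without tool spans, the only symptom of this pattern is an unexplained jump in latency and cost.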
No stop_reason and span status
Without stop_reason and status, it is hard to tell whether a run completed successfully or stopped because of limits or errors.
As a result, incident reconstruction and alert tuning become harder.
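With stop_reason recorded per run, alert tuning can start from a simple distribution over recent runs. A sketch with made-up data:

```python
from collections import Counter

runs = [
    {"run_id": "run_01", "stop_reason": "completed"},
    {"run_id": "run_02", "stop_reason": "tool_error"},
    {"run_id": "run_03", "stop_reason": "completed"},
    {"run_id": "run_04", "stop_reason": "max_steps"},
]

distribution = Counter(r["stop_reason"] for r in runs)
failure_rate = 1 - distribution["completed"] / len(runs)

print(distribution.most_common())  # [('completed', 2), ('tool_error', 1), ('max_steps', 1)]
print(failure_rate)                # 0.5
```

An alert on failure_rate, or on a spike in a single stop_reason like max_steps, is far more actionable than alerting on raw error logs.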
Self-Check
Below is a short checklist for baseline tracing before release.
- every event carries trace_id and run_id;
- each step is logged as a span with span_id, step type, latency, and status;
- tool calls are traced, including errors;
- each run records a stop_reason.
If most of these items are missing, baseline observability is missing and the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.
FAQ
Q: How is a trace different from regular logs?
A: Logs answer "what happened". A trace shows the sequence of one run and helps explain "how exactly it happened".
Q: What should be implemented first for agent tracing?
A: Minimum: trace_id, run_id, span_id, step type, latency, status, and stop_reason. This is already enough for baseline debugging.
Q: Is it required to connect an external tracing tool immediately?
A: No. You can start with your own instrumentation and JSON logs. External platforms become especially useful when run count and team count grow.
Q: When can full tracing be overkill?
A: For simple single-shot scenarios without tools and without execution loops, baseline logging is often enough. Full tracing becomes especially useful when runs include multiple steps, external tools, or repeated iterations.
Related Pages
Next pages on this topic:
- Observability for AI Agents: baseline model of traces, logs, and metrics.
- Distributed Agent Tracing: how to link traces across multiple services.
- Debugging Agent Runs: how to investigate problematic runs step by step.
- Agent Logging: which events to record in logs.
- Agent Metrics: which production indicators to track.