Idea In 30 Seconds
Distributed tracing shows a single run not just inside one service, but across the full call chain.
In multi-agent systems, a request often goes through gateway, runtime, tools, queues, and LLM providers.
Distributed tracing links these steps via trace_id, span_id, and parent_span_id, so system behavior is visible end-to-end.
Core Problem
When an agent runs across multiple services, logs are usually scattered.
You can separately see gateway errors, tool-service errors, and agent-runtime errors.
But these events are not linked as one run.
Without shared trace context, it is hard to understand that this is the same run.
As a result, even a simple incident becomes a long investigation:
- unclear which service introduced delay;
- unknown where context was lost;
- hard to connect retries across services;
- hard to reconstruct the full path of a problematic run.
That is why multi-agent systems need distributed tracing, not only local tracing inside one runtime.
How It Works
Distributed tracing uses the same trace and span model, but across multiple services.
- trace: the full path of one request across all services
- span: one concrete operation in one service
In real systems, these fields are usually based on OpenTelemetry (OTel):
- trace_id: shared identifier for the entire path
- span_id: identifier of the current step
- parent_span_id: relation to the parent step
- service_name: where the step ran
- operation_name: what the service did
To avoid broken traces, trace context must be passed between services with every call.
Most often this is done via headers (traceparent) or queue-message metadata.
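As an illustration of the header-based approach, here is a minimal sketch of building and parsing a W3C traceparent value. The helper names are hypothetical; in practice an OpenTelemetry propagator handles this automatically.

```python
import re
import secrets

# W3C traceparent format: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def build_traceparent(trace_id, span_id, sampled=True):
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Return (trace_id, parent_span_id) or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header or "")
    if not match:
        return None
    return match.group(1), match.group(2)

trace_id = secrets.token_hex(16)  # 32 hex chars
span_id = secrets.token_hex(8)    # 16 hex chars
header = build_traceparent(trace_id, span_id)
assert parse_traceparent(header) == (trace_id, span_id)
```

A malformed header should yield None rather than raising, so the receiving service can fall back to starting a new trace.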
What a distributed trace looks like
The easiest way to understand distributed tracing is to walk through one example request.
trace_id: tr_7a31
user_query: "Find vendor invoices for March"
gateway span g1 parent=- 18ms status=ok
agent_runtime span a1 parent=g1 240ms status=ok
tool_service span t1 parent=a1 410ms status=ok
agent_runtime span a2 parent=a1 130ms status=ok
llm_provider span l1 parent=a2 690ms status=ok
stop_reason: completed
This trace shows:
- the full cross-service path;
- which service introduced the highest latency;
- where context broke (if it happened);
- which spans were parent and which were child.
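Once spans are collected as structured records, questions like "which service introduced the highest latency" become one-liners. A small sketch, with the trace above hard-coded as illustrative data:

```python
# Hypothetical span records mirroring the example trace tr_7a31.
spans = [
    {"service": "gateway",       "span_id": "g1", "parent": None, "latency_ms": 18},
    {"service": "agent_runtime", "span_id": "a1", "parent": "g1", "latency_ms": 240},
    {"service": "tool_service",  "span_id": "t1", "parent": "a1", "latency_ms": 410},
    {"service": "agent_runtime", "span_id": "a2", "parent": "a1", "latency_ms": 130},
    {"service": "llm_provider",  "span_id": "l1", "parent": "a2", "latency_ms": 690},
]

# Find the span with the highest latency across all services.
slowest = max(spans, key=lambda s: s["latency_ms"])
print(slowest["service"], slowest["latency_ms"])  # llm_provider 690
```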
When To Use
Distributed tracing is not always required.
If the system is monolithic and the whole run lives in one process, local tracing is often enough.
But distributed tracing becomes critical when:
- one request passes through multiple services;
- workflow includes queues or async workers;
- multiple agents exchange events;
- you need precise analysis of latency and retries across services.
Implementation Example
Below is a simplified example of how to propagate trace context between a gateway and a worker service.
The example uses simplified headers (x-trace-id, x-parent-span-id) to show propagation mechanics.
In production, teams usually use standard W3C traceparent header (via OpenTelemetry) for automatic trace-context propagation across services.
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("distributed-tracing")

trace_id_ctx = contextvars.ContextVar("trace_id", default=None)
span_id_ctx = contextvars.ContextVar("span_id", default=None)


def start_span(service_name, operation_name, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
        },
    )
    return span_id, started_at


def finish_span(service_name, operation_name, span_id, started_at, status, parent_span_id=None, error=None):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )


def inject_context(headers):
    # Outgoing call: attach the current trace and span so the callee can link up.
    headers["x-trace-id"] = trace_id_ctx.get() or ""
    headers["x-parent-span-id"] = span_id_ctx.get() or ""


def extract_context(headers):
    # Incoming call: continue the caller's trace, or start a new one if none arrived.
    incoming_trace_id = headers.get("x-trace-id") or str(uuid.uuid4())
    incoming_parent_span_id = headers.get("x-parent-span-id")
    trace_token = trace_id_ctx.set(incoming_trace_id)
    return incoming_parent_span_id, trace_token


def gateway_handle_request():
    trace_id = str(uuid.uuid4())
    trace_token = trace_id_ctx.set(trace_id)
    root_span_id, root_started_at = start_span("gateway", "handle_request", parent_span_id=None)
    span_token = span_id_ctx.set(root_span_id)
    try:
        headers = {}
        inject_context(headers)
        call_worker_service(headers)  # example HTTP/gRPC call to worker_handle_request in another service
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="ok",
            parent_span_id=None,
        )
    except Exception as error:
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="error",
            parent_span_id=None,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)


def worker_handle_request(headers):
    parent_span_id, trace_token = extract_context(headers)
    span_id, started_at = start_span("worker", "process_task", parent_span_id=parent_span_id)
    span_token = span_id_ctx.set(span_id)
    try:
        # ... agent work, tool calls, LLM steps ...
        finish_span("worker", "process_task", span_id, started_at, status="ok", parent_span_id=parent_span_id)
    except Exception as error:
        finish_span(
            "worker",
            "process_task",
            span_id,
            started_at,
            status="error",
            parent_span_id=parent_span_id,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)
Even this manual approach illustrates the baseline mechanics of distributed tracing.
In a real workflow, each service creates its own span and then forwards that span_id as parent_span_id for the next hop.
If this step is skipped, the next service starts a new trace and the end-to-end picture breaks.
For example, one span event in JSON logs can look like this:
{
"timestamp": "2026-03-21T15:17:00Z",
"event": "span_finished",
"trace_id": "tr_7a31",
"span_id": "sp_worker_02",
"parent_span_id": "sp_gateway_01",
"service_name": "worker",
"operation_name": "process_task",
"latency_ms": 410,
"status": "ok"
}
Common Mistakes
Even when distributed tracing is already in place, several typical production issues often remain.
New trace_id in every service
If each service generates its own trace_id, the end-to-end trace breaks into pieces.
In this mode, it is hard to localize the incident root cause across services.
Only trace_id is passed, without span relations
trace_id without span_id and parent_span_id gives only a flat event list.
Without a span tree, it is hard to understand which steps were nested.
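As a sketch of why span relations matter, flat span events can be folded back into a tree once parent_span_id is present. The field names below follow the examples on this page and the event data is illustrative:

```python
from collections import defaultdict

# Illustrative flat span events, as they might arrive from structured logs.
events = [
    {"span_id": "g1", "parent_span_id": None, "service_name": "gateway"},
    {"span_id": "a1", "parent_span_id": "g1", "service_name": "agent_runtime"},
    {"span_id": "t1", "parent_span_id": "a1", "service_name": "tool_service"},
]

# Group spans by their parent to recover the call hierarchy.
children = defaultdict(list)
for event in events:
    children[event["parent_span_id"]].append(event)

def print_tree(parent_id=None, depth=0):
    for event in children.get(parent_id, []):
        print("  " * depth + f"{event['service_name']} ({event['span_id']})")
        print_tree(event["span_id"], depth + 1)

print_tree()
# gateway (g1)
#   agent_runtime (a1)
#     tool_service (t1)
```

Without parent_span_id, the grouping step above is impossible and the same events can only be listed flat.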
Context is lost in async queues
If queue metadata does not carry trace context, parts of the workflow fall out of the trace.
These gaps often mask an early phase of partial outage or cascading failures.
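A minimal sketch of carrying trace context inside queue-message metadata, so async workers stay in the same trace. The message shape and helper names here are illustrative, not a specific broker's API:

```python
import json
import uuid

def publish_task(payload, trace_id, parent_span_id):
    # Embed trace context next to the payload before sending to the broker.
    message = {
        "metadata": {"trace_id": trace_id, "parent_span_id": parent_span_id},
        "payload": payload,
    }
    return json.dumps(message)

def consume_task(raw_message):
    # Restore trace context on the consumer side; start a fresh trace only
    # if the metadata is missing entirely.
    message = json.loads(raw_message)
    meta = message.get("metadata", {})
    trace_id = meta.get("trace_id") or str(uuid.uuid4())
    return trace_id, meta.get("parent_span_id"), message["payload"]

raw = publish_task({"task": "parse_invoice"}, "tr_7a31", "sp_gateway_01")
trace_id, parent_span_id, payload = consume_task(raw)
assert trace_id == "tr_7a31" and parent_span_id == "sp_gateway_01"
```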
Missing service_name and operation_name
Without these fields, you can see that an error happened, but not in which service and operation. This makes debugging significantly slower.
Self-Check
Before release, verify the basics of distributed tracing: trace context is propagated with every call (including through queues), each service continues the incoming trace_id instead of generating a new one, spans carry span_id and parent_span_id, and every event includes service_name and operation_name. Without this baseline, the system will be hard to debug in production.
FAQ
Q: How is distributed tracing different from regular agent tracing?
A: Agent tracing shows steps inside one runtime. Distributed tracing links steps across multiple services into one end-to-end trace.
Q: What breaks if only trace_id is propagated between services, without span_id and parent_span_id?
A: The trace becomes flat: you can see events belong to one path, but it is hard to understand which steps were nested, which service called which, and where exactly delay or failure happened. trace_id links the global path, while span_id and parent_span_id build the step tree.
Q: How should trace context be passed through queues?
A: Add trace_id, span_id, and parent_span_id to message metadata. Otherwise async steps fall out of the trace.
Q: Is OpenTelemetry mandatory from day one?
A: No. You can start with manual propagation and structured logs, then move to OTel SDK as the system grows.
Related Pages
Next pages on this topic:
- Observability for AI Agents: baseline view of traces, logs, and metrics.
- Agent Tracing: tracing one run inside a service.
- Debugging Agent Runs: how to analyze problematic runs step by step.
- Failure Alerting: how to catch breaks and degradation early.
- Agent Metrics: which metrics are required for production monitoring.