Distributed tracing for agents: tracing multi-agent systems

Distributed tracing tracks one run across multiple services, queues, tools, and LLM providers while preserving trace context end-to-end.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. What a distributed trace looks like
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
     • New trace_id in every service
     • Only trace_id is passed, without span relations
     • Context is lost in async queues
     • Missing service_name and operation_name
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Distributed tracing shows one run across the full call chain, not just inside a single service.

In multi-agent systems, a request often passes through a gateway, an agent runtime, tools, queues, and LLM providers.

Distributed tracing links these steps via trace_id, span_id, and parent_span_id, so system behavior is visible end-to-end.

Core Problem

When an agent runs across multiple services, logs are usually scattered.

You can see gateway errors, tool-service errors, and agent-runtime errors separately, but nothing links these events as one run. Without shared trace context, it is hard to tell that they all belong to the same request.

As a result, even a simple incident becomes a long investigation:

  • unclear which service introduced delay;
  • unknown where context was lost;
  • hard to connect retries across services;
  • hard to reconstruct the full path of a problematic run.

That is why multi-agent systems need distributed tracing, not only local tracing inside one runtime.

How It Works

Distributed tracing uses the same trace and span model, but across multiple services.

  • trace β€” the full path of one request across all services
  • span β€” one concrete operation in one service

In real systems, these fields are usually based on OpenTelemetry (OTel):

  • trace_id β€” shared identifier for the entire path
  • span_id β€” identifier of the current step
  • parent_span_id β€” relation to the parent step
  • service_name β€” where the step ran
  • operation_name β€” what the service did

To avoid broken traces, trace context must be passed with every cross-service call. Most often this is done via HTTP headers (the W3C traceparent header) or queue-message metadata.
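As a sketch of what header-based propagation looks like, the W3C traceparent header packs a version, a 32-hex trace id, a 16-hex span id, and trace flags into one string. The helpers below build and parse it manually; in production an OpenTelemetry SDK does this for you, so treat the function names here as illustrative:

```python
# Minimal sketch of W3C traceparent handling (normally done by an OTel SDK).
# Format: "00-{32-hex trace-id}-{16-hex parent-id}-{2-hex flags}".
import re
import uuid


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    match = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not match:
        return None  # malformed header: the receiver should start a new trace
    _version, trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "sampled": flags == "01"}


trace_id = uuid.uuid4().hex       # 32 hex chars
span_id = uuid.uuid4().hex[:16]   # 16 hex chars
header = build_traceparent(trace_id, span_id)
ctx = parse_traceparent(header)
```

The receiving service keeps the incoming trace_id, treats the incoming span id as parent_span_id, and creates its own new span id for the work it does.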

What a distributed trace looks like

The easiest way to understand distributed tracing is to follow one request.

TEXT
trace_id: tr_7a31
user_query: "Find vendor invoices for March"

gateway         span g1   parent=-     18ms   status=ok
agent_runtime   span a1   parent=g1   240ms   status=ok
tool_service    span t1   parent=a1   410ms   status=ok
agent_runtime   span a2   parent=a1   130ms   status=ok
llm_provider    span l1   parent=a2   690ms   status=ok

stop_reason: completed

This trace shows:

  • the full cross-service path;
  • which service introduced the highest latency;
  • where context broke (if it happened);
  • which spans were parent and which were child.
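The trace above can also be analyzed mechanically. Given flat span records, a few lines of code rebuild the parent/child relations and find the slowest span; the field names below mirror the example trace and are illustrative, not a standard:

```python
# Sketch: rebuild a span tree from flat records and find the slowest span.
# Data mirrors the example trace tr_7a31 above.
spans = [
    {"service": "gateway",       "span": "g1", "parent": None, "latency_ms": 18},
    {"service": "agent_runtime", "span": "a1", "parent": "g1", "latency_ms": 240},
    {"service": "tool_service",  "span": "t1", "parent": "a1", "latency_ms": 410},
    {"service": "agent_runtime", "span": "a2", "parent": "a1", "latency_ms": 130},
    {"service": "llm_provider",  "span": "l1", "parent": "a2", "latency_ms": 690},
]

children = {}  # parent span id -> list of child span ids
for record in spans:
    children.setdefault(record["parent"], []).append(record["span"])

slowest = max(spans, key=lambda record: record["latency_ms"])
print(children[None])      # root spans
print(children["a1"])      # children of the runtime span a1
print(slowest["service"])  # service with the highest single-span latency
```

This is exactly what trace backends do at scale: the span tree answers "who called whom", and per-span latency answers "where the time went" (here, the llm_provider span).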

When To Use

Distributed tracing is not always required.

If the system is monolithic and the whole run lives in one process, local tracing is often enough.

But distributed tracing becomes critical when:

  • one request passes through multiple services;
  • workflow includes queues or async workers;
  • multiple agents exchange events;
  • you need precise analysis of latency and retries across services.

Implementation Example

Below is a simplified example of propagating trace context between a gateway and a worker service. It uses simplified headers (x-trace-id, x-parent-span-id) to show the propagation mechanics. In production, teams usually rely on the standard W3C traceparent header (via OpenTelemetry) for automatic trace-context propagation across services.

PYTHON
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("distributed-tracing")
trace_id_ctx = contextvars.ContextVar("trace_id", default=None)
span_id_ctx = contextvars.ContextVar("span_id", default=None)


def start_span(service_name, operation_name, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
        },
    )
    return span_id, started_at


def finish_span(service_name, operation_name, span_id, started_at, status, parent_span_id=None, error=None):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )


def inject_context(headers):
    headers["x-trace-id"] = trace_id_ctx.get() or ""
    headers["x-parent-span-id"] = span_id_ctx.get() or ""


def extract_context(headers):
    incoming_trace_id = headers.get("x-trace-id") or str(uuid.uuid4())
    incoming_parent_span_id = headers.get("x-parent-span-id")
    trace_token = trace_id_ctx.set(incoming_trace_id)
    return incoming_parent_span_id, trace_token


def gateway_handle_request():
    trace_id = str(uuid.uuid4())
    trace_token = trace_id_ctx.set(trace_id)

    root_span_id, root_started_at = start_span("gateway", "handle_request", parent_span_id=None)
    span_token = span_id_ctx.set(root_span_id)

    try:
        headers = {}
        inject_context(headers)
        call_worker_service(headers)  # example HTTP/gRPC call to worker_handle_request in another service
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="ok",
            parent_span_id=None,
        )
    except Exception as error:
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="error",
            parent_span_id=None,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)


def worker_handle_request(headers):
    parent_span_id, trace_token = extract_context(headers)
    span_id, started_at = start_span("worker", "process_task", parent_span_id=parent_span_id)
    span_token = span_id_ctx.set(span_id)

    try:
        # ... agent work, tool calls, LLM steps ...
        finish_span("worker", "process_task", span_id, started_at, status="ok", parent_span_id=parent_span_id)
    except Exception as error:
        finish_span(
            "worker",
            "process_task",
            span_id,
            started_at,
            status="error",
            parent_span_id=parent_span_id,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)

Even this manual approach illustrates the baseline mechanics of distributed tracing.

In a real workflow, each service usually creates its own span and then forwards this span_id as parent_span_id for the next hop. If this step is skipped, the next service starts a new trace and the end-to-end picture breaks.

For example, one span event in JSON logs can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "span_finished",
  "trace_id": "tr_7a31",
  "span_id": "sp_worker_02",
  "parent_span_id": "sp_gateway_01",
  "service_name": "worker",
  "operation_name": "process_task",
  "latency_ms": 410,
  "status": "ok"
}
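Note that the Python example above attaches span fields via `extra`, which Python's default logging formatter silently drops from the output. A minimal JSON formatter makes those fields appear as one JSON object per line; SpanJsonFormatter and SPAN_FIELDS are illustrative names, not part of any library:

```python
# Sketch: render span events (logged via `extra`) as JSON lines.
import json
import logging

SPAN_FIELDS = ("trace_id", "span_id", "parent_span_id", "service_name",
               "operation_name", "status", "latency_ms", "error")


class SpanJsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including span fields passed via `extra`."""

    def format(self, record):
        payload = {"event": record.getMessage()}
        for field in SPAN_FIELDS:
            # `extra` kwargs become attributes on the LogRecord.
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(SpanJsonFormatter())
logger = logging.getLogger("distributed-tracing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("span_finished", extra={"trace_id": "tr_7a31", "span_id": "sp_worker_02",
                                    "service_name": "worker", "status": "ok"})
```

With this handler attached, each span_started and span_finished event from the earlier example prints as a JSON line similar to the one shown above, ready for a log pipeline to index by trace_id.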

Common Mistakes

Even after distributed tracing is added, a few typical production issues often remain.

New trace_id in every service

If each service generates its own trace_id, the end-to-end trace breaks into pieces. In this mode, it is hard to localize the incident root cause across services.

Only trace_id is passed, without span relations

trace_id without span_id and parent_span_id gives only a flat event list. Without a span tree, it is hard to understand which steps were nested.

Context is lost in async queues

If queue metadata does not carry trace context, parts of the workflow fall out of the trace. These gaps often mask the early phase of a partial outage or a cascading failure.
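One way to avoid these gaps is to attach trace context to every message at publish time and restore it in the consumer. A minimal sketch, with a plain list standing in for a real broker and illustrative metadata keys:

```python
# Sketch: propagate trace context through a queue via message metadata.
import uuid


def publish(queue, payload, trace_id, parent_span_id):
    # Attach trace context so the consumer can continue the same trace.
    queue.append({
        "payload": payload,
        "metadata": {"trace_id": trace_id, "parent_span_id": parent_span_id},
    })


def consume(queue):
    message = queue.pop(0)
    meta = message["metadata"]
    # Continue the existing trace if context is present; start a new one otherwise.
    trace_id = meta.get("trace_id") or uuid.uuid4().hex
    parent_span_id = meta.get("parent_span_id")
    span_id = uuid.uuid4().hex[:16]  # new span for the consumer's own work
    return message["payload"], trace_id, parent_span_id, span_id


queue = []
publish(queue, {"task": "extract_invoices"}, trace_id="tr_7a31", parent_span_id="sp_a1")
payload, trace_id, parent_span_id, span_id = consume(queue)
```

With a real broker (Kafka, RabbitMQ, SQS), the same idea maps to message headers or attributes; OpenTelemetry instrumentation libraries can do this injection and extraction automatically.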

Missing service_name and operation_name

Without these fields, you can see that an error happened, but not in which service and operation. This makes debugging significantly slower.

Self-Check

Below is a short checklist for baseline distributed tracing before release:

  • every service propagates the incoming trace_id instead of generating its own;
  • span_id and parent_span_id are passed with every cross-service call;
  • queue messages carry trace context in their metadata;
  • every span records service_name and operation_name;
  • every span records status and latency;
  • errors are attached to the span where they happened;
  • retries are visible as separate spans within the same trace;
  • async workers continue the incoming trace instead of starting a new one;
  • the full path of one run can be reconstructed from logs alone.

If most of these are missing, start with the baseline first: run_id, structured logs, and tracing of tool calls.

FAQ

Q: How is distributed tracing different from regular agent tracing?
A: Agent tracing shows steps inside one runtime. Distributed tracing links steps across multiple services into one end-to-end trace.

Q: What breaks if only trace_id is propagated between services, without span_id and parent_span_id?
A: The trace becomes flat: you can see events belong to one path, but it is hard to understand which steps were nested, which service called which, and where exactly delay or failure happened. trace_id links the global path, while span_id and parent_span_id build the step tree.

Q: How should trace context be passed through queues?
A: Add trace_id, span_id, and parent_span_id to message metadata. Otherwise async steps fall out of the trace.

Q: Is OpenTelemetry mandatory from day one?
A: No. You can start with manual propagation and structured logs, then move to OTel SDK as the system grows.

Related Pages

Next pages on this topic:

⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.