Distributed tracing for agents: tracing multi-agent systems

Distributed tracing tracks one run across multiple services, queues, tools, and LLM providers while preserving trace context end-to-end.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. What a distributed trace looks like
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
     • New trace_id in every service
     • Only trace_id is passed, without span relations
     • Context is lost in async queues
     • Missing service_name and operation_name
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Distributed tracing shows one run across the full call chain, not just inside a single service.

In multi-agent systems, a request often passes through a gateway, an agent runtime, tools, queues, and LLM providers.

Distributed tracing links these steps via trace_id, span_id, and parent_span_id, so system behavior is visible end-to-end.

Core Problem

When an agent runs across multiple services, logs are usually scattered.

You can see gateway errors, tool-service errors, and agent-runtime errors separately, but nothing links these events as one run. Without shared trace context, it is hard to tell that they all belong to the same request.

As a result, even a simple incident becomes a long investigation:

  • unclear which service introduced delay;
  • unknown where context was lost;
  • hard to connect retries across services;
  • hard to reconstruct the full path of a problematic run.

That is why multi-agent systems need distributed tracing, not only local tracing inside one runtime.

How It Works

Distributed tracing uses the same trace and span model, but across multiple services.

  • trace β€” the full path of one request across all services
  • span β€” one concrete operation in one service

In real systems, these fields are usually based on OpenTelemetry (OTel):

  • trace_id β€” shared identifier for the entire path
  • span_id β€” identifier of the current step
  • parent_span_id β€” relation to the parent step
  • service_name β€” where the step ran
  • operation_name β€” what the service did

To avoid broken traces, trace context must be passed with every cross-service call. Most often this is done via HTTP headers (the W3C traceparent header) or queue-message metadata.
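As a sketch of what header-based propagation looks like, the W3C traceparent header packs a version, a 32-hex trace id, a 16-hex span id, and trace flags into one string. The helpers below build and parse it manually; in production an OpenTelemetry SDK does this for you, so treat the function names here as illustrative:

```python
# Minimal sketch of W3C traceparent handling (normally done by an OTel SDK).
# Format: "00-{32-hex trace-id}-{16-hex parent-id}-{2-hex flags}".
import re
import uuid


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    match = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not match:
        return None  # malformed header: the receiver should start a new trace
    _version, trace_id, parent_span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": parent_span_id, "sampled": flags == "01"}


trace_id = uuid.uuid4().hex       # 32 hex chars
span_id = uuid.uuid4().hex[:16]   # 16 hex chars
header = build_traceparent(trace_id, span_id)
ctx = parse_traceparent(header)
```

The receiving service keeps the incoming trace_id, treats the incoming span id as parent_span_id, and creates its own new span id for the work it does.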

What a distributed trace looks like

The easiest way to understand distributed tracing is to follow one request.

TEXT
trace_id: tr_7a31
user_query: "Find vendor invoices for March"

gateway         span g1   parent=-     18ms   status=ok
agent_runtime   span a1   parent=g1   240ms   status=ok
tool_service    span t1   parent=a1   410ms   status=ok
agent_runtime   span a2   parent=a1   130ms   status=ok
llm_provider    span l1   parent=a2   690ms   status=ok

stop_reason: completed

This trace shows:

  • the full cross-service path;
  • which service introduced the highest latency;
  • where context broke (if it happened);
  • which spans were parent and which were child.
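The trace above can also be analyzed mechanically. Given flat span records, a few lines of code rebuild the parent/child relations and find the slowest span; the field names below mirror the example trace and are illustrative, not a standard:

```python
# Sketch: rebuild a span tree from flat records and find the slowest span.
# Data mirrors the example trace tr_7a31 above.
spans = [
    {"service": "gateway",       "span": "g1", "parent": None, "latency_ms": 18},
    {"service": "agent_runtime", "span": "a1", "parent": "g1", "latency_ms": 240},
    {"service": "tool_service",  "span": "t1", "parent": "a1", "latency_ms": 410},
    {"service": "agent_runtime", "span": "a2", "parent": "a1", "latency_ms": 130},
    {"service": "llm_provider",  "span": "l1", "parent": "a2", "latency_ms": 690},
]

children = {}  # parent span id -> list of child span ids
for record in spans:
    children.setdefault(record["parent"], []).append(record["span"])

slowest = max(spans, key=lambda record: record["latency_ms"])
print(children[None])      # root spans
print(children["a1"])      # children of the runtime span a1
print(slowest["service"])  # service with the highest single-span latency
```

This is exactly what trace backends do at scale: the span tree answers "who called whom", and per-span latency answers "where the time went" (here, the llm_provider span).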

When To Use

Distributed tracing is not always required.

If the system is monolithic and the whole run lives in one process, local tracing is often enough.

But distributed tracing becomes critical when:

  • one request passes through multiple services;
  • workflow includes queues or async workers;
  • multiple agents exchange events;
  • you need precise analysis of latency and retries across services.

Implementation Example

Below is a simplified example of propagating trace context between a gateway and a worker service. It uses simplified headers (x-trace-id, x-parent-span-id) to show the propagation mechanics. In production, teams usually rely on the standard W3C traceparent header (via OpenTelemetry) for automatic trace-context propagation across services.

PYTHON
import contextvars
import logging
import time
import uuid

logger = logging.getLogger("distributed-tracing")
trace_id_ctx = contextvars.ContextVar("trace_id", default=None)
span_id_ctx = contextvars.ContextVar("span_id", default=None)


def start_span(service_name, operation_name, parent_span_id=None):
    span_id = str(uuid.uuid4())
    started_at = time.time()
    logger.info(
        "span_started",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
        },
    )
    return span_id, started_at


def finish_span(service_name, operation_name, span_id, started_at, status, parent_span_id=None, error=None):
    logger.info(
        "span_finished",
        extra={
            "trace_id": trace_id_ctx.get(),
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "service_name": service_name,
            "operation_name": operation_name,
            "status": status,
            "latency_ms": int((time.time() - started_at) * 1000),
            "error": error,
        },
    )


def inject_context(headers):
    headers["x-trace-id"] = trace_id_ctx.get() or ""
    headers["x-parent-span-id"] = span_id_ctx.get() or ""


def extract_context(headers):
    incoming_trace_id = headers.get("x-trace-id") or str(uuid.uuid4())
    incoming_parent_span_id = headers.get("x-parent-span-id")
    trace_token = trace_id_ctx.set(incoming_trace_id)
    return incoming_parent_span_id, trace_token


def gateway_handle_request():
    trace_id = str(uuid.uuid4())
    trace_token = trace_id_ctx.set(trace_id)

    root_span_id, root_started_at = start_span("gateway", "handle_request", parent_span_id=None)
    span_token = span_id_ctx.set(root_span_id)

    try:
        headers = {}
        inject_context(headers)
        call_worker_service(headers)  # example HTTP/gRPC call to worker_handle_request in another service
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="ok",
            parent_span_id=None,
        )
    except Exception as error:
        finish_span(
            "gateway",
            "handle_request",
            root_span_id,
            root_started_at,
            status="error",
            parent_span_id=None,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)


def worker_handle_request(headers):
    parent_span_id, trace_token = extract_context(headers)
    span_id, started_at = start_span("worker", "process_task", parent_span_id=parent_span_id)
    span_token = span_id_ctx.set(span_id)

    try:
        # ... agent work, tool calls, LLM steps ...
        finish_span("worker", "process_task", span_id, started_at, status="ok", parent_span_id=parent_span_id)
    except Exception as error:
        finish_span(
            "worker",
            "process_task",
            span_id,
            started_at,
            status="error",
            parent_span_id=parent_span_id,
            error=str(error),
        )
        raise
    finally:
        span_id_ctx.reset(span_token)
        trace_id_ctx.reset(trace_token)

Even this manual approach illustrates the baseline mechanics of distributed tracing.

In a real workflow, each service usually creates its own span and then forwards this span_id as parent_span_id for the next hop. If this step is skipped, the next service starts a new trace and the end-to-end picture breaks.

For example, one span event in JSON logs can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "span_finished",
  "trace_id": "tr_7a31",
  "span_id": "sp_worker_02",
  "parent_span_id": "sp_gateway_01",
  "service_name": "worker",
  "operation_name": "process_task",
  "latency_ms": 410,
  "status": "ok"
}
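Note that the Python example above attaches span fields via `extra`, which Python's default logging formatter silently drops from the output. A minimal JSON formatter makes those fields appear as one JSON object per line; SpanJsonFormatter and SPAN_FIELDS are illustrative names, not part of any library:

```python
# Sketch: render span events (logged via `extra`) as JSON lines.
import json
import logging

SPAN_FIELDS = ("trace_id", "span_id", "parent_span_id", "service_name",
               "operation_name", "status", "latency_ms", "error")


class SpanJsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, including span fields passed via `extra`."""

    def format(self, record):
        payload = {"event": record.getMessage()}
        for field in SPAN_FIELDS:
            # `extra` kwargs become attributes on the LogRecord.
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(SpanJsonFormatter())
logger = logging.getLogger("distributed-tracing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("span_finished", extra={"trace_id": "tr_7a31", "span_id": "sp_worker_02",
                                    "service_name": "worker", "status": "ok"})
```

With this handler attached, each span_started and span_finished event from the earlier example prints as a JSON line similar to the one shown above, ready for a log pipeline to index by trace_id.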

Common Mistakes

Even after distributed tracing is added, a few typical production issues often remain.

New trace_id in every service

If each service generates its own trace_id, the end-to-end trace breaks into pieces. In this mode, it is hard to localize the incident root cause across services.

Only trace_id is passed, without span relations

trace_id without span_id and parent_span_id gives only a flat event list. Without a span tree, it is hard to understand which steps were nested.

Context is lost in async queues

If queue metadata does not carry trace context, parts of the workflow fall out of the trace. These gaps often mask the early phase of a partial outage or a cascading failure.
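One way to avoid these gaps is to attach trace context to every message at publish time and restore it in the consumer. A minimal sketch, with a plain list standing in for a real broker and illustrative metadata keys:

```python
# Sketch: propagate trace context through a queue via message metadata.
import uuid


def publish(queue, payload, trace_id, parent_span_id):
    # Attach trace context so the consumer can continue the same trace.
    queue.append({
        "payload": payload,
        "metadata": {"trace_id": trace_id, "parent_span_id": parent_span_id},
    })


def consume(queue):
    message = queue.pop(0)
    meta = message["metadata"]
    # Continue the existing trace if context is present; start a new one otherwise.
    trace_id = meta.get("trace_id") or uuid.uuid4().hex
    parent_span_id = meta.get("parent_span_id")
    span_id = uuid.uuid4().hex[:16]  # new span for the consumer's own work
    return message["payload"], trace_id, parent_span_id, span_id


queue = []
publish(queue, {"task": "extract_invoices"}, trace_id="tr_7a31", parent_span_id="sp_a1")
payload, trace_id, parent_span_id, span_id = consume(queue)
```

With a real broker (Kafka, RabbitMQ, SQS), the same idea maps to message headers or attributes; OpenTelemetry instrumentation libraries can do this injection and extraction automatically.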

Missing service_name and operation_name

Without these fields, you can see that an error happened, but not in which service and operation. This makes debugging significantly slower.

Self-Check

Below is a short checklist for baseline distributed tracing before release:

  • every service propagates the incoming trace_id instead of generating its own;
  • span_id and parent_span_id are passed with every cross-service call;
  • queue messages carry trace context in their metadata;
  • every span records service_name and operation_name;
  • every span records status and latency;
  • errors are attached to the span where they happened;
  • retries are visible as separate spans within the same trace;
  • async workers continue the incoming trace instead of starting a new one;
  • the full path of one run can be reconstructed from logs alone.

If most of these are missing, start with the baseline first: run_id, structured logs, and tracing of tool calls.

FAQ

Q: How is distributed tracing different from regular agent tracing?
A: Agent tracing shows steps inside one runtime. Distributed tracing links steps across multiple services into one end-to-end trace.

Q: What breaks if only trace_id is propagated between services, without span_id and parent_span_id?
A: The trace becomes flat: you can see events belong to one path, but it is hard to understand which steps were nested, which service called which, and where exactly delay or failure happened. trace_id links the global path, while span_id and parent_span_id build the step tree.

Q: How should trace context be passed through queues?
A: Add trace_id, span_id, and parent_span_id to message metadata. Otherwise async steps fall out of the trace.

Q: Is OpenTelemetry mandatory from day one?
A: No. You can start with manual propagation and structured logs, then move to OTel SDK as the system grows.

Related Pages

Next pages on this topic:

⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.