Observability for AI Agents: monitoring agent systems

Observability helps track agent behavior through tracing, logs, and metrics.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
     • Tracing
     • Logging
     • Metrics
  4. When To Use
  5. Implementation Example
  6. Common Mistakes
     • Logging only the final answer
     • No traces for tool calls
     • No cost metrics
     • Logging raw prompts
  7. Self-Check
  8. FAQ
  9. Related Pages

Idea In 30 Seconds

Observability for AI agents shows what happens during a single run.

An agent can take dozens of steps in one run: reasoning, tool calls, repeated LLM calls.

Observability makes these steps visible through tracing, logs, and metrics.

Core Problem

In a classic backend, one HTTP request runs predictable code:

TEXT
request β†’ handler β†’ database β†’ response

The number of steps is known in advance, and behavior is easy to track with regular logs.

AI-agent systems behave differently: one request can turn into a run with multiple reasoning steps, tool calls, and repeated iterations.

An agent can take 2 steps or 20. It can call multiple tools, repeat reasoning several times, and spend far more tokens than expected.

Without observability, it is hard to answer even basic questions:

  • Why is the request slow?
  • Why did token costs spike?
  • Why does the agent call a tool dozens of times?
  • At which step did the error happen?

For these systems, regular logs are not enough. Observability is needed to see the full execution path and quickly find problematic steps during debugging.

How It Works

Observability for AI agents is based on three signal types: traces, logs, and metrics. Together, they let you see both individual requests and overall system health.

There are specialized observability tools for agents (for example LangSmith, Langfuse, Arize Phoenix), but the core principles are the same regardless of tool.

Tracing

Tracing shows the full execution path of one agent run. Each step is recorded as an event: model reasoning, a tool call, or a result. A trace consists of spans: the trace is the full execution path (the run), while a span is one step inside it, such as a tool call or a reasoning step.

TEXT
run_id: 9fd2

step 1 - search_docs       420ms
step 2 - summarize         110ms
step 3 - generate_answer   890ms

Tracing shows which tools were called and how many steps the agent made. It also shows where a failure or delay happened.
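As a rough sketch, such a trace can be collected in plain Python with a context manager per step. The Trace class and span names here are illustrative, not a real tracing API:

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Collects spans for one agent run (a minimal sketch)."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex[:4]
        self.spans = []

    @contextmanager
    def span(self, name):
        # each `with trace.span(...)` block becomes one recorded step
        start = time.time()
        try:
            yield
        finally:
            self.spans.append({
                "name": name,
                "latency_ms": round((time.time() - start) * 1000),
            })

trace = Trace()
with trace.span("search_docs"):
    pass  # the tool call would run here
with trace.span("generate_answer"):
    pass  # the LLM call would run here

for i, s in enumerate(trace.spans, 1):
    print(f"step {i} - {s['name']:<18} {s['latency_ms']}ms")
```

Dedicated tracing platforms add storage, visualization, and cross-service correlation on top of the same idea.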

What an agent trace looks like

The easiest way to understand tracing is to walk through one real request.

Example trace for one request:

TEXT
run_id: 9fd2
user_query: "Find recent research about battery recycling"

step 1  llm_reasoning        320ms
        thought: need to search research papers

step 2  tool_call: search    410ms
        query: battery recycling research 2024

step 3  llm_reasoning        180ms
        thought: summarize the most relevant papers

step 4  tool_call: fetch     260ms
        source: arxiv

step 5  llm_generate         720ms
        output: final answer

This trace quickly shows:

  • what steps the agent executed
  • which tools were called
  • how long each step took
  • where a delay or error happened

In production systems, a trace often includes additional data:

  • run_id and trace_id for event correlation
  • latency for each step
  • LLM token usage
  • stop reason (completed, max_steps, tool_error)

Traces are useful not only for debugging. They are also required for evaluations. Without intermediate steps, it is hard to automatically verify whether the agent behaved correctly during the run, not just whether the final answer looked correct.
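For example, an evaluation check over intermediate steps might flag repeated tool calls. This is a minimal sketch that assumes step events shaped like the ones above (step_type, tool):

```python
from collections import Counter

def check_trace(steps, max_tool_repeats=3):
    """Flag suspicious behavior using intermediate steps,
    not just the final answer (illustrative check)."""
    tool_counts = Counter(
        s["tool"] for s in steps if s["step_type"] == "tool_call"
    )
    return [
        f"tool '{tool}' called {count} times"
        for tool, count in tool_counts.items()
        if count > max_tool_repeats
    ]

steps = [{"step_type": "tool_call", "tool": "search"}] * 5
print(check_trace(steps))  # → ["tool 'search' called 5 times"]
```

The same pattern extends to other checks: maximum step count, unexpected tools, or a stop reason other than `completed`.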

Logging

Logs capture events during agent execution:

  • run start and run finish
  • tool calls with parameters
  • errors and exceptions
  • agent stop reason

Logs answer the question "what happened". Tracing answers "how exactly it happened".
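One way to emit such events as structured logs is a custom formatter. This is a minimal sketch; production systems often use a ready-made JSON logging library instead:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Renders each log record as one JSON line (minimal sketch)."""

    FIELDS = ("run_id", "tool", "latency", "status")

    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        for field in self.FIELDS:
            value = getattr(record, field, None)  # set via `extra=`
            if value is not None:
                entry[field] = value
        return json.dumps(entry)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("tool_call", extra={"run_id": "9fd2", "tool": "search_docs",
                                "latency": 0.41, "status": "ok"})
```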

Metrics

Metrics show system-wide behavior, not one specific request. Typical production metrics for agents:

TEXT
Metric               What it shows
run count            number of runs over time
latency p50/p95      response speed
tool calls per run   tool load
token usage          LLM cost
error rate           failure frequency

Metrics are needed for production monitoring and alerting. They make anomalies visible quickly: sudden growth in token usage, latency, or tool-call count.

Most LLM providers (OpenAI, Anthropic) return token usage in responses, so you usually do not need to calculate this manually. Logging these fields is enough.
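A minimal sketch of accumulating those fields, assuming an OpenAI-style usage object (exact field names vary by provider):

```python
def record_token_usage(response, totals):
    """Accumulate token counts returned by the provider.
    Assumes an OpenAI-style `usage` dict with prompt_tokens /
    completion_tokens; other providers report the same data
    under slightly different names."""
    usage = response.get("usage", {})
    totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
    totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals

totals = {"prompt_tokens": 0, "completion_tokens": 0}
record_token_usage({"usage": {"prompt_tokens": 120, "completion_tokens": 45}}, totals)
record_token_usage({"usage": {"prompt_tokens": 80, "completion_tokens": 30}}, totals)
print(totals)  # → {'prompt_tokens': 200, 'completion_tokens': 75}
```

From these totals, a cost metric is a simple multiplication by the provider's per-token price.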

When To Use

Observability is not always required.

For simple scenarios, basic logging is usually enough. For example, one LLM call without tools and without an execution loop.

But once you have a multi-step run (reasoning, tool calls, repeated iterations), it becomes difficult to debug, control costs, and explain behavior without observability.

Implementation Example

Below is a simplified instrumentation example for an agent runtime. In real systems, these events are usually sent to a tracing system or observability platform. In most runtimes, each execution unit is represented as a step: either model reasoning or a tool call. The agent.iter loop below is illustrative; frameworks such as LangGraph and CrewAI expose similar step-by-step iteration, as do custom runtime implementations.

PYTHON
import logging
import time
import uuid

logger = logging.getLogger("agent")

def run_agent(agent, task):
    run_id = str(uuid.uuid4())

    logger.info("agent_run_started", extra={
        "run_id": run_id,
        "task": task,
    })

    steps = []
    result = None  # stays None if the agent yields no steps

    for step in agent.iter(task):  # step: reasoning or tool execution
        step_start = time.time()
        result = step.execute()
        latency = time.time() - step_start

        step_event = {
            "run_id": run_id,
            "step_type": step.type,  # reasoning | tool_call | llm_generate
            "tool": getattr(step, "tool_name", None),
            "latency": latency,
        }

        logger.info("agent_step", extra=step_event)
        steps.append(step_event)

        if result.is_final:
            break

    logger.info("agent_run_finished", extra={
        "run_id": run_id,
        "steps": len(steps),
    })

    return result

Tools can be wrapped in a helper to log both successful calls and failures:

PYTHON
def traced_tool(tool_fn):
    def wrapper(*args, **kwargs):
        start = time.time()

        try:
            result = tool_fn(*args, **kwargs)
            logger.info(
                "tool_call",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "status": "ok",
                },
            )
            return result
        except Exception as error:
            logger.error(
                "tool_call_failed",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "error": str(error),
                },
            )
            raise

    return wrapper

The example above shows baseline instrumentation logic. In production systems, it is usually extended with several practices:

  • trace_id for each run
  • structured logs (JSON)
  • latency and token usage metrics
  • integration with monitoring systems

For example, one structured log entry can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "level": "INFO",
  "event": "tool_call",
  "run_id": "9fd2...",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}

In real systems, run_id and trace_id should be propagated through the whole call chain. In Python, contextvars is often used for this, so you do not pass identifiers into every function manually.

This helps to see the full execution path and quickly find problematic steps.
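A minimal sketch of that propagation with contextvars (the function names here are illustrative):

```python
import contextvars
import uuid

# holds the run_id for the current execution context, so deep
# call chains can read it without passing it explicitly
current_run_id = contextvars.ContextVar("run_id", default=None)

def start_run():
    run_id = str(uuid.uuid4())
    current_run_id.set(run_id)
    return run_id

def call_tool(name):
    # deep inside the call chain: the id is available from context
    run_id = current_run_id.get()
    print(f"tool_call tool={name} run_id={run_id}")
    return run_id
```

contextvars also works with asyncio: each task gets its own copy of the context, so concurrent runs do not overwrite each other's run_id.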

Common Mistakes

Even when observability is already in place, agent systems often remain hard to diagnose, most often because of the mistakes below.

Logging only the final answer

If the system logs only the final result, it is impossible to understand how the agent arrived there. That is why it is important to log reasoning steps, tool calls, and stop reason. Without this, even simple post-release incidents are hard to analyze.

No traces for tool calls

Tools are often the slowest part of the system. If tool calls are not in traces, it is hard to understand:

  • which tool introduces delay
  • where exactly the failure happens
  • whether the agent calls the same tool repeatedly

In production, this often masks tool spam or an early phase of cascading failures.

No cost metrics

Agents can silently increase costs through long reasoning loops, repeated LLM calls, or unnecessary tool calls. Without cost metrics, this is often noticed only after a visible bill increase. As a result, the system can quickly move into an expensive and unstable mode.

Logging raw prompts

Prompts can contain personal data, secrets, or internal information. Before writing them to logs, they should be redacted or anonymized. Otherwise, incident logs can leak sensitive data.

Self-Check

Below is a short checklist for baseline production observability:

  • every run has a run_id (and a trace_id for correlation)
  • logs are structured (JSON), not free-form text
  • every step records its latency
  • tool calls appear in traces with parameters and status
  • token usage is logged for every LLM call
  • the stop reason is recorded for every run
  • error rate, latency, and tool calls per run are tracked as metrics
  • alerts fire on sudden growth in token usage, latency, or tool-call count
  • prompts are redacted or anonymized before they reach logs

If most of these are missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.

FAQ

Q: What is the difference between tracing, logs, and metrics?
A: Tracing shows one run path step by step. Logs capture events and errors. Metrics show the overall view over time (latency, error rate, token usage).

Q: What should be implemented first if there is no observability yet?
A: Start with the minimum: run_id, structured logs, and per-step latency. Then add tracing for tool calls and key metrics (token usage, tool calls per run, error rate).

Q: Which fields matter most for debugging a problematic run?
A: Minimum: run_id/trace_id, step type, called tool, step latency, stop reason, and error (if any). This is enough to reconstruct the event chain.

Q: Is it mandatory to use external observability tools from day one?
A: No. You can start with your own instrumentation and structured logs. Specialized platforms make more sense when run volume, team count, and need for centralized analysis grow.

Related Pages

Next pages on this topic:

Add guardrails to tool-calling agents
8 min read • Updated March 21, 2026 • Difficulty: ★★★

Ship this pattern with governance:

  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability

OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.