Observability for AI Agents: monitoring agent systems

Observability helps track agent behavior through tracing, logs, and metrics.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
     • Tracing
     • Logging
     • Metrics
  4. When To Use
  5. Implementation Example
  6. Common Mistakes
     • Logging only the final answer
     • No traces for tool calls
     • No cost metrics
     • Logging raw prompts
  7. Self-Check
  8. FAQ
  9. Related Pages

Idea In 30 Seconds

Observability for AI agents shows what happens during a single run.

An agent can take dozens of steps in one run: reasoning, tool calls, repeated LLM calls.

Observability makes these steps visible through tracing, logs, and metrics.

Core Problem

In a classic backend, one HTTP request runs predictable code:

TEXT
request β†’ handler β†’ database β†’ response

The number of steps is known in advance, and behavior is easy to track with regular logs.

AI-agent systems behave differently: one request can turn into a run with multiple reasoning steps, tool calls, and repeated iterations.

An agent can take 2 steps or 20. It can call multiple tools, repeat reasoning several times, and spend far more tokens than expected.

Without observability, it is hard to answer even basic questions:

  • Why is the request slow?
  • Why did token costs spike?
  • Why does the agent call a tool dozens of times?
  • At which step did the error happen?

For these systems, regular logs are not enough. Observability is needed to see the full execution path and quickly find problematic steps during debugging.

How It Works

Observability for AI agents is based on three signal types: traces, logs, and metrics. Together, they let you see both individual requests and overall system health.

There are specialized observability tools for agents (for example LangSmith, Langfuse, Arize Phoenix), but the core principles are the same regardless of tool.

Tracing

Tracing shows the full execution path of one agent run. Each step is recorded as an event: model reasoning, a tool call, or a result. A trace consists of spans: the trace is the full execution path (the run), while a span is one step inside it, such as a tool call or a reasoning step.

TEXT
run_id: 9fd2

step 1 - search_docs       420ms
step 2 - summarize         110ms
step 3 - generate_answer   890ms

Tracing shows which tools were called and how many steps the agent made. It also shows where a failure or delay happened.
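As a rough sketch, such a trace can be collected in plain Python with a context manager per step. The Trace class and span names here are illustrative, not a real tracing API:

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Collects spans for one agent run (a minimal sketch)."""

    def __init__(self):
        self.run_id = uuid.uuid4().hex[:4]
        self.spans = []

    @contextmanager
    def span(self, name):
        # each `with trace.span(...)` block becomes one recorded step
        start = time.time()
        try:
            yield
        finally:
            self.spans.append({
                "name": name,
                "latency_ms": round((time.time() - start) * 1000),
            })

trace = Trace()
with trace.span("search_docs"):
    pass  # the tool call would run here
with trace.span("generate_answer"):
    pass  # the LLM call would run here

for i, s in enumerate(trace.spans, 1):
    print(f"step {i} - {s['name']:<18} {s['latency_ms']}ms")
```

Dedicated tracing platforms add storage, visualization, and cross-service correlation on top of the same idea.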

What an agent trace looks like

The easiest way to understand tracing is to walk through one real request.

Example trace for one request:

TEXT
run_id: 9fd2
user_query: "Find recent research about battery recycling"

step 1  llm_reasoning        320ms
        thought: need to search research papers

step 2  tool_call: search    410ms
        query: battery recycling research 2024

step 3  llm_reasoning        180ms
        thought: summarize the most relevant papers

step 4  tool_call: fetch     260ms
        source: arxiv

step 5  llm_generate         720ms
        output: final answer

This trace quickly shows:

  • what steps the agent executed
  • which tools were called
  • how long each step took
  • where a delay or error happened

In production systems, a trace often includes additional data:

  • run_id and trace_id for event correlation
  • latency for each step
  • LLM token usage
  • stop reason (completed, max_steps, tool_error)

Traces are useful not only for debugging. They are also required for evaluations. Without intermediate steps, it is hard to automatically verify whether the agent behaved correctly during the run, not just whether the final answer looked correct.
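For example, an evaluation check over intermediate steps might flag repeated tool calls. This is a minimal sketch that assumes step events shaped like the ones above (step_type, tool):

```python
from collections import Counter

def check_trace(steps, max_tool_repeats=3):
    """Flag suspicious behavior using intermediate steps,
    not just the final answer (illustrative check)."""
    tool_counts = Counter(
        s["tool"] for s in steps if s["step_type"] == "tool_call"
    )
    return [
        f"tool '{tool}' called {count} times"
        for tool, count in tool_counts.items()
        if count > max_tool_repeats
    ]

steps = [{"step_type": "tool_call", "tool": "search"}] * 5
print(check_trace(steps))  # → ["tool 'search' called 5 times"]
```

The same pattern extends to other checks: maximum step count, unexpected tools, or a stop reason other than `completed`.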

Logging

Logs capture events during agent execution:

  • run start and run finish
  • tool calls with parameters
  • errors and exceptions
  • agent stop reason

Logs answer the question "what happened". Tracing answers "how exactly it happened".
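One way to emit such events as structured logs is a custom formatter. This is a minimal sketch; production systems often use a ready-made JSON logging library instead:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Renders each log record as one JSON line (minimal sketch)."""

    FIELDS = ("run_id", "tool", "latency", "status")

    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        for field in self.FIELDS:
            value = getattr(record, field, None)  # set via `extra=`
            if value is not None:
                entry[field] = value
        return json.dumps(entry)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("tool_call", extra={"run_id": "9fd2", "tool": "search_docs",
                                "latency": 0.41, "status": "ok"})
```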

Metrics

Metrics show system-wide behavior, not one specific request. Typical production metrics for agents:

TEXT
Metric               What it shows
run count            number of runs over time
latency p50/p95      response speed
tool calls per run   tool load
token usage          LLM cost
error rate           failure frequency

Metrics are needed for production monitoring and alerting. They make anomalies visible quickly: sudden growth in token usage, latency, or tool-call count.

Most LLM providers (OpenAI, Anthropic) return token usage in responses, so you usually do not need to calculate this manually. Logging these fields is enough.
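A minimal sketch of accumulating those fields, assuming an OpenAI-style usage object (exact field names vary by provider):

```python
def record_token_usage(response, totals):
    """Accumulate token counts returned by the provider.
    Assumes an OpenAI-style `usage` dict with prompt_tokens /
    completion_tokens; other providers report the same data
    under slightly different names."""
    usage = response.get("usage", {})
    totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
    totals["completion_tokens"] += usage.get("completion_tokens", 0)
    return totals

totals = {"prompt_tokens": 0, "completion_tokens": 0}
record_token_usage({"usage": {"prompt_tokens": 120, "completion_tokens": 45}}, totals)
record_token_usage({"usage": {"prompt_tokens": 80, "completion_tokens": 30}}, totals)
print(totals)  # → {'prompt_tokens': 200, 'completion_tokens': 75}
```

From these totals, a cost metric is a simple multiplication by the provider's per-token price.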

When To Use

Observability is not always required.

For simple scenarios, basic logging is usually enough. For example, one LLM call without tools and without an execution loop.

But once you have a multi-step run (reasoning, tool calls, repeated iterations), it becomes difficult to debug, control costs, and explain behavior without observability.

Implementation Example

Below is a simplified instrumentation example for an agent runtime. In real systems, these events are usually sent to a tracing system or observability platform. In most runtimes, each execution unit is represented as a step: either model reasoning or a tool call. The agent.iter loop below is illustrative; frameworks such as LangGraph and CrewAI expose similar step-by-step iteration, as do custom runtime implementations.

PYTHON
import logging
import time
import uuid

logger = logging.getLogger("agent")

def run_agent(agent, task):
    run_id = str(uuid.uuid4())

    logger.info("agent_run_started", extra={
        "run_id": run_id,
        "task": task,
    })

    steps = []
    result = None  # stays None if the agent yields no steps

    for step in agent.iter(task):  # step: reasoning or tool execution
        step_start = time.time()
        result = step.execute()
        latency = time.time() - step_start

        step_event = {
            "run_id": run_id,
            "step_type": step.type,  # reasoning | tool_call | llm_generate
            "tool": getattr(step, "tool_name", None),
            "latency": latency,
        }

        logger.info("agent_step", extra=step_event)
        steps.append(step_event)

        if result.is_final:
            break

    logger.info("agent_run_finished", extra={
        "run_id": run_id,
        "steps": len(steps),
    })

    return result

Tools can be wrapped in a helper to log both successful calls and failures:

PYTHON
def traced_tool(tool_fn):
    def wrapper(*args, **kwargs):
        start = time.time()

        try:
            result = tool_fn(*args, **kwargs)
            logger.info(
                "tool_call",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "status": "ok",
                },
            )
            return result
        except Exception as error:
            logger.error(
                "tool_call_failed",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "error": str(error),
                },
            )
            raise

    return wrapper

The example above shows baseline instrumentation logic. In production systems, it is usually extended with several practices:

  • trace_id for each run
  • structured logs (JSON)
  • latency and token usage metrics
  • integration with monitoring systems

For example, one structured log entry can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "level": "INFO",
  "event": "tool_call",
  "run_id": "9fd2...",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}

In real systems, run_id and trace_id should be propagated through the whole call chain. In Python, contextvars is often used for this, so you do not pass identifiers into every function manually.

This helps to see the full execution path and quickly find problematic steps.
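A minimal sketch of that propagation with contextvars (the function names here are illustrative):

```python
import contextvars
import uuid

# holds the run_id for the current execution context, so deep
# call chains can read it without passing it explicitly
current_run_id = contextvars.ContextVar("run_id", default=None)

def start_run():
    run_id = str(uuid.uuid4())
    current_run_id.set(run_id)
    return run_id

def call_tool(name):
    # deep inside the call chain: the id is available from context
    run_id = current_run_id.get()
    print(f"tool_call tool={name} run_id={run_id}")
    return run_id
```

contextvars also works with asyncio: each task gets its own copy of the context, so concurrent runs do not overwrite each other's run_id.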

Common Mistakes

Even when observability is already in place, agent systems often remain hard to diagnose, most often because of the mistakes below.

Logging only the final answer

If the system logs only the final result, it is impossible to understand how the agent arrived there. That is why it is important to log reasoning steps, tool calls, and stop reason. Without this, even simple post-release incidents are hard to analyze.

No traces for tool calls

Tools are often the slowest part of the system. If tool calls are not in traces, it is hard to understand:

  • which tool introduces delay
  • where exactly the failure happens
  • whether the agent calls the same tool repeatedly

In production, this often masks tool spam or an early phase of cascading failures.

No cost metrics

Agents can silently increase costs through long reasoning loops, repeated LLM calls, or unnecessary tool calls. Without cost metrics, this is often noticed only after a visible bill increase. As a result, the system can quickly move into an expensive and unstable mode.

Logging raw prompts

Prompts can contain personal data, secrets, or internal information. Before writing them to logs, they should be redacted or anonymized. Otherwise, incident logs can leak sensitive data.

Self-Check

Below is a short checklist for baseline production observability:

  • every run has a run_id (and a trace_id for correlation)
  • logs are structured (JSON), not free-form text
  • every step records its latency
  • tool calls appear in traces with parameters and status
  • token usage is logged for every LLM call
  • the stop reason is recorded for every run
  • error rate, latency, and tool calls per run are tracked as metrics
  • alerts fire on sudden growth in token usage, latency, or tool-call count
  • prompts are redacted or anonymized before they reach logs

If most of these are missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.

FAQ

Q: What is the difference between tracing, logs, and metrics?
A: Tracing shows one run path step by step. Logs capture events and errors. Metrics show the overall view over time (latency, error rate, token usage).

Q: What should be implemented first if there is no observability yet?
A: Start with the minimum: run_id, structured logs, and per-step latency. Then add tracing for tool calls and key metrics (token usage, tool calls per run, error rate).

Q: Which fields matter most for debugging a problematic run?
A: Minimum: run_id/trace_id, step type, called tool, step latency, stop reason, and error (if any). This is enough to reconstruct the event chain.

Q: Is it mandatory to use external observability tools from day one?
A: No. You can start with your own instrumentation and structured logs. Specialized platforms make more sense when run volume, team count, and need for centralized analysis grow.

Related Pages

Next pages on this topic:

Add guardrails to tool-calling agents
8 min read • Updated March 21, 2026 • Difficulty: ★★★

Ship this pattern with governance:

  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability

OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.