Idea In 30 Seconds
Observability for AI agents shows what happens during a single run.
An agent can make dozens of steps: reasoning, tool calls, repeated LLM calls.
Observability makes these steps visible through tracing, logs, and metrics.
Core Problem
In a classic backend, one HTTP request runs predictable code:
request → handler → database → response
The number of steps is known in advance, and behavior is easy to track with regular logs.
In AI-agent systems, it works differently: one request can become a run with multiple reasoning steps, tool calls, and repeated iterations.
An agent can make 2 steps or 20. It can call multiple tools, repeat reasoning several times, and spend more tokens than expected.
Without observability, it is hard to answer even basic questions:
- Why is the request slow?
- Why did token costs spike?
- Why does the agent call a tool dozens of times?
- At which step did the error happen?
For these systems, regular logs are not enough. Observability is needed to see the full execution path and quickly find problematic steps during debugging.
How It Works
Observability for AI agents is based on three signal types: traces, logs, and metrics. Together, they let you see both individual requests and overall system health.
There are specialized observability tools for agents (for example LangSmith, Langfuse, Arize Phoenix), but the core principles are the same regardless of tool.
Tracing
Tracing shows the full execution path of one agent run.
Each step is recorded as an event: model reasoning, a tool call, or a result.
A trace consists of spans.
A trace is the full execution path of one run, while a span is one step inside it, for example a tool call or a reasoning step.
run_id: 9fd2
step 1 → search_docs 420ms
step 2 → summarize 110ms
step 3 → generate_answer 890ms
Tracing shows which tools were called and how many steps the agent made. It also shows where a failure or delay happened.
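As a minimal sketch, a trace can be represented as a run_id plus a list of spans even without an external tracing library. The Span and Trace classes below are illustrative, not part of any specific framework:

import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str          # e.g. "search_docs" or "llm_reasoning"
    latency_ms: float

@dataclass
class Trace:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    spans: list = field(default_factory=list)

    def record(self, name, fn):
        # Execute one step and store it as a span.
        start = time.time()
        result = fn()
        self.spans.append(Span(name, (time.time() - start) * 1000))
        return result

Each record call produces one span, and the collected spans form the trace of the run.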
What an agent trace looks like
The easiest way to understand tracing is to walk through one real request.
Example trace for one request:
run_id: 9fd2
user_query: "Find recent research about battery recycling"
step 1 llm_reasoning 320ms
thought: need to search research papers
step 2 tool_call: search 410ms
query: battery recycling research 2024
step 3 llm_reasoning 180ms
thought: summarize the most relevant papers
step 4 tool_call: fetch 260ms
source: arxiv
step 5 llm_generate 720ms
output: final answer
This trace quickly shows:
- what steps the agent executed
- which tools were called
- how long each step took
- where a delay or error happened
In production systems, a trace often includes additional data:
- run_id and trace_id for event correlation
- latency for each step
- LLM token usage
- stop reason (completed, max_steps, tool_error)
Traces are useful not only for debugging. They are also required for evaluations. Without intermediate steps, it is hard to automatically verify whether the agent behaved correctly during the run, not just whether the final answer looked correct.
Logging
Logs capture events during agent execution:
- run start and run finish
- tool calls with parameters
- errors and exceptions
- agent stop reason
Logs answer the question "what happened". Tracing answers "how exactly it happened".
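With Python's standard logging module, each of these events can be emitted as one structured record; the event names and fields below are illustrative:

import logging

logger = logging.getLogger("agent")

logger.info("agent_run_started", extra={"run_id": "9fd2", "task": "battery recycling research"})
logger.info("tool_call", extra={"run_id": "9fd2", "tool": "search", "latency_ms": 410})
logger.error("tool_call_failed", extra={"run_id": "9fd2", "tool": "fetch", "error": "timeout"})
logger.info("agent_run_finished", extra={"run_id": "9fd2", "stop_reason": "completed"})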
Metrics
Metrics show system-wide behavior, not one specific request. Typical production metrics for agents:
| Metric | What it shows |
|---|---|
| run count | number of runs over time |
| latency p50/p95 | response speed |
| tool calls per run | tool load |
| token usage | token consumption that drives LLM cost |
| error rate | failure frequency |
Metrics are needed for production monitoring and alerting. They make anomalies visible quickly: sudden growth in token usage, latency, or tool-call count.
Most LLM providers (OpenAI, Anthropic) return token usage in responses, so you usually do not need to calculate this manually. Logging these fields is enough.
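As one possible sketch, these metrics can be emitted with the prometheus_client library; the metric names below are illustrative:

from prometheus_client import Counter, Histogram

RUNS = Counter("agent_runs_total", "Number of agent runs", ["status"])
RUN_LATENCY = Histogram("agent_run_latency_seconds", "End-to-end run latency")
TOKENS = Counter("agent_tokens_total", "LLM tokens used", ["model"])
TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls", ["tool"])

def record_run(status, latency_s, model, token_usage, tool_calls):
    # token_usage usually comes straight from the provider response,
    # so no manual counting is needed here.
    RUNS.labels(status=status).inc()
    RUN_LATENCY.observe(latency_s)
    TOKENS.labels(model=model).inc(token_usage)
    for tool, count in tool_calls.items():
        TOOL_CALLS.labels(tool=tool).inc(count)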
When To Use
Observability is not always required.
For simple scenarios, basic logging is usually enough. For example, one LLM call without tools and without an execution loop.
But once you have a multi-step run with reasoning, tool calls, and repeated iterations, it becomes difficult to debug, control costs, and explain behavior without observability.
Implementation Example
Below is a simplified instrumentation example in an agent runtime.
In real systems, these events are usually sent to a tracing system or observability platform.
In most runtime systems, each execution unit is represented as a step: it can be model reasoning or a tool call.
A step loop like this (agent.iter here is a simplified placeholder) appears in frameworks such as LangGraph and CrewAI, as well as in custom runtime implementations.
import logging
import time
import uuid
logger = logging.getLogger("agent")
def run_agent(agent, task):
    run_id = str(uuid.uuid4())
    logger.info("agent_run_started", extra={
        "run_id": run_id,
        "task": task,
    })

    steps = []
    for step in agent.iter(task):  # step: reasoning or tool execution
        step_start = time.time()
        result = step.execute()
        latency = time.time() - step_start

        step_event = {
            "run_id": run_id,
            "step_type": step.type,  # reasoning | tool_call | llm_generate
            "tool": getattr(step, "tool_name", None),
            "latency": latency,
        }
        logger.info("agent_step", extra=step_event)
        steps.append(step_event)

        if result.is_final:
            break

    logger.info("agent_run_finished", extra={
        "run_id": run_id,
        "steps": len(steps),
    })
Tools can be wrapped in a helper to log both successful calls and failures:
import functools

def traced_tool(tool_fn):
    @functools.wraps(tool_fn)  # preserve the original tool name for logging
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            result = tool_fn(*args, **kwargs)
            logger.info(
                "tool_call",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "status": "ok",
                },
            )
            return result
        except Exception as error:
            logger.error(
                "tool_call_failed",
                extra={
                    "tool": tool_fn.__name__,
                    "latency": time.time() - start,
                    "error": str(error),
                },
            )
            raise
    return wrapper
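For example, applying it as a decorator to a hypothetical tool function:

@traced_tool
def search_docs(query):
    # Hypothetical tool implementation.
    ...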
The example above shows baseline instrumentation logic. In production systems, it is usually extended with several practices:
- trace_id for each run
- structured logs (JSON)
- latency and token usage metrics
- integration with monitoring systems
For example, one structured log entry can look like this:
{
"timestamp": "2026-03-21T15:17:00Z",
"level": "INFO",
"event": "tool_call",
"run_id": "9fd2...",
"tool": "search_docs",
"latency_ms": 410,
"status": "ok"
}
In real systems, run_id and trace_id should be propagated through the whole call chain.
In Python, contextvars is often used for this, so you do not pass identifiers into every function manually.
This helps to see the full execution path and quickly find problematic steps.
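A minimal sketch of this pattern with the standard library; the RunIdFilter name and fields are illustrative:

import contextvars
import logging
import uuid

run_id_var = contextvars.ContextVar("run_id", default=None)

class RunIdFilter(logging.Filter):
    def filter(self, record):
        # Attach the current run_id to every log record automatically.
        record.run_id = run_id_var.get()
        return True

logger = logging.getLogger("agent")
logger.addFilter(RunIdFilter())

def start_run():
    run_id_var.set(str(uuid.uuid4()))
    # run_id now reaches every log record in this context
    # (the formatter or JSON handler must include the run_id field).
    logger.info("agent_run_started")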
Common Mistakes
Even when observability is already added, agent systems often remain hard to diagnose, most often because of the mistakes below.
Logging only the final answer
If the system logs only the final result, it is impossible to understand how the agent arrived there. That is why it is important to log reasoning steps, tool calls, and stop reason. Without this, even simple post-release incidents are hard to analyze.
No traces for tool calls
Tools are often the slowest part of the system. If tool calls are not in traces, it is hard to understand:
- which tool introduces delay
- where exactly the failure happens
- whether the agent calls the same tool repeatedly
In production, this often masks tool spam or an early phase of cascading failures.
No cost metrics
Agents can silently increase costs through long reasoning loops, repeated LLM calls, or unnecessary tool calls. Without cost metrics, this is often noticed only after a visible bill increase. As a result, the system can quickly move into an expensive and unstable mode.
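A small sketch of cost estimation from token counts; the prices below are placeholders and should be replaced with your provider's current pricing:

PRICES_PER_1K = {  # placeholder prices, check your provider's pricing page
    "input": 0.003,
    "output": 0.015,
}

def estimate_cost(prompt_tokens, completion_tokens):
    return (
        prompt_tokens / 1000 * PRICES_PER_1K["input"]
        + completion_tokens / 1000 * PRICES_PER_1K["output"]
    )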
Logging raw prompts
Prompts can contain personal data, secrets, or internal information. Before writing them to logs, they should be redacted or anonymized. Otherwise, incident logs can leak sensitive data.
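A simple sketch of regex-based redaction before logging; the patterns are examples only and should be extended to cover the data your prompts can actually contain:

import re

REDACTION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),       # email addresses
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<card_number>"),  # card-like numbers
    (re.compile(r"sk-[A-Za-z0-9]{20,}"), "<api_key>"),         # API-key-like tokens
]

def redact(text):
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Contact me at jane@example.com"))  # Contact me at <email>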
Self-Check
A short self-check for baseline production observability: every run has a run_id, logs are structured, per-step latency is recorded, and tool calls are traced. If these are missing, the system will be hard to debug in production; start with run_id, structured logs, and tracing tool calls.
FAQ
Q: What is the difference between tracing, logs, and metrics?
A: Tracing shows one run path step by step. Logs capture events and errors. Metrics show the overall view over time (latency, error rate, token usage).
Q: What should be implemented first if there is no observability yet?
A: Start with the minimum: run_id, structured logs, and per-step latency. Then add tracing for tool calls and key metrics (token usage, tool calls per run, error rate).
Q: Which fields matter most for debugging a problematic run?
A: Minimum: run_id/trace_id, step type, called tool, step latency, stop reason, and error (if any). This is enough to reconstruct the event chain.
Q: Is it mandatory to use external observability tools from day one?
A: No. You can start with your own instrumentation and structured logs. Specialized platforms make more sense when run volume, team count, and need for centralized analysis grow.
Related Pages
Next pages on this topic:
- Agent Tracing: how to track the execution of a single run.
- Distributed Agent Tracing: tracing across multiple agents.
- Agent Metrics: which indicators to measure.