Idea In 30 Seconds
Agent logging answers one simple question: what exactly happened during a run.
To do that, you need structured events correlated with run_id and trace_id.
Without this, an incident usually surfaces only the final answer, not the path that produced it.
Core Problem
In a regular backend, a few request logs are often enough.
In agent systems, one request can include reasoning, tool calls, retries, and multiple model steps. If you log only the final answer, it becomes hard to see where exactly the system broke.
In production this usually looks like:
- user reports a wrong answer;
- costs or latency rise in waves;
- logs contain an isolated error without run context.
That is why agents need structured event logging across the full run lifecycle, not scattered ad-hoc logs.
How It Works
The baseline idea is simple: each important step is logged as a separate structured event.
Minimum for each event:
- run_id and trace_id for correlation;
- event (what happened);
- timestamp;
- status (ok/error) where relevant;
- key step fields (tool, latency, stop_reason, etc.).
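As a minimal sketch (field names here are illustrative, not a prescribed schema), one such event can be built as a plain dict before serialization:

```python
import time
import uuid


def make_event(event, run_id, trace_id, status=None, **fields):
    """Build one structured log event as a flat dict.

    Field names are illustrative; teams pick their own schema.
    """
    record = {
        "timestamp": time.time(),
        "event": event,
        "run_id": run_id,
        "trace_id": trace_id,
    }
    if status is not None:
        record["status"] = status
    record.update(fields)  # step-specific fields: tool, latency_ms, ...
    return record


event = make_event(
    "tool_result",
    run_id=str(uuid.uuid4()),
    trace_id=str(uuid.uuid4()),
    status="ok",
    tool="search_docs",
    latency_ms=410,
)
```

Keeping events flat like this makes them trivial to index and filter later in any log backend.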
Which events to log first
| Event | What to record |
|---|---|
| run_started | run_id, trace_id, request_id, user_id |
| agent_step | step_type, step_index, tool |
| tool_call | tool_name, args_hash |
| tool_result | tool_name, latency_ms, status, error_class |
| llm_result | model, token usage, latency_ms, status |
| run_finished | stop_reason, total_steps, total_latency_ms |
In production systems, raw prompts and raw tool args are usually not written to logs without redaction. Most teams store a hash or an anonymized form instead.
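A hedged sketch of that approach: hash the raw args for correlation and redact a few known-sensitive keys before anything reaches the log (the key list below is purely illustrative):

```python
import hashlib
import json

SENSITIVE_KEYS = {"email", "phone", "api_key"}  # illustrative list


def redact(args):
    """Replace values of known-sensitive keys; keep the rest."""
    return {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in args.items()
    }


def args_fingerprint(args):
    """Stable hash of the raw args: correlation without exposure."""
    payload = json.dumps(args, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


args = {"query": "refund policy", "email": "user@example.com"}
safe = redact(args)
# safe == {"query": "refund policy", "email": "[REDACTED]"}
```

The hash lets you tell "same input, different result" apart from "different input" without ever storing the input itself.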
When To Use
Deep logging is not always necessary.
For a simple single-shot scenario, minimal request -> response logs may be enough.
But once you have tools, retries, multiple steps, or higher cost, without structured logging it becomes difficult to:
- debug incidents;
- explain costs;
- configure alerts reliably.
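Explaining costs, for example, only works if events can be grouped by run_id; a minimal aggregation sketch, assuming the event shape from the table above:

```python
from collections import defaultdict


def total_tokens_per_run(events):
    """Sum token usage of llm_result events, grouped by run_id."""
    totals = defaultdict(int)
    for e in events:
        if e.get("event") == "llm_result":
            totals[e["run_id"]] += e.get("token_usage", 0)
    return dict(totals)


events = [
    {"event": "llm_result", "run_id": "run_1", "token_usage": 350},
    {"event": "tool_result", "run_id": "run_1", "latency_ms": 120},
    {"event": "llm_result", "run_id": "run_1", "token_usage": 410},
]
# total_tokens_per_run(events) == {"run_1": 760}
```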
Implementation Example
Below is a simplified structured-logging example for the agent runtime and the tool gateway.
In this example raw args are not written to logs: args_hash is used.
In the flow below, agent_step records the step itself, while tool_call and tool_result separately capture tool-call start and result.
```python
import hashlib
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")


def stable_hash(value):
    payload = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        default=str,  # for datetime and complex types; in critical systems prefer a stable format (for example ISO 8601)
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def log_event(event, **fields):
    logger.info(event, extra={"event": event, **fields})


def run_agent(agent, task, user_id=None, request_id=None):
    run_id = str(uuid.uuid4())
    trace_id = str(uuid.uuid4())
    started_at = time.time()
    steps = 0
    stop_reason = "max_steps"
    run_status = "ok"

    log_event(
        "run_started",
        run_id=run_id,
        trace_id=trace_id,
        user_id=user_id,
        request_id=request_id,
        task_hash=stable_hash(task),
    )

    try:
        for step in agent.iter(task):  # step: reasoning or tool execution
            steps += 1
            step_started_at = time.time()
            step_type = step.type
            tool_name = getattr(step, "tool_name", None)

            log_event(
                "agent_step",
                run_id=run_id,
                trace_id=trace_id,
                step_index=steps,
                step_type=step_type,
                tool=tool_name,
            )

            if step_type == "tool_call":
                args = getattr(step, "args", {})
                log_event(
                    "tool_call",
                    run_id=run_id,
                    trace_id=trace_id,
                    tool=tool_name,
                    args_hash=stable_hash(args),
                )

            try:
                result = step.execute()
                latency_ms = int((time.time() - step_started_at) * 1000)
                if step_type == "tool_call":
                    log_event(
                        "tool_result",
                        run_id=run_id,
                        trace_id=trace_id,
                        tool=tool_name,
                        latency_ms=latency_ms,
                        status="ok",
                    )
                else:
                    token_usage = getattr(result, "token_usage", None)
                    log_event(
                        "llm_result",
                        run_id=run_id,
                        trace_id=trace_id,
                        step_type=step_type,
                        model=getattr(step, "model", None),
                        token_usage=token_usage,
                        latency_ms=latency_ms,
                        status="ok",
                    )
            except Exception as error:
                latency_ms = int((time.time() - step_started_at) * 1000)
                result_event = "tool_result" if step_type == "tool_call" else "llm_result"
                log_event(
                    result_event,
                    run_id=run_id,
                    trace_id=trace_id,
                    step_type=step_type,
                    tool=tool_name,
                    model=getattr(step, "model", None),
                    latency_ms=latency_ms,
                    status="error",
                    error_class=type(error).__name__,
                    error_message=str(error),
                )
                run_status = "error"
                stop_reason = "tool_error" if step_type == "tool_call" else "step_error"
                raise

            if result.is_final:
                stop_reason = "completed"
                break
    finally:
        log_event(
            "run_finished",
            run_id=run_id,
            trace_id=trace_id,
            status=run_status,
            stop_reason=stop_reason,
            total_steps=steps,
            total_latency_ms=int((time.time() - started_at) * 1000),
        )
```
In production, these events are usually sent to a centralized logging system (for example ELK, Datadog, or ClickHouse), and then used for dashboards and alerts.
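The extra fields passed to log_event only become queryable once a handler serializes them; one way to do this (a sketch, not the only option) is a custom formatter that emits each record as a JSON line:

```python
import json
import logging


class JsonEventFormatter(logging.Formatter):
    """Emit each log record as one JSON line with the event's extra fields."""

    # Illustrative field list; extend to match your event schema.
    BASE_FIELDS = ("event", "run_id", "trace_id", "tool",
                   "latency_ms", "status", "stop_reason")

    def format(self, record):
        payload = {"timestamp": self.formatTime(record)}
        for field in self.BASE_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload, ensure_ascii=False)


handler = logging.StreamHandler()
handler.setFormatter(JsonEventFormatter())
logger = logging.getLogger("agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Fields passed via `extra` land on the LogRecord as attributes, which is why the formatter can read them back with getattr.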
This example is enough to:
- find a problematic tool call;
- calculate per-step latency;
- understand why the run stopped.
For example, one JSON log record can look like this:
```json
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "tool_result",
  "run_id": "run_9fd2",
  "trace_id": "tr_9fd2",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}
```
Common Mistakes
Even with logging already added, incidents are often hard to investigate because of the mistakes below.
Only final answer is logged
Without intermediate events, you cannot see how the agent reached the result. In this mode, even a simple incident takes too long to investigate.
No stable identifiers (run_id, trace_id)
When events are not correlated, you cannot reconstruct one full run. In production this often turns debugging into manual searching across services.
Raw prompts or raw args are logged without redaction
This is a direct risk of leaking personal or sensitive data. It is safer to log hashes, redacted fields, or anonymized versions.
tool_result and stop_reason are not logged
If tool_result and stop_reason are missing, it is hard to understand what exactly failed.
These gaps often mask tool failure or an early phase of tool spam.
Self-Check
Below is a short checklist for baseline agent logging before release: structured events, run_id and trace_id on every event, tool_call and tool_result coverage, and an explicit stop_reason.
If this baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.
FAQ
Q: How is logging different from tracing?
A: Logging answers "what happened" and records events. Tracing answers "how exactly it happened" through step sequence and links.
Q: What should be logged first if logging is almost missing?
A: Start with basics: run_id, trace_id, run_started, tool_call, tool_result, run_finished, and stop_reason. This is already enough for baseline debugging.
Q: Can prompts be logged fully?
A: By default, no. In production, prompts often contain sensitive data. Safer options are hashes or redacted versions.
Q: How to know if logging is already sufficient?
A: If you can reconstruct one problematic run and find the failure point in 5-10 minutes, your baseline logging is working.
Related Pages
Next pages on this topic:
- Observability for AI Agents – overall model of traces, logs, and metrics.
- Agent Tracing – how to see one run path step by step.
- Distributed Agent Tracing – how to connect events across services.
- Debugging Agent Runs – practical incident investigation.
- Agent Metrics – which indicators are needed for stable operations.