Idea In 30 Seconds
Semantic logging for agents means events have not only a structured JSON format but also stable meaning.
That means equivalent steps in different runs are logged the same way: same event, same key fields, same statuses.
This makes logs usable for search, alerts, analytics, and debugging in production.
Core Problem
Many teams already write structured logs, but this is often not enough.
Across services and agent versions, the same event can have different names and fields:
tool_called, call_tool, tool.invoke.
As a result, logs exist, but comparing runs is hard.
Semantic logging is a design-level solution for events, not only a technical layer on top of logs.
In production, this usually looks like:
- queries against the log system return too much noise;
- alerts behave inconsistently because event names differ;
- during incidents, teams manually map events from multiple formats.
That is why agent systems need a shared event vocabulary and stable field schema.
How It Works
Semantic logging relies on three things:
- a consistent event vocabulary (event taxonomy);
- stable fields for each event;
- normalized values (status, error_class, stop_reason).
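The "stable fields" part can be made checkable in code. A minimal sketch, assuming a hand-maintained schema map (`EVENT_SCHEMA` and `validate_event` are illustrative names, not part of the runtime example below):

```python
# Each event declares its required fields, so the vocabulary can be
# enforced in tests or at the logging boundary.
EVENT_SCHEMA = {
    "tool_call": {"run_id", "trace_id", "step_index", "tool_name", "args_hash"},
    "tool_result": {
        "run_id", "trace_id", "step_index",
        "tool_name", "latency_ms", "status", "error_class",
    },
}

def validate_event(event_name, fields):
    # Fail fast when a producer forgets a required field.
    missing = EVENT_SCHEMA[event_name] - set(fields)
    if missing:
        raise ValueError(f"{event_name} missing fields: {sorted(missing)}")
```
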
The event taxonomy is a contract between runtime, logs, dashboards, and alerts.
Breaking this contract breaks observability.
status usually has a constrained set of values (for example: ok, error, timeout, cancelled; the implementation example below uses a simplified ok / error pair).
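One way to keep the set constrained is to coerce values at the logging boundary. A sketch (the helper name and allowed set are illustrative):

```python
ALLOWED_STATUSES = {"ok", "error", "timeout", "cancelled"}

def coerce_status(raw):
    # Anything outside the agreed set degrades to "error" instead of
    # leaking a new free-form value into dashboards and alerts.
    value = str(raw).strip().lower()
    return value if value in ALLOWED_STATUSES else "error"
```
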
Semantic logging does not replace tracing; it complements it. It makes events not only visible but comparable across services and releases. Logging answers "what happened", tracing answers "how it happened", and semantic logging answers "what it means".
Minimal Event Vocabulary
| Event | Semantic meaning | Key fields |
|---|---|---|
| run_started | a new run started | run_id, trace_id, request_id, task_hash |
| agent_step | agent moved to the next step | step_index, step_type, actor |
| tool_call | start of a tool call | tool_name, args_hash |
| tool_result | tool call result | tool_name, latency_ms, status, error_class |
| llm_result | model-step result | model, token_usage, latency_ms, status |
| policy_decision | policy/guardrails decision | rule_id, decision, reason_code |
| run_finished | run finished | stop_reason, total_steps, total_latency_ms |
policy_decision helps you see not only failures, but also blocking causes and guardrail decisions.
event_version lets you evolve the event schema without breaking existing dashboards and alerts.
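As an illustration, a consumer can branch on event_version while two schema versions coexist (the v2 field rename below is an assumption, not part of this page's schema):

```python
def extract_latency(event):
    # Assumed evolution: v2 renames latency_ms to duration_ms.
    # Old dashboards keep working on v1 events; new ones read v2.
    if event.get("event_version", 1) >= 2:
        return event["duration_ms"]
    return event["latency_ms"]
```
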
When To Use
Full semantic logging is not always required.
For a simple single-shot scenario without tools and without an execution loop, basic logs are often enough.
But semantic logging becomes critical when:
- the system has multiple agents or services;
- you need stable alerts and dashboards;
- behavior must be compared across releases;
- incidents must be analyzed fast, without manual event mapping.
Implementation Example
Below is a simplified example of semantic logging in runtime. The idea is simple: log only events from an agreed vocabulary, and normalize field values.
```python
import hashlib
import json
import logging
import time
import uuid
from enum import StrEnum  # StrEnum requires Python 3.11+

logger = logging.getLogger("agent")


class EventName(StrEnum):
    RUN_STARTED = "run_started"
    AGENT_STEP = "agent_step"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    LLM_RESULT = "llm_result"
    POLICY_DECISION = "policy_decision"
    RUN_FINISHED = "run_finished"


def stable_hash(value):
    # default=str gives baseline compatibility;
    # in critical systems, explicit serialization is better (for example ISO 8601)
    payload = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        default=str,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def normalize_status(ok):
    return "ok" if ok else "error"


def normalize_error(error):
    if error is None:
        return None
    return type(error).__name__


def log_semantic(event_name: EventName, **fields):
    logger.info(
        event_name.value,
        extra={
            "event": event_name.value,
            "event_version": 1,
            "timestamp_ms": int(time.time() * 1000),
            **fields,
        },
    )


def run_agent(agent, task, request_id=None):
    run_id = str(uuid.uuid4())
    trace_id = str(uuid.uuid4())
    started_at = time.time()
    step_index = 0
    stop_reason = "max_steps"
    run_status = "ok"

    log_semantic(
        EventName.RUN_STARTED,
        run_id=run_id,
        trace_id=trace_id,
        request_id=request_id,
        task_hash=stable_hash(task),
    )

    try:
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_index += 1
            step_started_at = time.time()
            step_type = step.type
            tool_name = getattr(step, "tool_name", None)

            log_semantic(
                EventName.AGENT_STEP,
                run_id=run_id,
                trace_id=trace_id,
                step_index=step_index,
                step_type=step_type,
                actor=getattr(step, "actor", "agent_runtime"),
            )

            if step_type == "tool_call":
                args = getattr(step, "args", {})
                log_semantic(
                    EventName.TOOL_CALL,
                    run_id=run_id,
                    trace_id=trace_id,
                    step_index=step_index,
                    tool_name=tool_name,
                    args_hash=stable_hash(args),
                )

            try:
                result = step.execute()
                latency_ms = int((time.time() - step_started_at) * 1000)

                if step_type == "tool_call":
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                        error_class=None,
                    )
                else:
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        token_usage=getattr(result, "token_usage", None),
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                    )

                # policy_decision is logged after the step
                # (when result or error is known)
                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )
            except Exception as error:
                latency_ms = int((time.time() - step_started_at) * 1000)
                run_status = "error"

                if step_type == "tool_call":
                    stop_reason = "tool_error"
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )
                else:
                    stop_reason = "step_error"
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )

                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )

                raise

            if result.is_final:
                stop_reason = "completed"
                break
    finally:
        log_semantic(
            EventName.RUN_FINISHED,
            run_id=run_id,
            trace_id=trace_id,
            status=run_status,
            stop_reason=stop_reason,
            total_steps=step_index,
            total_latency_ms=int((time.time() - started_at) * 1000),
        )
```
In production, such events are usually sent to centralized logging systems (for example ELK, Datadog, or ClickHouse), where they drive queries, dashboards, and alerts.
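To get events into those systems as JSON lines, one option is a custom logging formatter. A minimal sketch (the class name and field list are illustrative, not a fixed API):

```python
import json
import logging

class JsonEventFormatter(logging.Formatter):
    # Semantic fields attached via `extra`; extend to match your vocabulary.
    FIELDS = (
        "event", "event_version", "timestamp_ms", "run_id", "trace_id",
        "step_index", "tool_name", "latency_ms", "status", "error_class",
    )

    def format(self, record):
        # Collect whichever semantic fields are present into one JSON line.
        payload = {
            key: getattr(record, key)
            for key in self.FIELDS
            if hasattr(record, key)
        }
        return json.dumps(payload, ensure_ascii=False)
```
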
For example, one semantic event in JSON can look like this:
```json
{
  "timestamp_ms": 1774106220000,
  "event": "policy_decision",
  "event_version": 1,
  "run_id": "run_9fd2",
  "trace_id": "tr_9fd2",
  "step_index": 3,
  "rule_id": "email_external_domain",
  "decision": "deny",
  "reason_code": "missing_user_confirmation"
}
```
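Once events are semantic, analytics becomes field comparisons. A sketch in plain Python over an in-memory list (in production this would be a query in the log store):

```python
from collections import Counter

def deny_counts(events):
    # Which rules block runs most often? With stable event names and
    # normalized decision values, this is a filter plus a count.
    return Counter(
        e["rule_id"]
        for e in events
        if e["event"] == "policy_decision" and e["decision"] == "deny"
    )
```
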
Common Mistakes
Even with structured logs already in place, semantic logging often breaks because of the common mistakes below.
Events named differently across services
When one action has different event names, log queries become unstable.
As a result, it is harder to detect tool failure or early-stage tool spam in time.
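One mitigation is to normalize legacy names at ingestion, so queries only ever see canonical events. A sketch using the aliases from the beginning of this page:

```python
# Map every legacy alias onto the canonical event name.
CANONICAL_EVENTS = {
    "tool_called": "tool_call",
    "call_tool": "tool_call",
    "tool.invoke": "tool_call",
}

def canonical_event(name):
    # Unknown names pass through unchanged.
    return CANONICAL_EVENTS.get(name, name)
```
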
Free text instead of normalized fields
Fields like "error": "something failed" are almost useless for analytics.
It is better to use separate normalized fields like status, error_class, and reason_code.
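The difference is easy to see side by side (the reason_code value below is made up for illustration):

```python
# Free text: almost impossible to query or alert on reliably.
bad_event = {"error": "something failed"}

# Normalized fields: each question becomes a field comparison.
good_event = {
    "status": "error",
    "error_class": "TimeoutError",
    "reason_code": "upstream_timeout",  # illustrative value
}

def is_timeout(event):
    # No substring matching over free text, just an equality check.
    return event.get("error_class") == "TimeoutError"
```
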
No event_version
Without event versioning, schema changes silently break dashboards, saved queries, and alerts. So schema evolution should be explicit.
Raw prompts or raw args logged without redaction
This is a security and compliance risk. Safer choices are hashes or anonymized field versions.
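A sketch of redaction that keeps the payload shape for debugging while hiding values (the sensitive-key set is an assumption; real systems usually drive this from a policy):

```python
import hashlib

SENSITIVE_KEYS = {"email", "phone", "api_key"}  # assumed sensitive set

def redact_args(args):
    # Replace sensitive values with a short digest so raw data
    # never reaches the log store, while non-sensitive keys stay readable.
    def digest(value):
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]

    return {
        key: digest(value) if key in SENSITIVE_KEYS else value
        for key, value in args.items()
    }
```
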
Self-Check
Before release, check that baseline observability is in place. If it is missing, the system will be hard to debug in production: start with run_id, structured logs, and tracing of tool calls.
FAQ
Q: How is semantic logging different from regular JSON logging?
A: JSON logging defines format only. Semantic logging defines meaning: stable event names, stable fields, and normalized values.
Q: Does semantic logging replace tracing?
A: No. Tracing shows execution path, while semantic logging makes events on that path understandable for search, alerts, and analytics.
Q: What is the minimum semantic logging for a first production release?
A: Baseline event vocabulary (run_started, tool_call, tool_result, run_finished), stable run_id/trace_id, status, error_class, and stop_reason.
Q: Do we need to migrate all old logs immediately?
A: No. Start with new events and critical run paths, then migrate legacy formats gradually.
Related Pages
Next pages on this topic:
- Observability for AI Agents – overall model of tracing, logging, and metrics.
- Agent Logging – which events to capture in runtime.
- Agent Tracing – how to see the full path of one run.
- Distributed Agent Tracing – how to connect events across services.
- Debugging Agent Runs – practical incident analysis.