Semantic logging for agents

Semantic logging for agents: consistent event taxonomy, structured fields, and queryable traces for debugging and governance audits.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Minimal Event Vocabulary
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Semantic logging for agents means that events have not only a JSON format but also a stable meaning.

That means equivalent steps in different runs are logged the same way: same event, same key fields, same statuses.

This makes logs usable for search, alerts, analytics, and debugging in production.

Core Problem

Many teams already write structured logs, but this is often not enough.

Across services and agent versions, the same event can have different names and fields: tool_called, call_tool, tool.invoke. As a result, logs exist, but comparing runs is hard.

Semantic logging is a design-level solution for events, not only a technical layer on top of logs.

In production, this usually looks like:

  • log-system queries return too much noise;
  • alerts behave inconsistently because event names differ;
  • during incidents, teams manually map events from multiple formats.

That is why agent systems need a shared event vocabulary and stable field schema.
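Before a shared vocabulary is enforced everywhere, teams often need a bridge from legacy names to canonical ones. A minimal sketch, assuming a hypothetical EVENT_ALIASES table (the alias names are the examples from this page, not a standard):

```python
# Sketch: mapping legacy event names onto one canonical vocabulary.
# EVENT_ALIASES is an illustrative assumption; real systems drive this
# table from their own historical log formats.
EVENT_ALIASES = {
    "tool_called": "tool_call",
    "call_tool": "tool_call",
    "tool.invoke": "tool_call",
    "tool_call": "tool_call",
}


def canonical_event_name(raw_name: str) -> str:
    # Unknown names are passed through unchanged so they can be
    # spotted in queries and added to the mapping later.
    return EVENT_ALIASES.get(raw_name, raw_name)
```

This lets dashboards and alerts query one event name while legacy producers are migrated gradually.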

How It Works

Semantic logging relies on three things:

  • a consistent event vocabulary (event taxonomy);
  • stable fields for each event;
  • normalized values (status, error_class, stop_reason).

An event taxonomy is a contract between runtime, logs, dashboards, and alerts; breaking this contract breaks observability. status usually has a constrained set of values (for example: ok, error, timeout, cancelled; the implementation example below uses a simplified ok / error pair).

Semantic logging does not replace tracing, it complements it. It makes events not only visible, but comparable across services and releases. Logging answers "what happened", tracing answers "how it happened", and semantic logging answers "what it means".

Minimal Event Vocabulary

Event — Semantic meaning — Key fields
run_started — a new run started — run_id, trace_id, request_id, task_hash
agent_step — agent moved to the next step — step_index, step_type, actor
tool_call — start of a tool call — tool_name, args_hash
tool_result — tool call result — tool_name, latency_ms, status, error_class
llm_result — model-step result — model, token_usage, latency_ms, status
policy_decision — policy/guardrails decision — rule_id, decision, reason_code
run_finished — run finished — stop_reason, total_steps, total_latency_ms

policy_decision helps you see not only failures, but also blocking causes and guardrail decisions.

event_version lets you evolve event schema without breaking existing dashboards and alerts.
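The vocabulary above can be enforced mechanically before events are emitted. A minimal sketch, assuming a hand-written REQUIRED_FIELDS table that mirrors a subset of the table above (the field sets are this page's examples, not a standard):

```python
# Sketch: checking that an event carries its required key fields.
# REQUIRED_FIELDS mirrors the vocabulary table; extend it per event.
REQUIRED_FIELDS = {
    "run_started": {"run_id", "trace_id", "request_id", "task_hash"},
    "tool_call": {"run_id", "tool_name", "args_hash"},
    "tool_result": {"run_id", "tool_name", "latency_ms", "status"},
    "run_finished": {"run_id", "stop_reason", "total_steps", "total_latency_ms"},
}


def validate_event(event: dict) -> list:
    # Returns the sorted list of missing required fields (empty = valid).
    required = REQUIRED_FIELDS.get(event.get("event"), set())
    return sorted(required - event.keys())
```

Running this check in CI or in a staging log pipeline catches schema drift before it reaches dashboards.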

When To Use

Full semantic logging is not always required.

For a simple single-shot scenario without tools and without execution loop, basic logs are often enough.

But semantic logging becomes critical when:

  • the system has multiple agents or services;
  • you need stable alerts and dashboards;
  • behavior must be compared across releases;
  • incidents must be analyzed fast, without manual event mapping.

Implementation Example

Below is a simplified example of semantic logging in runtime. The idea is simple: log only events from an agreed vocabulary, and normalize field values.

PYTHON
import hashlib
import json
import logging
import time
import uuid
from enum import StrEnum

logger = logging.getLogger("agent")


class EventName(StrEnum):
    RUN_STARTED = "run_started"
    AGENT_STEP = "agent_step"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    LLM_RESULT = "llm_result"
    POLICY_DECISION = "policy_decision"
    RUN_FINISHED = "run_finished"


def stable_hash(value):
    # default=str gives baseline compatibility
    # in critical systems, explicit serialization is better (for example ISO 8601)
    payload = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        default=str,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def normalize_status(ok):
    return "ok" if ok else "error"


def normalize_error(error):
    if error is None:
        return None
    return type(error).__name__


def log_semantic(event_name: EventName, **fields):
    logger.info(
        event_name.value,
        extra={
            "event": event_name.value,
            "event_version": 1,
            "timestamp_ms": int(time.time() * 1000),
            **fields,
        },
    )


def run_agent(agent, task, request_id=None):
    run_id = str(uuid.uuid4())
    trace_id = str(uuid.uuid4())
    started_at = time.time()
    step_index = 0
    stop_reason = "max_steps"
    run_status = "ok"

    log_semantic(
        EventName.RUN_STARTED,
        run_id=run_id,
        trace_id=trace_id,
        request_id=request_id,
        task_hash=stable_hash(task),
    )

    try:
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_index += 1
            step_started_at = time.time()
            step_type = step.type
            tool_name = getattr(step, "tool_name", None)

            log_semantic(
                EventName.AGENT_STEP,
                run_id=run_id,
                trace_id=trace_id,
                step_index=step_index,
                step_type=step_type,
                actor=getattr(step, "actor", "agent_runtime"),
            )

            if step_type == "tool_call":
                args = getattr(step, "args", {})
                log_semantic(
                    EventName.TOOL_CALL,
                    run_id=run_id,
                    trace_id=trace_id,
                    step_index=step_index,
                    tool_name=tool_name,
                    args_hash=stable_hash(args),
                )

            try:
                result = step.execute()
                latency_ms = int((time.time() - step_started_at) * 1000)

                if step_type == "tool_call":
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                        error_class=None,
                    )
                else:
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        token_usage=getattr(result, "token_usage", None),
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                    )

                # policy_decision is logged after the step
                # (when result or error is known)
                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )

            except Exception as error:
                latency_ms = int((time.time() - step_started_at) * 1000)
                run_status = "error"

                if step_type == "tool_call":
                    stop_reason = "tool_error"
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )
                else:
                    stop_reason = "step_error"
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )

                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )

                raise

            if result.is_final:
                stop_reason = "completed"
                break

    finally:
        log_semantic(
            EventName.RUN_FINISHED,
            run_id=run_id,
            trace_id=trace_id,
            status=run_status,
            stop_reason=stop_reason,
            total_steps=step_index,
            total_latency_ms=int((time.time() - started_at) * 1000),
        )

In production, such events are usually sent to centralized logging systems (for example ELK, Datadog, or ClickHouse), where they drive queries, dashboards, and alerts.
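To get the `extra` fields from log_semantic onto the wire as JSON lines, a custom formatter is one common approach. A minimal sketch (the filtering of reserved LogRecord attributes is standard library behavior; the class name is an assumption):

```python
import json
import logging


# Sketch: a formatter that emits every custom field attached via `extra`
# as one JSON line per event, ready for a log shipper.
class SemanticJsonFormatter(logging.Formatter):
    # Attributes present on every LogRecord, which we do not ship.
    _RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

    def format(self, record: logging.LogRecord) -> str:
        fields = {
            key: value
            for key, value in record.__dict__.items()
            if key not in self._RESERVED
        }
        return json.dumps(fields, ensure_ascii=False, sort_keys=True)
```

Attached to a handler, this makes `logger.info("tool_call", extra={...})` produce exactly the kind of JSON event shown below.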

For example, one semantic event in JSON can look like this:

JSON
{
  "timestamp_ms": 1774106220000,
  "event": "policy_decision",
  "event_version": 1,
  "run_id": "run_9fd2",
  "trace_id": "tr_9fd2",
  "step_index": 3,
  "rule_id": "email_external_domain",
  "decision": "deny",
  "reason_code": "missing_user_confirmation"
}

Common Mistakes

Even with structured logs already in place, semantic logging often breaks because of the common mistakes described below.

Events named differently across services

When one action has different event names, log queries become unstable. As a result, it is harder to detect tool failures or early-stage tool spam in time.

Free text instead of normalized fields

Fields like "error": "something failed" are almost useless for analytics. It is better to use separate normalized fields such as status, error_class, and reason_code.

No event_version

Without event versioning, schema changes silently break dashboards, saved queries, and alerts. So schema evolution should be explicit.
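On the consumer side, event_version lets queries and code handle old and new schemas explicitly. A minimal sketch, assuming a hypothetical v1-to-v2 rename of latency_ms to duration_ms (this rename is invented for illustration):

```python
# Sketch: consumer-side handling of two schema versions of tool_result.
# The latency_ms -> duration_ms rename is a hypothetical example.
def read_latency(event: dict) -> int:
    version = event.get("event_version", 1)
    if version >= 2:
        return event["duration_ms"]
    return event["latency_ms"]
```

Dashboards can use the same branching logic, so old saved queries keep working while new events roll out.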

Raw prompts or raw args logged without redaction

This is a security and compliance risk. Safer choices are hashes or anonymized versions of the fields.
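A minimal redaction sketch, assuming a hypothetical SENSITIVE_KEYS set (in real systems this list usually comes from a data-classification policy, not a hardcoded set):

```python
import hashlib

# Sketch: replacing sensitive argument values with short hashes before
# logging. SENSITIVE_KEYS is an illustrative assumption.
SENSITIVE_KEYS = {"email", "password", "api_key", "prompt"}


def redact_args(args: dict) -> dict:
    redacted = {}
    for key, value in args.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # Truncated hash: enough to correlate runs, useless to reverse.
            redacted[key] = f"sha256:{digest[:12]}"
        else:
            redacted[key] = value
    return redacted
```

The truncated hash still lets you group runs by identical inputs, which is the main analytical use of args_hash in the vocabulary above.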

Self-Check

Before release, check the baseline: every event carries run_id and trace_id, event names come from the agreed vocabulary, status and error_class values are normalized, every event carries event_version, and sensitive arguments are redacted.

Without this baseline, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: How is semantic logging different from regular JSON logging?
A: JSON logging defines format only. Semantic logging defines meaning: stable event names, stable fields, and normalized values.

Q: Does semantic logging replace tracing?
A: No. Tracing shows execution path, while semantic logging makes events on that path understandable for search, alerts, and analytics.

Q: What is the minimum semantic logging for a first production release?
A: Baseline event vocabulary (run_started, tool_call, tool_result, run_finished), stable run_id/trace_id, status, error_class, and stop_reason.

Q: Do we need to migrate all old logs immediately?
A: No. Start with new events and critical run paths, then migrate legacy formats gradually.

Next pages on this topic:

⏱️ 7 min read • Updated March 20, 2026 • Difficulty: ★★★
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.