Semantic logging for agents

Semantic logging for agents: consistent event taxonomy, structured fields, and queryable traces for debugging and governance audits.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Minimal Event Vocabulary
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
  8. Self-Check
  9. FAQ
  10. Related Pages

Idea In 30 Seconds

Semantic logging for agents means that events have not only a JSON format but also a stable meaning.

That means equivalent steps in different runs are logged the same way: same event, same key fields, same statuses.

This makes logs usable for search, alerts, analytics, and debugging in production.

Core Problem

Many teams already write structured logs, but this is often not enough.

Across services and agent versions, the same event can have different names and fields: tool_called, call_tool, tool.invoke. As a result, logs exist, but comparing runs is hard.

Semantic logging is a design-level solution for events, not only a technical layer on top of logs.

In production, this usually looks like:

  • log-system queries return too much noise;
  • alerts behave inconsistently because event names differ;
  • during incidents, teams manually map events from multiple formats.

That is why agent systems need a shared event vocabulary and stable field schema.
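Before a shared vocabulary is enforced everywhere, teams often need a bridge from legacy names to canonical ones. A minimal sketch, assuming a hypothetical EVENT_ALIASES table (the alias names are the examples from this page, not a standard):

```python
# Sketch: mapping legacy event names onto one canonical vocabulary.
# EVENT_ALIASES is an illustrative assumption; real systems drive this
# table from their own historical log formats.
EVENT_ALIASES = {
    "tool_called": "tool_call",
    "call_tool": "tool_call",
    "tool.invoke": "tool_call",
    "tool_call": "tool_call",
}


def canonical_event_name(raw_name: str) -> str:
    # Unknown names are passed through unchanged so they can be
    # spotted in queries and added to the mapping later.
    return EVENT_ALIASES.get(raw_name, raw_name)
```

This lets dashboards and alerts query one event name while legacy producers are migrated gradually.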

How It Works

Semantic logging relies on three things:

  • a consistent event vocabulary (event taxonomy);
  • stable fields for each event;
  • normalized values (status, error_class, stop_reason).

An event taxonomy is a contract between runtime, logs, dashboards, and alerts; breaking this contract breaks observability. status usually has a constrained set of values (for example: ok, error, timeout, cancelled; the implementation example below uses a simplified ok / error pair).

Semantic logging does not replace tracing, it complements it. It makes events not only visible, but comparable across services and releases. Logging answers "what happened", tracing answers "how it happened", and semantic logging answers "what it means".

Minimal Event Vocabulary

Event — Semantic meaning — Key fields
run_started — a new run started — run_id, trace_id, request_id, task_hash
agent_step — agent moved to the next step — step_index, step_type, actor
tool_call — start of a tool call — tool_name, args_hash
tool_result — tool call result — tool_name, latency_ms, status, error_class
llm_result — model-step result — model, token_usage, latency_ms, status
policy_decision — policy/guardrails decision — rule_id, decision, reason_code
run_finished — run finished — stop_reason, total_steps, total_latency_ms

policy_decision helps you see not only failures, but also blocking causes and guardrail decisions.

event_version lets you evolve event schema without breaking existing dashboards and alerts.
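The vocabulary above can be enforced mechanically before events are emitted. A minimal sketch, assuming a hand-written REQUIRED_FIELDS table that mirrors a subset of the table above (the field sets are this page's examples, not a standard):

```python
# Sketch: checking that an event carries its required key fields.
# REQUIRED_FIELDS mirrors the vocabulary table; extend it per event.
REQUIRED_FIELDS = {
    "run_started": {"run_id", "trace_id", "request_id", "task_hash"},
    "tool_call": {"run_id", "tool_name", "args_hash"},
    "tool_result": {"run_id", "tool_name", "latency_ms", "status"},
    "run_finished": {"run_id", "stop_reason", "total_steps", "total_latency_ms"},
}


def validate_event(event: dict) -> list:
    # Returns the sorted list of missing required fields (empty = valid).
    required = REQUIRED_FIELDS.get(event.get("event"), set())
    return sorted(required - event.keys())
```

Running this check in CI or in a staging log pipeline catches schema drift before it reaches dashboards.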

When To Use

Full semantic logging is not always required.

For a simple single-shot scenario without tools and without execution loop, basic logs are often enough.

But semantic logging becomes critical when:

  • the system has multiple agents or services;
  • you need stable alerts and dashboards;
  • behavior must be compared across releases;
  • incidents must be analyzed fast, without manual event mapping.

Implementation Example

Below is a simplified example of semantic logging in runtime. The idea is simple: log only events from an agreed vocabulary, and normalize field values.

PYTHON
import hashlib
import json
import logging
import time
import uuid
from enum import StrEnum

logger = logging.getLogger("agent")


class EventName(StrEnum):
    RUN_STARTED = "run_started"
    AGENT_STEP = "agent_step"
    TOOL_CALL = "tool_call"
    TOOL_RESULT = "tool_result"
    LLM_RESULT = "llm_result"
    POLICY_DECISION = "policy_decision"
    RUN_FINISHED = "run_finished"


def stable_hash(value):
    # default=str gives baseline compatibility
    # in critical systems, explicit serialization is better (for example ISO 8601)
    payload = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        default=str,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def normalize_status(ok):
    return "ok" if ok else "error"


def normalize_error(error):
    if error is None:
        return None
    return type(error).__name__


def log_semantic(event_name: EventName, **fields):
    logger.info(
        event_name.value,
        extra={
            "event": event_name.value,
            "event_version": 1,
            "timestamp_ms": int(time.time() * 1000),
            **fields,
        },
    )


def run_agent(agent, task, request_id=None):
    run_id = str(uuid.uuid4())
    trace_id = str(uuid.uuid4())
    started_at = time.time()
    step_index = 0
    stop_reason = "max_steps"
    run_status = "ok"

    log_semantic(
        EventName.RUN_STARTED,
        run_id=run_id,
        trace_id=trace_id,
        request_id=request_id,
        task_hash=stable_hash(task),
    )

    try:
        for step in agent.iter(task):  # step: reasoning or tool execution
            step_index += 1
            step_started_at = time.time()
            step_type = step.type
            tool_name = getattr(step, "tool_name", None)

            log_semantic(
                EventName.AGENT_STEP,
                run_id=run_id,
                trace_id=trace_id,
                step_index=step_index,
                step_type=step_type,
                actor=getattr(step, "actor", "agent_runtime"),
            )

            if step_type == "tool_call":
                args = getattr(step, "args", {})
                log_semantic(
                    EventName.TOOL_CALL,
                    run_id=run_id,
                    trace_id=trace_id,
                    step_index=step_index,
                    tool_name=tool_name,
                    args_hash=stable_hash(args),
                )

            try:
                result = step.execute()
                latency_ms = int((time.time() - step_started_at) * 1000)

                if step_type == "tool_call":
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                        error_class=None,
                    )
                else:
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        token_usage=getattr(result, "token_usage", None),
                        latency_ms=latency_ms,
                        status=normalize_status(True),
                    )

                # policy_decision is logged after the step
                # (when result or error is known)
                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )

            except Exception as error:
                latency_ms = int((time.time() - step_started_at) * 1000)
                run_status = "error"

                if step_type == "tool_call":
                    stop_reason = "tool_error"
                    log_semantic(
                        EventName.TOOL_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        tool_name=tool_name,
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )
                else:
                    stop_reason = "step_error"
                    log_semantic(
                        EventName.LLM_RESULT,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        model=getattr(step, "model", None),
                        latency_ms=latency_ms,
                        status=normalize_status(False),
                        error_class=normalize_error(error),
                    )

                if getattr(step, "policy_decision", None) is not None:
                    decision = step.policy_decision
                    log_semantic(
                        EventName.POLICY_DECISION,
                        run_id=run_id,
                        trace_id=trace_id,
                        step_index=step_index,
                        rule_id=decision.rule_id,
                        decision=decision.value,
                        reason_code=decision.reason_code,
                    )

                raise

            if result.is_final:
                stop_reason = "completed"
                break

    finally:
        log_semantic(
            EventName.RUN_FINISHED,
            run_id=run_id,
            trace_id=trace_id,
            status=run_status,
            stop_reason=stop_reason,
            total_steps=step_index,
            total_latency_ms=int((time.time() - started_at) * 1000),
        )

In production, such events are usually sent to centralized logging systems (for example ELK, Datadog, or ClickHouse), where they drive queries, dashboards, and alerts.
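To get the `extra` fields from log_semantic onto the wire as JSON lines, a custom formatter is one common approach. A minimal sketch (the filtering of reserved LogRecord attributes is standard library behavior; the class name is an assumption):

```python
import json
import logging


# Sketch: a formatter that emits every custom field attached via `extra`
# as one JSON line per event, ready for a log shipper.
class SemanticJsonFormatter(logging.Formatter):
    # Attributes present on every LogRecord, which we do not ship.
    _RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

    def format(self, record: logging.LogRecord) -> str:
        fields = {
            key: value
            for key, value in record.__dict__.items()
            if key not in self._RESERVED
        }
        return json.dumps(fields, ensure_ascii=False, sort_keys=True)
```

Attached to a handler, this makes `logger.info("tool_call", extra={...})` produce exactly the kind of JSON event shown below.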

For example, one semantic event in JSON can look like this:

JSON
{
  "timestamp_ms": 1774106220000,
  "event": "policy_decision",
  "event_version": 1,
  "run_id": "run_9fd2",
  "trace_id": "tr_9fd2",
  "step_index": 3,
  "rule_id": "email_external_domain",
  "decision": "deny",
  "reason_code": "missing_user_confirmation"
}

Common Mistakes

Even with structured logs already in place, semantic logging often breaks because of the common mistakes described below.

Events named differently across services

When one action has different event names, log queries become unstable. As a result, it is harder to detect tool failures or early-stage tool spam in time.

Free text instead of normalized fields

Fields like "error": "something failed" are almost useless for analytics. It is better to use separate normalized fields such as status, error_class, and reason_code.

No event_version

Without event versioning, schema changes silently break dashboards, saved queries, and alerts. So schema evolution should be explicit.
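On the consumer side, event_version lets queries and code handle old and new schemas explicitly. A minimal sketch, assuming a hypothetical v1-to-v2 rename of latency_ms to duration_ms (this rename is invented for illustration):

```python
# Sketch: consumer-side handling of two schema versions of tool_result.
# The latency_ms -> duration_ms rename is a hypothetical example.
def read_latency(event: dict) -> int:
    version = event.get("event_version", 1)
    if version >= 2:
        return event["duration_ms"]
    return event["latency_ms"]
```

Dashboards can use the same branching logic, so old saved queries keep working while new events roll out.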

Raw prompts or raw args logged without redaction

This is a security and compliance risk. Safer choices are hashes or anonymized versions of the fields.
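A minimal redaction sketch, assuming a hypothetical SENSITIVE_KEYS set (in real systems this list usually comes from a data-classification policy, not a hardcoded set):

```python
import hashlib

# Sketch: replacing sensitive argument values with short hashes before
# logging. SENSITIVE_KEYS is an illustrative assumption.
SENSITIVE_KEYS = {"email", "password", "api_key", "prompt"}


def redact_args(args: dict) -> dict:
    redacted = {}
    for key, value in args.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            # Truncated hash: enough to correlate runs, useless to reverse.
            redacted[key] = f"sha256:{digest[:12]}"
        else:
            redacted[key] = value
    return redacted
```

The truncated hash still lets you group runs by identical inputs, which is the main analytical use of args_hash in the vocabulary above.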

Self-Check

Before release, check the baseline: every event carries run_id and trace_id, event names come from the agreed vocabulary, status and error_class values are normalized, every event carries event_version, and sensitive arguments are redacted.

Without this baseline, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: How is semantic logging different from regular JSON logging?
A: JSON logging defines format only. Semantic logging defines meaning: stable event names, stable fields, and normalized values.

Q: Does semantic logging replace tracing?
A: No. Tracing shows execution path, while semantic logging makes events on that path understandable for search, alerts, and analytics.

Q: What is the minimum semantic logging for a first production release?
A: Baseline event vocabulary (run_started, tool_call, tool_result, run_finished), stable run_id/trace_id, status, error_class, and stop_reason.

Q: Do we need to migrate all old logs immediately?
A: No. Start with new events and critical run paths, then migrate legacy formats gradually.

Next pages on this topic:

⏱️ 7 min read • Updated March 20, 2026 • Difficulty: ★★★
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.