Agent logging

Agent logging that helps during incidents: trace IDs, tool-call events, stop reasons, redaction strategy, and actionable log structure.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Which events to log first
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
  8. Only final answer is logged
  9. No stable identifiers (run_id, trace_id)
  10. Raw prompts or raw args are logged without redaction
  11. tool_result and stop_reason are not logged
  12. Self-Check
  13. FAQ
  14. Related Pages

Idea In 30 Seconds

Agent logging answers one simple question: what exactly happened during a run.

To do that, you need structured events correlated with run_id and trace_id.

Without this, an incident usually shows only the final answer, not the path that produced it.

Core Problem

In a regular backend, a few request logs are often enough.

In agent systems, one request can include reasoning, tool calls, retries, and multiple model steps. If you log only the final answer, it becomes hard to see where exactly the system broke.

In production this usually looks like:

  • user reports a wrong answer;
  • costs or latency rise in waves;
  • logs contain an isolated error without run context.

That is why agents need structured event logging across the full run lifecycle, not scattered ad-hoc logs.

How It Works

The baseline idea is simple: each important step is logged as a separate structured event.

Minimum for each event:

  • run_id and trace_id for correlation;
  • event (what happened);
  • timestamp;
  • status (ok / error) where relevant;
  • key step fields (tool, latency, stop_reason, etc.).
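The minimum above can be sketched as a small helper that builds one structured event. This is a sketch, not a fixed schema; `make_event` is a hypothetical helper name, and the exact field set is up to your team.

```python
import time


def make_event(event, run_id, trace_id, status=None, **fields):
    """Build a minimal structured event with the baseline fields."""
    record = {
        "event": event,        # what happened
        "run_id": run_id,      # correlation across the whole run
        "trace_id": trace_id,
        "timestamp": time.time(),
    }
    if status is not None:
        record["status"] = status  # ok / error, where relevant
    record.update(fields)  # step-specific fields: tool, latency_ms, stop_reason, ...
    return record


event = make_event(
    "tool_result",
    run_id="run_9fd2",
    trace_id="tr_9fd2",
    status="ok",
    tool="search_docs",
    latency_ms=410,
)
```

Keeping the baseline fields in one place makes it hard to emit an event that cannot be correlated later.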

Which events to log first

Event        | What to record
------------ | ----------------------------------------------
run_started  | run_id, trace_id, request_id, user_id
agent_step   | step_type, step_index, tool
tool_call    | tool_name, args_hash
tool_result  | tool_name, latency_ms, status, error_class
llm_result   | model, token usage, latency_ms, status
run_finished | stop_reason, total_steps, total_latency_ms

In production systems, raw prompts and raw tool args are usually not written to logs without redaction. Most teams store a hash or an anonymized form instead.
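One hedged way to do this is to hash sensitive values in place before logging, so the log stays correlatable without carrying the data itself. The key list below is illustrative; which fields count as sensitive depends on your system.

```python
import hashlib

# Illustrative list; the real set of sensitive keys is system-specific.
SENSITIVE_KEYS = {"email", "phone", "api_key", "password"}


def redact_args(args):
    """Replace sensitive values with a short hash so logs stay correlatable."""
    redacted = {}
    for key, value in args.items():
        if key in SENSITIVE_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            redacted[key] = f"redacted:{digest[:12]}"  # same input -> same marker
        else:
            redacted[key] = value
    return redacted


safe_args = redact_args({"query": "billing docs", "email": "user@example.com"})
```

Because the hash is stable, two runs with the same redacted value can still be matched during an incident, without the raw value ever reaching the logs.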

When To Use

Deep logging is not always necessary.

For a simple single-shot scenario, minimal request -> response logs may be enough.

But once you have tools, retries, multiple steps, or meaningful cost, it becomes difficult without structured logging to:

  • debug incidents;
  • explain costs;
  • configure alerts reliably.
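Explaining costs, for example, becomes a simple aggregation over llm_result events. A sketch, assuming token usage is logged as a dict; the field names `prompt_tokens` and `completion_tokens` are assumptions and depend on your provider.

```python
def run_token_totals(events):
    """Sum token usage per run from llm_result events."""
    totals = {}
    for e in events:
        if e.get("event") != "llm_result":
            continue  # only model steps carry token usage
        usage = e.get("token_usage") or {}
        run = totals.setdefault(e["run_id"], {"prompt": 0, "completion": 0})
        run["prompt"] += usage.get("prompt_tokens", 0)
        run["completion"] += usage.get("completion_tokens", 0)
    return totals


events = [
    {"event": "llm_result", "run_id": "r1",
     "token_usage": {"prompt_tokens": 100, "completion_tokens": 20}},
    {"event": "tool_result", "run_id": "r1"},
    {"event": "llm_result", "run_id": "r1",
     "token_usage": {"prompt_tokens": 50, "completion_tokens": 10}},
]
totals = run_token_totals(events)
```

The same shape of aggregation works for latency and step counts, which is what makes alerting on these events practical.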

Implementation Example

Below is a simplified structured-logging example for a runtime and tool gateway. In this example raw args are not written to logs: an args_hash is used instead. In the flow below, agent_step records the step itself, while tool_call and tool_result separately capture the start and result of a tool call.

PYTHON
import hashlib
import json
import logging
import time
import uuid

logger = logging.getLogger("agent")


def stable_hash(value):
    payload = json.dumps(
        value,
        sort_keys=True,
        ensure_ascii=False,
        default=str,  # for datetime and complex types; in critical systems prefer a stable format (for example ISO 8601)
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def log_event(event, **fields):
    logger.info(event, extra={"event": event, **fields})


def run_agent(agent, task, user_id=None, request_id=None):
    run_id = str(uuid.uuid4())
    trace_id = str(uuid.uuid4())
    started_at = time.time()
    steps = 0
    stop_reason = "max_steps"
    run_status = "ok"

    log_event(
        "run_started",
        run_id=run_id,
        trace_id=trace_id,
        user_id=user_id,
        request_id=request_id,
        task_hash=stable_hash(task),
    )

    try:
        for step in agent.iter(task):  # step: reasoning or tool execution
            steps += 1
            step_started_at = time.time()
            step_type = step.type
            tool_name = getattr(step, "tool_name", None)

            log_event(
                "agent_step",
                run_id=run_id,
                trace_id=trace_id,
                step_index=steps,
                step_type=step_type,
                tool=tool_name,
            )

            if step_type == "tool_call":
                args = getattr(step, "args", {})
                log_event(
                    "tool_call",
                    run_id=run_id,
                    trace_id=trace_id,
                    tool=tool_name,
                    args_hash=stable_hash(args),
                )

            try:
                result = step.execute()
                latency_ms = int((time.time() - step_started_at) * 1000)

                if step_type == "tool_call":
                    log_event(
                        "tool_result",
                        run_id=run_id,
                        trace_id=trace_id,
                        tool=tool_name,
                        latency_ms=latency_ms,
                        status="ok",
                    )
                else:
                    token_usage = getattr(result, "token_usage", None)
                    log_event(
                        "llm_result",
                        run_id=run_id,
                        trace_id=trace_id,
                        step_type=step_type,
                        model=getattr(step, "model", None),
                        token_usage=token_usage,
                        latency_ms=latency_ms,
                        status="ok",
                    )
            except Exception as error:
                latency_ms = int((time.time() - step_started_at) * 1000)
                result_event = "tool_result" if step_type == "tool_call" else "llm_result"
                log_event(
                    result_event,
                    run_id=run_id,
                    trace_id=trace_id,
                    step_type=step_type,
                    tool=tool_name,
                    model=getattr(step, "model", None),
                    latency_ms=latency_ms,
                    status="error",
                    error_class=type(error).__name__,
                    error_message=str(error),
                )
                run_status = "error"
                stop_reason = "tool_error" if step_type == "tool_call" else "step_error"
                raise

            if result.is_final:
                stop_reason = "completed"
                return result  # run_finished is still logged by the finally block
    finally:
        log_event(
            "run_finished",
            run_id=run_id,
            trace_id=trace_id,
            status=run_status,
            stop_reason=stop_reason,
            total_steps=steps,
            total_latency_ms=int((time.time() - started_at) * 1000),
        )

In production, these events are usually sent to a centralized logging system (for example ELK, Datadog, or ClickHouse), and then used for dashboards and alerts.
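Note that `log_event` passes fields via `extra`, which attaches them to the LogRecord but does not make them appear in output by itself: a plain formatter would drop them. A minimal sketch of a JSON formatter that serializes those extras into one JSON line per event (the class name is illustrative):

```python
import json
import logging


class JsonEventFormatter(logging.Formatter):
    """Serialize LogRecord extras (event, run_id, trace_id, ...) into one JSON line."""

    # Attributes every LogRecord has by default; everything else came from `extra`.
    BASE_ATTRS = set(
        logging.LogRecord("", 0, "", 0, "", (), None).__dict__
    ) | {"message"}

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
        }
        for key, value in record.__dict__.items():
            if key not in self.BASE_ATTRS:
                payload[key] = value  # event, run_id, trace_id, tool, ...
        return json.dumps(payload, ensure_ascii=False, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonEventFormatter())
agent_logger = logging.getLogger("agent")
agent_logger.addHandler(handler)
agent_logger.setLevel(logging.INFO)

# Emits one JSON line carrying the extra fields.
agent_logger.info("tool_result", extra={"event": "tool_result", "run_id": "run_9fd2"})
```

One JSON object per line is the shape most log pipelines (ELK, Datadog, ClickHouse ingestion) expect, so no further parsing step is needed.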

This example is enough to:

  • find a problematic tool call;
  • calculate per-step latency;
  • understand why the run stopped.

For example, one JSON log record can look like this:

JSON
{
  "timestamp": "2026-03-21T15:17:00Z",
  "event": "tool_result",
  "run_id": "run_9fd2",
  "trace_id": "tr_9fd2",
  "tool": "search_docs",
  "latency_ms": 410,
  "status": "ok"
}
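With one JSON object per line, reconstructing a run is just a filter by run_id. A sketch over newline-delimited JSON; the helper names are illustrative.

```python
import json


def reconstruct_run(log_lines, run_id):
    """Keep only the events of one run, in emission order."""
    return [e for e in map(json.loads, log_lines) if e.get("run_id") == run_id]


def failure_point(events):
    """Return the first event with status=error, or None if the run was clean."""
    return next((e for e in events if e.get("status") == "error"), None)


lines = [
    '{"event": "run_started", "run_id": "run_9fd2"}',
    '{"event": "tool_result", "run_id": "run_9fd2", "tool": "search_docs",'
    ' "status": "error", "error_class": "TimeoutError"}',
    '{"event": "run_started", "run_id": "run_0000"}',
]
run = reconstruct_run(lines, "run_9fd2")
```

This is exactly the "reconstruct one problematic run, find the failure point" workflow from the self-check, reduced to two small functions.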

Common Mistakes

Even with logging already added, incidents are often hard to investigate because of the mistakes below.

Only final answer is logged

Without intermediate events, you cannot see how the agent reached the result. In this mode, even a simple incident takes too long to investigate.

No stable identifiers (run_id, trace_id)

When events are not correlated, you cannot reconstruct one full run. In production this often turns debugging into manual searching across services.
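One way to make the identifiers hard to forget is to carry run_id in a context variable and stamp it onto every record with a logging filter. A sketch; `RunContextFilter` and `start_run` are hypothetical names.

```python
import contextvars
import logging
import uuid

# Holds the current run's id for the duration of a run (also safe across async tasks).
run_id_var = contextvars.ContextVar("run_id", default=None)


class RunContextFilter(logging.Filter):
    """Stamp every record with the current run_id so correlation cannot be forgotten."""

    def filter(self, record):
        record.run_id = run_id_var.get()
        return True  # never drop the record, only enrich it


def start_run():
    run_id = str(uuid.uuid4())
    run_id_var.set(run_id)
    return run_id
```

With the filter attached to the agent logger, even log calls that forget to pass run_id explicitly still end up correlated.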

Raw prompts or raw args are logged without redaction

This is a direct risk of leaking personal or sensitive data. It is safer to log hashes, redacted fields, or anonymized versions.

tool_result and stop_reason are not logged

If tool_result and stop_reason are missing, it is hard to understand what exactly failed. These gaps often mask tool failure or an early phase of tool spam.
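Once tool_call events are logged, an early phase of tool spam is detectable with a simple counter. A sketch; the threshold value is an illustrative assumption, not a recommendation.

```python
from collections import Counter


def tool_call_counts(events):
    """Count tool_call events per tool within one run."""
    return Counter(e["tool"] for e in events if e.get("event") == "tool_call")


def looks_like_tool_spam(events, threshold=10):
    """Illustrative heuristic: flag a run where one tool is called more than `threshold` times."""
    return any(n > threshold for n in tool_call_counts(events).values())


events = [{"event": "tool_call", "tool": "search_docs"} for _ in range(12)]
```

The same counts can feed an alert, which is impossible if tool_call events are never written.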

Self-Check

Below is a short checklist for baseline agent logging before release.

  • every run gets a run_id and a trace_id;
  • run_started and run_finished are logged;
  • each agent_step carries step_index and step_type;
  • tool_call is logged with the tool name and args_hash;
  • tool_result is logged with latency_ms, status, and error_class;
  • llm_result captures model, token usage, and latency_ms;
  • stop_reason is recorded for every run;
  • raw prompts and raw args are redacted or hashed;
  • events are shipped to a centralized logging system.

If most of these are missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.

FAQ

Q: How is logging different from tracing?
A: Logging answers "what happened" and records events. Tracing answers "how exactly it happened" through step sequence and links.

Q: What should be logged first if logging is almost missing?
A: Start with basics: run_id, trace_id, run_started, tool_call, tool_result, run_finished, and stop_reason. This is already enough for baseline debugging.

Q: Can prompts be logged fully?
A: By default, no. In production, prompts often contain sensitive data. Safer options are hashes or redacted versions.

Q: How to know if logging is already sufficient?
A: If you can reconstruct one problematic run and find the failure point in 5-10 minutes, your baseline logging is working.

Next pages on this topic:

⏱️ 6 min read • Updated April 9, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.