Problem (why you're here)
In dev, your agent "works".
In prod, it does something weird once every 200 runs:
- a customer report says "it emailed the wrong thing"
- costs spike for 15 minutes
- the agent loops on a flaky API and times out
And you've got… basically nothing:
- one "final answer"
- a few console logs
- maybe a tool error string without context
So you do the worst kind of debugging: guesswork with a credit card attached.
This page is about logging that makes incidents boring again.
Why this fails in production
Agents fail like distributed systems because they are distributed systems:
- the model is an unreliable planner
- tools are side effects (HTTP/DB/ticketing/email)
- retries and timeouts create emergent behavior
If you don't log the loop, you can't answer basic incident questions:
- What tool calls happened? In what order?
- What arguments were used (or at least which args_hash)?
- What did the tool return (or what did we redact)?
- Why did the run stop (stop_reason)?
- Which user/request triggered it?
If you're not logging stop_reason, you're not "observing" anything. You're collecting vibes.
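Concretely, an event that can answer those questions is one flat JSON object that carries the shared IDs every time. A minimal sketch (field names are illustrative, not a standard):

```python
import json
import time

def make_event(name, run_id, trace_id, **fields):
    # One flat JSON object per event; the same IDs appear on every event,
    # so a single grep on run_id reconstructs the whole run.
    evt = {"event": name, "ts": time.time(), "run_id": run_id, "trace_id": trace_id}
    evt.update(fields)
    return json.dumps(evt, sort_keys=True)

# Example: the run-end event that records why the run stopped.
line = make_event("run_end", run_id="r-123", trace_id="t-456",
                  stop_reason="max_steps", tool_calls=7)
```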
Diagram: the minimum event pipeline
Real code: instrument the tool gateway (Python + JS)
Start with the boundary. Tools are where the money and damage live.
We log:
- run_id, trace_id, tool_name
- args_hash (not raw args by default)
- latency + status
- error_class (normalized)
And we make it hard to "forget" to log by forcing everything through a gateway.
```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


def stable_hash(obj: Any) -> str:
    raw = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass(frozen=True)
class RunCtx:
    run_id: str
    trace_id: str
    user_id: Optional[str] = None
    request_id: Optional[str] = None


class Logger:
    def event(self, name: str, fields: Dict[str, Any]) -> None: ...


class ToolGateway:
    def __init__(self, *, impls: Dict[str, Callable[..., Any]], logger: Logger):
        self.impls = impls
        self.logger = logger

    def call(self, ctx: RunCtx, name: str, args: Dict[str, Any]) -> Any:
        fn = self.impls.get(name)
        if not fn:
            self.logger.event("tool_call", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "args_hash": stable_hash(args),
                "ok": False,
                "error_class": "unknown_tool",
            })
            raise RuntimeError(f"unknown tool: {name}")

        t0 = time.time()
        self.logger.event("tool_call", {
            "run_id": ctx.run_id,
            "trace_id": ctx.trace_id,
            "tool": name,
            "args_hash": stable_hash(args),
        })
        try:
            out = fn(**args)
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": True,
            })
            return out
        except TimeoutError:
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": False,
                "error_class": "timeout",
            })
            raise
        except Exception as e:
            self.logger.event("tool_result", {
                "run_id": ctx.run_id,
                "trace_id": ctx.trace_id,
                "tool": name,
                "latency_ms": int((time.time() - t0) * 1000),
                "ok": False,
                "error_class": type(e).__name__,
            })
            raise
```

```js
import crypto from "node:crypto";

export function stableHash(obj) {
  // Note: JSON.stringify is not key-order stable; sort keys first if the
  // same logical args can arrive with different property order.
  const raw = JSON.stringify(obj);
  return crypto.createHash("sha256").update(raw).digest("hex");
}

export class ToolGateway {
  constructor({ impls = {}, logger }) {
    this.impls = impls;
    this.logger = logger;
  }

  call(ctx, name, args) {
    const fn = this.impls[name];
    const argsHash = stableHash(args);
    if (!fn) {
      this.logger.event("tool_call", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        args_hash: argsHash,
        ok: false,
        error_class: "unknown_tool",
      });
      throw new Error("unknown tool: " + name);
    }
    const t0 = Date.now();
    this.logger.event("tool_call", {
      run_id: ctx.run_id,
      trace_id: ctx.trace_id,
      tool: name,
      args_hash: argsHash,
    });
    try {
      const out = fn(args);
      this.logger.event("tool_result", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        latency_ms: Date.now() - t0,
        ok: true,
      });
      return out;
    } catch (e) {
      this.logger.event("tool_result", {
        run_id: ctx.run_id,
        trace_id: ctx.trace_id,
        tool: name,
        latency_ms: Date.now() - t0,
        ok: false,
        error_class: e?.name || "Error",
      });
      throw e;
    }
  }
}
```

If you're not already doing it, pair this with:
- budgets (/en/governance/budget-controls)
- tool dedupe to reduce spam (/en/failures/tool-spam)
- and unit tests that assert your stop reasons don't drift (/en/testing-evaluation/unit-testing-agents)
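The Logger above is only an interface. A minimal concrete sketch (the JSON-lines format and file path are assumptions, not part of any library) appends one JSON object per line:

```python
import json
import time

class JsonlLogger:
    """Appends one JSON object per line; trivial to grep during an incident."""

    def __init__(self, path):
        self.path = path

    def event(self, name, fields):
        # Same shape as the gateway emits: event name + timestamp + fields.
        record = {"event": name, "ts": time.time(), **fields}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
```

Wire it in as ToolGateway(impls=..., logger=JsonlLogger("events.jsonl")); swapping in a real log pipeline later only touches this one class.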
Real failure (incident-style, with numbers)
We shipped a "read-only" research agent that called http.get.
Everything looked fine until an upstream partner API started returning 200s with error payloads (yep). Our tool wrapper treated "200 == ok" and logged only "success".
Impact:
- ~18% of runs returned confidently wrong summaries for ~2 hours
- users filed ~30 tickets
- on-call time: ~4 hours to confirm it wasn't "the model hallucinating"
The fix was boring and effective:
- log normalized error_class and response validation failures
- store args_hash + latency so we could find hot spots
- add an alert: validation_fail_rate > 2% for 5 minutes
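The validation piece doesn't need a framework. A sketch of the normalization we mean (the expected_keys check is a stand-in for whatever schema your payloads actually need):

```python
def classify_response(status_code, payload, expected_keys=("data",)):
    # Normalize every outcome into an error_class; "200 == ok" is not enough,
    # because some upstreams return 200 with an error body.
    if status_code >= 500:
        return "upstream_5xx"
    if status_code >= 400:
        return "upstream_4xx"
    if not isinstance(payload, dict) or payload.get("error"):
        return "validation_fail"
    if not all(k in payload for k in expected_keys):
        return "validation_fail"
    return None  # genuinely ok

classify_response(200, {"error": "rate limited"})  # -> "validation_fail"
```

Log the returned error_class on the tool_result event and the validation_fail_rate alert above falls out for free.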
You don't need perfect logs. You need logs that answer "what happened?" in under 10 minutes.
Trade-offs
- Logging raw tool args is useful and also how you leak PII. Default to args_hash.
- Storing full tool results makes debugging easy and compliance painful. Prefer sampling + redaction.
- Too much logging is its own outage. Start with events you alert on.
When NOT to do this
- If the agent runs only in a trusted local environment, you can be lazier (for a while).
- If you're still prototyping the loop shape daily, keep logs lightweight but consistent (IDs + stop reasons).
- Don't build a custom tracing system if you can't keep it running. Use something boring.
Copy-paste checklist
- [ ] run_id, trace_id, request_id, user_id on every event
- [ ] tool_call + tool_result events (name, args_hash, latency, ok, error_class)
- [ ] stop_reason + budgets at end of run
- [ ] Redaction policy (PII, secrets) + default to storing hashes
- [ ] Alerts: spikes in tool calls/run, timeouts, validation fails
- [ ] One "incident query" per top failure (saved search / dashboard)
Safe default config snippet (YAML)
```yaml
logging:
  ids:
    run_id: required
    trace_id: required
    request_id: required
  tool_calls:
    enabled: true
    store_args: false
    store_args_hash: true
    store_results: "sampled"  # none|sampled|full
    result_sample_rate: 0.01
  pii:
    redact_fields: ["email", "phone", "token", "authorization", "cookie"]
  stop_reasons:
    enabled: true
  alerts:
    tool_calls_per_run_p95: { warn: 10, critical: 20 }
    timeout_rate: { warn: 0.02, critical: 0.05 }
    validation_fail_rate: { warn: 0.02, critical: 0.05 }
```
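For the rare cases where you do store args or results, a redaction pass matching the pii.redact_fields list above can be a small recursive function (a sketch; real redaction also needs to catch secrets embedded inside string values):

```python
REDACT_FIELDS = {"email", "phone", "token", "authorization", "cookie"}

def redact(obj):
    # Recursively replace values of sensitive keys before anything is logged.
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if k.lower() in REDACT_FIELDS else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj
```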
Implement in OnceOnly (optional)
```python
# onceonly-python: governed audit logs + metrics
import os
from onceonly import OnceOnly

client = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"])
agent_id = "support-bot"

# Pull last 50 actions (includes args_hash + decisions)
for e in client.gov.agent_logs(agent_id, limit=50):
    print(e.ts, e.tool, e.decision, e.args_hash, e.spend_usd, e.reason)

# Rollups for dashboards/alerts
m = client.gov.agent_metrics(agent_id, period="day")
print("spend_usd=", m.total_spend_usd, "blocked=", m.blocked_actions)
```
FAQ
Q: Should I log raw tool args?
A: Default to no. Log args_hash + safe fields. Flip raw args on only for short incident windows (with redaction), then turn it off again.
Q: What's the single most useful field?
A: A stable run_id/trace_id on every event.
Q: How do I detect loops quickly?
A: Alert on tool_calls/run and repeated (tool, args_hash) within a run. If you haven't read /failures/tool-spam, do that next.
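That repeated-pair check is a few lines over the events you already emit (the threshold of 3 is illustrative):

```python
from collections import Counter

def loop_suspects(events, threshold=3):
    # Count repeated (tool, args_hash) pairs among a single run's tool_call
    # events; the same tool with the same args N times is a likely loop.
    counts = Counter((e["tool"], e["args_hash"])
                     for e in events if e.get("event") == "tool_call")
    return [pair for pair, n in counts.items() if n >= threshold]
```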
Q: Do I need distributed tracing?
A: If your tools hit other services, yes. Start with trace IDs + spans around tool calls before going fancy.
Related pages
- Foundations: Tool calling · What makes an agent production-ready
- Failures: Tool spam · Budget explosion
- Governance: Budget controls · Kill switch design
- Testing: Unit testing agents
- Production stack: AI Agent Production Stack