Agent Metrics

Agent metrics that matter in production: success rate, tool-call volume, retries, cost per run, and drift signals for early detection.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Metrics For Agents
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
  8. Only averages, without p95/p99
  9. High-cardinality labels
  10. No stop_reason metrics
  11. No alerts for key anomalies
  12. Self-Check
  13. FAQ
  14. Related Pages

Idea In 30 Seconds

Agent metrics show system health across many runs, not a single case.

They answer whether the system is stable, whether costs are growing, and where degradation starts.

Without metrics, problems are usually discovered too late: after user complaints or budget overruns.

Core Problem

Logs and tracing explain one specific incident well.

But in production you need trends: what happens to latency, token usage, error rate, and tool calls across releases. Without metrics, systems can degrade gradually and stay unnoticed for a long time.

In production, this usually looks like:

  • average response time seems fine, but p95 is already growing;
  • token costs spike in waves after a release;
  • tool calls per run keep increasing;
  • the team learns about problems only after an incident.

That is why metrics are a separate observability signal: they help detect anomalies early and react before major failures.

How It Works

Metrics are aggregated numeric signals that show system behavior over time.

Most systems track three metric layers:

  • run level (run_count, success_rate, stop_reason);
  • steps and tools (tool_calls_per_run, tool_error_rate, step_count);
  • cost and speed (token_usage, latency_p50/p95, and cost_per_run, which is usually computed from token_usage in dashboards or metric queries).

Metrics provide early warning when the system starts degrading. Logs answer "what happened", and tracing answers "how exactly it happened in a specific run".
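These three layers can be illustrated with a toy aggregation over raw run records. The records and field names below are made up for the sketch; in practice, this aggregation happens inside the metrics backend.

```python
from statistics import quantiles

# Hypothetical raw run records (illustrative values).
runs = [
    {"status": "ok", "latency_ms": 820, "tool_calls": 3, "tool_errors": 0},
    {"status": "ok", "latency_ms": 1100, "tool_calls": 2, "tool_errors": 0},
    {"status": "error", "latency_ms": 4900, "tool_calls": 7, "tool_errors": 2},
    {"status": "ok", "latency_ms": 950, "tool_calls": 3, "tool_errors": 1},
]

# Layer 1: run level
run_count = len(runs)
success_rate = sum(r["status"] == "ok" for r in runs) / run_count  # 0.75

# Layer 2: steps and tools
tool_calls = sum(r["tool_calls"] for r in runs)
tool_error_rate = sum(r["tool_errors"] for r in runs) / tool_calls  # 0.2

# Layer 3: cost and speed (tail latency, not just the mean)
latency_p95 = quantiles([r["latency_ms"] for r in runs], n=20)[-1]
```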

Typical Production Metrics For Agents

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| run_count | number of runs in a period | load and traffic-volume control |
| success_rate | share of successful runs | fast stability check |
| latency_p50 / latency_p95 | typical and tail latency | performance degradation detection |
| token_usage_per_run | how many tokens each run consumes | LLM cost control |
| cost_per_run | estimated cost of one run | budget control and cost forecasting |
| tool_calls_per_run | how many tool calls each run makes | detection of excessive or cyclic calls |
| tool_error_rate | frequency of tool failures | early detection of unstable dependencies |
| stop_reason_distribution | distribution of run termination reasons | control of limits and common failures |

To make metrics useful, they are usually segmented by release, model, or tool.

Important: do not add high-cardinality fields (run_id, request_id, user_id) to metric labels, or metric storage will overload quickly.
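The risk is easy to estimate: the number of time series for a metric is roughly the product of the distinct values of its labels. A quick sketch with illustrative label counts:

```python
from math import prod

# Illustrative distinct-value counts for the labels of a run counter.
safe_labels = {"status": 2, "stop_reason": 6, "release": 10}
series_safe = prod(safe_labels.values())  # 120 time series, cheap to store

# Adding one high-cardinality label multiplies the series count.
with_run_id = dict(safe_labels, run_id=1_000_000)
series_unsafe = prod(with_run_id.values())  # 120,000,000 time series
```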

When To Use

A broad metrics set is not always necessary.

For an early prototype, basic run and error counters can be enough.

But metrics become critical when:

  • the agent system is already in production;
  • you have latency or quality SLOs;
  • token and tool-call costs must be controlled;
  • releases happen often and regressions must be seen before incidents.

Implementation Example

Below is a simplified runtime instrumentation example in Prometheus style. In real systems, the same principles work for Datadog, Grafana Cloud, CloudWatch, and other platforms.

PYTHON
import time
from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)
# success_rate is derived at query time, e.g. in PromQL:
#   sum(rate(agent_run_total{status="ok"}[5m])) / sum(rate(agent_run_total[5m]))

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000),
)

STEP_COUNT = Histogram(
    "agent_steps_per_run",
    "Number of steps per run",
    ["release"],
    buckets=(1, 2, 4, 8, 12, 16, 24, 32),
)

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

LLM_ERROR_TOTAL = Counter(
    "agent_llm_error_total",
    "Total LLM step errors by model and class",
    ["model", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool call latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)


def observe_llm_usage(model, token_usage, release):
    # most LLM providers return token usage in the response
    if not token_usage:
        return
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="prompt", release=release).inc(
        token_usage.get("prompt_tokens", 0)
    )
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="completion", release=release).inc(
        token_usage.get("completion_tokens", 0)
    )


def run_agent(agent, task, release="2026-03-21"):
    started_at = time.time()
    steps = 0
    stop_reason = "max_steps"
    run_status = "ok"

    try:
        for step in agent.iter(task):
            steps += 1
            step_type = step.type
            result = None  # may stay None for unknown step types (guarded by the check below)

            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_started_at = time.time()
                try:
                    result = step.execute()
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                except Exception as error:
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                    TOOL_ERROR_TOTAL.labels(
                        tool=tool_name,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                    run_status = "error"
                    stop_reason = "tool_error"
                    raise
            else:
                try:
                    result = step.execute()
                    observe_llm_usage(
                        model=getattr(step, "model", "unknown"),
                        token_usage=getattr(result, "token_usage", None),
                        release=release,
                    )
                except Exception as error:
                    LLM_ERROR_TOTAL.labels(
                        model=getattr(step, "model", "unknown"),
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    run_status = "error"
                    stop_reason = "step_error"
                    raise

            # tool results may not define is_final, so read it defensively
            if result is not None and getattr(result, "is_final", False):
                stop_reason = "completed"
                break

    finally:
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()
        RUN_LATENCY_MS.labels(release=release).observe((time.time() - started_at) * 1000)
        STEP_COUNT.labels(release=release).observe(steps)
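
The run_agent function above assumes a minimal agent interface: iter(task) yields steps that expose type and execute(), plus optional tool_name and model attributes. A hypothetical stub that satisfies this contract (all names below are illustrative):

```python
class FinalAnswer:
    # Mimics an LLM step result carrying token usage and a final flag.
    is_final = True
    token_usage = {"prompt_tokens": 120, "completion_tokens": 40}


class LLMStep:
    type = "llm_call"
    model = "demo-model"

    def execute(self):
        return FinalAnswer()


class StubAgent:
    def iter(self, task):
        # A real agent would interleave tool_call and llm_call steps.
        yield LLMStep()


# run_agent(StubAgent(), "summarize yesterday's incidents")
# would record one run with status="ok" and stop_reason="completed".
```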

In production, these metrics usually feed dashboards and alerts.

This is how they can look together on a real dashboard:

| Metric | Current value | Trend | Status |
| --- | --- | --- | --- |
| latency_p95 | 2.4s | +38% in 30 min | warning: above SLO |
| tool_error_rate | 7.2% | +4.1pp in 15 min | critical: alert |
| token_usage_per_run | 8.9k | +22% after release | warning: anomaly |
| success_rate | 91.4% | -5.3pp in 1 hour | warning: drop |

For error_class, it is better to map raw exception names into a small normalized dictionary of values to avoid unnecessary cardinality.

For example, one metric line can look like this:

TEXT
agent_tool_call_total{tool="search_docs",status="error",release="2026-03-21"} 47
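
The error_class normalization mentioned above can be sketched as follows; the mapping is illustrative and would be maintained by hand for your own tools.

```python
# Illustrative dictionary of normalized error classes.
KNOWN_ERROR_CLASSES = {
    "TimeoutError": "timeout",
    "ConnectionError": "connection",
    "PermissionError": "permission",
    "ValueError": "bad_input",
}


def normalize_error_class(error: Exception) -> str:
    # Collapse every unknown exception type into one bucket
    # so the label's cardinality stays bounded.
    return KNOWN_ERROR_CLASSES.get(type(error).__name__, "other")
```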

Common Mistakes

Even when metrics are added, they often fail to help because of common mistakes below.

Only averages, without p95/p99

Averages hide the long tail of slow runs. For production, the minimum baseline is p50 and p95.
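
A tiny synthetic illustration: the average of these latencies looks acceptable, while the p95 already reveals the slow tail.

```python
from statistics import mean, quantiles

# 95 fast runs and 5 very slow ones (synthetic data).
latencies_ms = [400] * 95 + [9000] * 5

avg = mean(latencies_ms)                 # 830.0 ms, looks fine
p95 = quantiles(latencies_ms, n=100)[94]  # ~8570 ms, exposes the tail
```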

High-cardinality labels

Labels like run_id or user_id sharply increase load on metric backends. It is better to segment by release, model, or tool.

No stop_reason metrics

Without stop_reason distribution, it is hard to understand why runs end with max_steps or tool_error. This often hides tool failure and early signs of budget explosion.

No alerts for key anomalies

Metrics without alerts turn into passive charts. Without alerts, it is easy to miss tool spam or a sharp success-rate drop after a release.
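
As a hedged sketch, Prometheus alerting rules over the metrics from the example above might look like this; the thresholds, windows, and rule names are illustrative, not recommendations.

```yaml
groups:
  - name: agent-alerts
    rules:
      # Fires when fewer than 95% of runs succeed over 15 minutes.
      - alert: AgentSuccessRateDrop
        expr: |
          sum(rate(agent_run_total{status="ok"}[15m]))
            / sum(rate(agent_run_total[15m])) < 0.95
        for: 10m
        labels:
          severity: warning
      # Fires on a spike in tool failures (possible tool spam or outage).
      - alert: AgentToolErrorSpike
        expr: |
          sum(rate(agent_tool_call_total{status="error"}[15m]))
            / sum(rate(agent_tool_call_total[15m])) > 0.05
        for: 5m
        labels:
          severity: critical
```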

Self-Check

Below is a short checklist of baseline agent metrics before release.

  • run_count and success_rate are tracked and segmented by release;
  • latency_p50 and latency_p95 are measured, not only averages;
  • token_usage_per_run and cost_per_run are visible on a dashboard;
  • tool_calls_per_run is monitored for excessive or cyclic calls;
  • tool_error_rate is segmented by tool;
  • stop_reason_distribution is collected;
  • metric labels contain no high-cardinality fields (run_id, request_id, user_id);
  • error_class values come from a normalized dictionary;
  • alerts exist for success-rate drops, tool spam, and token-usage anomalies.

If most of these are missing, the system will be hard to debug in production. Start with run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution.

FAQ

Q: How are metrics different from logs and tracing?
A: Metrics show trends and system state over time. Logs explain events, and tracing shows the path of a specific run.

Q: What is the minimum metric set for a first production release?
A: Start with run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution.

Q: Why is average latency not enough?
A: Averages hide long slow runs. p95 reveals user-facing degradation much faster.

Q: Which labels most often break metric storage?
A: Anything high-cardinality: run_id, request_id, user_id, full prompts, or raw args.

Next on this topic:

⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.