Agent Metrics

Agent metrics that matter in production: success rate, tool-call volume, retries, cost per run, and drift signals for early detection.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Metrics For Agents
  5. When To Use
  6. Implementation Example
  7. Common Mistakes
  8. Only averages, without p95/p99
  9. High-cardinality labels
  10. No stop_reason metrics
  11. No alerts for key anomalies
  12. Self-Check
  13. FAQ
  14. Related Pages

Idea In 30 Seconds

Agent metrics show system health across many runs, not a single case.

They answer whether the system is stable, whether costs are growing, and where degradation starts.

Without metrics, problems are usually discovered too late: after user complaints or budget overruns.

Core Problem

Logs and tracing explain one specific incident well.

But in production you need trends: what happens to latency, token usage, error rate, and tool calls across releases. Without metrics, systems can degrade gradually and stay unnoticed for a long time.

In production, this usually looks like:

  • average response time seems fine, but p95 is already growing;
  • token costs spike in waves after a release;
  • tool calls per run keep increasing;
  • the team learns about problems only after an incident.

That is why metrics are a separate observability signal: they help detect anomalies early and react before major failures.

How It Works

Metrics are aggregated numeric signals that show system behavior over time.

Most systems track three metric layers:

  • run level (run_count, success_rate, stop_reason);
  • steps and tools (tool_calls_per_run, tool_error_rate, step_count);
  • cost and speed (token_usage, latency_p50/p95, and cost_per_run, which is usually computed from token_usage in dashboards or metric queries).

Metrics provide early warning when the system starts degrading. Logs answer "what happened", and tracing answers "how exactly it happened in a specific run".
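These three layers can be illustrated with a toy aggregation over raw run records. The records and field names below are made up for the sketch; in practice, this aggregation happens inside the metrics backend.

```python
from statistics import quantiles

# Hypothetical raw run records (illustrative values).
runs = [
    {"status": "ok", "latency_ms": 820, "tool_calls": 3, "tool_errors": 0},
    {"status": "ok", "latency_ms": 1100, "tool_calls": 2, "tool_errors": 0},
    {"status": "error", "latency_ms": 4900, "tool_calls": 7, "tool_errors": 2},
    {"status": "ok", "latency_ms": 950, "tool_calls": 3, "tool_errors": 1},
]

# Layer 1: run level
run_count = len(runs)
success_rate = sum(r["status"] == "ok" for r in runs) / run_count  # 0.75

# Layer 2: steps and tools
tool_calls = sum(r["tool_calls"] for r in runs)
tool_error_rate = sum(r["tool_errors"] for r in runs) / tool_calls  # 0.2

# Layer 3: cost and speed (tail latency, not just the mean)
latency_p95 = quantiles([r["latency_ms"] for r in runs], n=20)[-1]
```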

Typical Production Metrics For Agents

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| run_count | number of runs in a period | load and traffic-volume control |
| success_rate | share of successful runs | fast stability check |
| latency_p50 / latency_p95 | typical and tail latency | performance degradation detection |
| token_usage_per_run | how many tokens each run consumes | LLM cost control |
| cost_per_run | estimated cost of one run | budget control and cost forecasting |
| tool_calls_per_run | how many tool calls each run makes | detection of excessive or cyclic calls |
| tool_error_rate | frequency of tool failures | early detection of unstable dependencies |
| stop_reason_distribution | distribution of run termination reasons | control of limits and common failures |

To make metrics useful, they are usually segmented by release, model, or tool.

Important: do not add high-cardinality fields (run_id, request_id, user_id) to metric labels, or metric storage will overload quickly.
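The risk is easy to estimate: the number of time series for a metric is roughly the product of the distinct values of its labels. A quick sketch with illustrative label counts:

```python
from math import prod

# Illustrative distinct-value counts for the labels of a run counter.
safe_labels = {"status": 2, "stop_reason": 6, "release": 10}
series_safe = prod(safe_labels.values())  # 120 time series, cheap to store

# Adding one high-cardinality label multiplies the series count.
with_run_id = dict(safe_labels, run_id=1_000_000)
series_unsafe = prod(with_run_id.values())  # 120,000,000 time series
```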

When To Use

A broad metrics set is not always necessary.

For an early prototype, basic run and error counters can be enough.

But metrics become critical when:

  • the agent system is already in production;
  • you have latency or quality SLOs;
  • token and tool-call costs must be controlled;
  • releases happen often and regressions must be seen before incidents.

Implementation Example

Below is a simplified runtime instrumentation example in Prometheus style. In real systems, the same principles work for Datadog, Grafana Cloud, CloudWatch, and other platforms.

PYTHON
import time
from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)
# success_rate is derived at query time, e.g. in PromQL:
#   sum(rate(agent_run_total{status="ok"}[5m])) / sum(rate(agent_run_total[5m]))

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000),
)

STEP_COUNT = Histogram(
    "agent_steps_per_run",
    "Number of steps per run",
    ["release"],
    buckets=(1, 2, 4, 8, 12, 16, 24, 32),
)

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

LLM_ERROR_TOTAL = Counter(
    "agent_llm_error_total",
    "Total LLM step errors by model and class",
    ["model", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool call latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)


def observe_llm_usage(model, token_usage, release):
    # most LLM providers return token usage in the response
    if not token_usage:
        return
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="prompt", release=release).inc(
        token_usage.get("prompt_tokens", 0)
    )
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="completion", release=release).inc(
        token_usage.get("completion_tokens", 0)
    )


def run_agent(agent, task, release="2026-03-21"):
    started_at = time.time()
    steps = 0
    stop_reason = "max_steps"
    run_status = "ok"

    try:
        for step in agent.iter(task):
            steps += 1
            step_type = step.type
            result = None  # may stay None for unknown step types (guarded by the check below)

            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_started_at = time.time()
                try:
                    result = step.execute()
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                except Exception as error:
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                    TOOL_ERROR_TOTAL.labels(
                        tool=tool_name,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                    run_status = "error"
                    stop_reason = "tool_error"
                    raise
            else:
                try:
                    result = step.execute()
                    observe_llm_usage(
                        model=getattr(step, "model", "unknown"),
                        token_usage=getattr(result, "token_usage", None),
                        release=release,
                    )
                except Exception as error:
                    LLM_ERROR_TOTAL.labels(
                        model=getattr(step, "model", "unknown"),
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    run_status = "error"
                    stop_reason = "step_error"
                    raise

            # tool results may not define is_final, so read it defensively
            if result is not None and getattr(result, "is_final", False):
                stop_reason = "completed"
                break

    finally:
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()
        RUN_LATENCY_MS.labels(release=release).observe((time.time() - started_at) * 1000)
        STEP_COUNT.labels(release=release).observe(steps)
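
The run_agent function above assumes a minimal agent interface: iter(task) yields steps that expose type and execute(), plus optional tool_name and model attributes. A hypothetical stub that satisfies this contract (all names below are illustrative):

```python
class FinalAnswer:
    # Mimics an LLM step result carrying token usage and a final flag.
    is_final = True
    token_usage = {"prompt_tokens": 120, "completion_tokens": 40}


class LLMStep:
    type = "llm_call"
    model = "demo-model"

    def execute(self):
        return FinalAnswer()


class StubAgent:
    def iter(self, task):
        # A real agent would interleave tool_call and llm_call steps.
        yield LLMStep()


# run_agent(StubAgent(), "summarize yesterday's incidents")
# would record one run with status="ok" and stop_reason="completed".
```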

In production, these metrics usually feed dashboards and alerts.

This is how they can look together on a real dashboard:

| Metric | Current value | Trend | Status |
| --- | --- | --- | --- |
| latency_p95 | 2.4s | +38% in 30 min | warning: above SLO |
| tool_error_rate | 7.2% | +4.1pp in 15 min | critical: alert |
| token_usage_per_run | 8.9k | +22% after release | warning: anomaly |
| success_rate | 91.4% | -5.3pp in 1 hour | warning: drop |

For error_class, it is better to map raw exception names into a small normalized dictionary of values to avoid unnecessary cardinality.

For example, one metric line can look like this:

TEXT
agent_tool_call_total{tool="search_docs",status="error",release="2026-03-21"} 47
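
The error_class normalization mentioned above can be sketched as follows; the mapping is illustrative and would be maintained by hand for your own tools.

```python
# Illustrative dictionary of normalized error classes.
KNOWN_ERROR_CLASSES = {
    "TimeoutError": "timeout",
    "ConnectionError": "connection",
    "PermissionError": "permission",
    "ValueError": "bad_input",
}


def normalize_error_class(error: Exception) -> str:
    # Collapse every unknown exception type into one bucket
    # so the label's cardinality stays bounded.
    return KNOWN_ERROR_CLASSES.get(type(error).__name__, "other")
```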

Common Mistakes

Even when metrics are added, they often fail to help because of common mistakes below.

Only averages, without p95/p99

Averages hide the long tail of slow runs. For production, the minimum baseline is p50 and p95.
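
A tiny synthetic illustration: the average of these latencies looks acceptable, while the p95 already reveals the slow tail.

```python
from statistics import mean, quantiles

# 95 fast runs and 5 very slow ones (synthetic data).
latencies_ms = [400] * 95 + [9000] * 5

avg = mean(latencies_ms)                 # 830.0 ms, looks fine
p95 = quantiles(latencies_ms, n=100)[94]  # ~8570 ms, exposes the tail
```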

High-cardinality labels

Labels like run_id or user_id sharply increase load on metric backends. It is better to segment by release, model, or tool.

No stop_reason metrics

Without stop_reason distribution, it is hard to understand why runs end with max_steps or tool_error. This often hides tool failure and early signs of budget explosion.

No alerts for key anomalies

Metrics without alerts turn into passive charts. Without alerts, it is easy to miss tool spam or a sharp success-rate drop after a release.
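
As a hedged sketch, Prometheus alerting rules over the metrics from the example above might look like this; the thresholds, windows, and rule names are illustrative, not recommendations.

```yaml
groups:
  - name: agent-alerts
    rules:
      # Fires when fewer than 95% of runs succeed over 15 minutes.
      - alert: AgentSuccessRateDrop
        expr: |
          sum(rate(agent_run_total{status="ok"}[15m]))
            / sum(rate(agent_run_total[15m])) < 0.95
        for: 10m
        labels:
          severity: warning
      # Fires on a spike in tool failures (possible tool spam or outage).
      - alert: AgentToolErrorSpike
        expr: |
          sum(rate(agent_tool_call_total{status="error"}[15m]))
            / sum(rate(agent_tool_call_total[15m])) > 0.05
        for: 5m
        labels:
          severity: critical
```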

Self-Check

Below is a short checklist of baseline agent metrics before release.

  • run_count and success_rate are tracked and segmented by release;
  • latency_p50 and latency_p95 are measured, not only averages;
  • token_usage_per_run and cost_per_run are visible on a dashboard;
  • tool_calls_per_run is monitored for excessive or cyclic calls;
  • tool_error_rate is segmented by tool;
  • stop_reason_distribution is collected;
  • metric labels contain no high-cardinality fields (run_id, request_id, user_id);
  • error_class values come from a normalized dictionary;
  • alerts exist for success-rate drops, tool spam, and token-usage anomalies.

If most of these are missing, the system will be hard to debug in production. Start with run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution.

FAQ

Q: How are metrics different from logs and tracing?
A: Metrics show trends and system state over time. Logs explain events, and tracing shows the path of a specific run.

Q: What is the minimum metric set for a first production release?
A: Start with run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution.

Q: Why is average latency not enough?
A: Averages hide long slow runs. p95 reveals user-facing degradation much faster.

Q: Which labels most often break metric storage?
A: Anything high-cardinality: run_id, request_id, user_id, full prompts, or raw args.

Next on this topic:

⏱️ 7 min read • Updated March 21, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick β€” engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

🔗 GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.