Idea In 30 Seconds
Agent metrics show system health across many runs, not a single case.
They answer: is the system stable, are costs growing, and where degradation starts.
Without metrics, problems are usually discovered too late: after user complaints or budget overruns.
Core Problem
Logs and tracing explain one specific incident well.
But in production you need trends: what happens to latency, token usage, error rate, and tool calls across releases. Without metrics, a system can degrade gradually and go unnoticed for a long time.
In production, this usually looks like:
- average response time seems fine, but p95 is already growing;
- token costs spike in waves after a release;
- tool calls per run keep increasing;
- the team learns about problems only after an incident.
That is why metrics are a separate observability signal: they help detect anomalies early and react before major failures.
How It Works
Metrics are aggregated numeric signals that show system behavior over time.
Most systems track three metric layers:
- run level (run_count, success_rate, stop_reason);
- steps and tools (tool_calls_per_run, tool_error_rate, step_count);
- cost and speed (token_usage, cost_per_run computed from token_usage in dashboards or metrics queries, latency_p50/p95).
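Since cost_per_run is derived rather than instrumented directly, it helps to see the arithmetic once. A minimal sketch, assuming hypothetical per-million-token prices and a made-up model name (real prices depend on the provider and model):

```python
# Hypothetical per-million-token prices in USD; real values depend on the provider.
PRICE_PER_1M = {
    "example-model": {"prompt": 3.00, "completion": 15.00},
}

def cost_per_run(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one run from its token usage."""
    prices = PRICE_PER_1M[model]
    return (
        prompt_tokens * prices["prompt"]
        + completion_tokens * prices["completion"]
    ) / 1_000_000

# 8,000 prompt + 1,000 completion tokens at the prices above
print(round(cost_per_run("example-model", 8_000, 1_000), 6))  # 0.039
```

On real dashboards the same formula is usually expressed as a metrics query over token_usage counters rather than application code.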
Metrics provide early warning when the system starts degrading. Logs answer "what happened", and tracing answers "how exactly it happened in a specific run".
Typical Production Metrics For Agents
| Metric | What it shows | Why it matters |
|---|---|---|
| run_count | number of runs in a period | load and traffic-volume control |
| success_rate | share of successful runs | fast stability check |
| latency_p50 / latency_p95 | typical and tail latency | performance degradation detection |
| token_usage_per_run | how many tokens each run consumes | LLM cost control |
| cost_per_run | estimated cost of one run | budget control and cost forecasting |
| tool_calls_per_run | how many tool calls each run makes | detection of excessive or cyclic calls |
| tool_error_rate | frequency of tool failures | early detection of unstable dependencies |
| stop_reason_distribution | distribution of run termination reasons | control of limits and common failures |
To make metrics useful, they are usually segmented by release, model, or tool.
Important: do not add high-cardinality fields (run_id, request_id, user_id) to metric labels, or metric storage will overload quickly.
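One way to enforce this rule in code is a small guard that only passes through a fixed allow-list of low-cardinality label keys. A sketch, with an illustrative allow-list (not a standard):

```python
# Illustrative allow-list of low-cardinality label keys.
ALLOWED_LABELS = {
    "status", "stop_reason", "release",
    "model", "tool", "token_type", "error_class",
}

def safe_labels(labels):
    """Drop label keys that would explode metric cardinality (run_id, user_id, ...)."""
    return {key: value for key, value in labels.items() if key in ALLOWED_LABELS}

labels = {"status": "ok", "release": "2026-03-21", "run_id": "a1b2c3"}
print(safe_labels(labels))  # run_id is dropped
```

High-cardinality identifiers like run_id still belong in logs and traces, where per-event storage is expected.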
When To Use
A broad set of metrics is not always necessary.
For an early prototype, basic run and error counters can be enough.
But metrics become critical when:
- the agent system is already in production;
- you have latency or quality SLOs;
- token and tool-call costs must be controlled;
- releases happen often and regressions must be seen before incidents.
Implementation Example
Below is a simplified runtime instrumentation example in Prometheus style. In real systems, the same principles work for Datadog, Grafana Cloud, CloudWatch, and other platforms.
```python
import time

from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)
# success_rate = RUN_TOTAL{status="ok"} / RUN_TOTAL

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000),
)

STEP_COUNT = Histogram(
    "agent_steps_per_run",
    "Number of steps per run",
    ["release"],
    buckets=(1, 2, 4, 8, 12, 16, 24, 32),
)

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

LLM_ERROR_TOTAL = Counter(
    "agent_llm_error_total",
    "Total LLM step errors by model and class",
    ["model", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool call latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)


def observe_llm_usage(model, token_usage, release):
    # most LLM providers return token usage in the response
    if not token_usage:
        return
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="prompt", release=release).inc(
        token_usage.get("prompt_tokens", 0)
    )
    TOKEN_USAGE_TOTAL.labels(model=model, token_type="completion", release=release).inc(
        token_usage.get("completion_tokens", 0)
    )


def run_agent(agent, task, release="2026-03-21"):
    started_at = time.time()
    steps = 0
    stop_reason = "max_steps"
    run_status = "ok"
    try:
        for step in agent.iter(task):
            steps += 1
            step_type = step.type
            result = None  # may stay None for unknown step types (guarded below)
            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_started_at = time.time()
                try:
                    result = step.execute()
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                except Exception as error:
                    TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                    TOOL_ERROR_TOTAL.labels(
                        tool=tool_name,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                        (time.time() - tool_started_at) * 1000
                    )
                    run_status = "error"
                    stop_reason = "tool_error"
                    raise
            else:
                try:
                    result = step.execute()
                    observe_llm_usage(
                        model=getattr(step, "model", "unknown"),
                        token_usage=getattr(result, "token_usage", None),
                        release=release,
                    )
                except Exception as error:
                    LLM_ERROR_TOTAL.labels(
                        model=getattr(step, "model", "unknown"),
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    run_status = "error"
                    stop_reason = "step_error"
                    raise
            # tool results may not define is_final, so probe defensively
            if result is not None and getattr(result, "is_final", False):
                stop_reason = "completed"
                break
    finally:
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()
        RUN_LATENCY_MS.labels(release=release).observe((time.time() - started_at) * 1000)
        STEP_COUNT.labels(release=release).observe(steps)
```
In production, these metrics usually feed dashboards and alerts.
This is how they can look together on a real dashboard:
| Metric | Current value | Trend | Status |
|---|---|---|---|
| latency_p95 | 2.4s | +38% in 30 min | warning: above SLO |
| tool_error_rate | 7.2% | +4.1pp in 15 min | critical: alert |
| token_usage_per_run | 8.9k | +22% after release | warning: anomaly |
| success_rate | 91.4% | -5.3pp in 1 hour | warning: drop |
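Ratios like tool_error_rate are not exported directly; they are computed from counters at query time. A rough Python equivalent of that query, over made-up counter snapshots:

```python
def tool_error_rate(error_total, call_total):
    """tool_error_rate = errors / all calls; guard against an empty window."""
    return error_total / call_total if call_total else 0.0

# Made-up counter snapshots over the last 15 minutes.
calls, errors = 1250, 90
print(f"{tool_error_rate(errors, calls):.1%}")  # 7.2%
```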
For error_class, it is better to use a normalized value dictionary to avoid unnecessary cardinality.
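A sketch of such normalization, collapsing raw exception types into a small fixed set; the mapping below is an illustrative assumption, not a standard:

```python
# Illustrative mapping from raw exception names to a bounded dictionary.
ERROR_CLASS_MAP = {
    "TimeoutError": "timeout",
    "ConnectionError": "network",
    "HTTPError": "upstream",
    "ValueError": "bad_input",
}

def normalize_error_class(error):
    """Return a bounded error_class label for any exception."""
    return ERROR_CLASS_MAP.get(type(error).__name__, "other")

print(normalize_error_class(TimeoutError()))  # timeout
print(normalize_error_class(RuntimeError()))  # other
```

The unknown-exception fallback ("other" here) is what keeps the label set bounded no matter what tools throw.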
For example, one metric line can look like this:
```
agent_tool_call_total{tool="search_docs",status="error",release="2026-03-21"} 47
```
Common Mistakes
Even when metrics are added, they often fail to help because of the common mistakes below.
Only averages, without p95/p99
Averages hide the long tail of slow runs.
For production, the minimum baseline is p50 and p95.
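A toy illustration with made-up latencies: the mean looks healthy while p95 exposes the slow tail.

```python
import statistics

# 95 fast runs and 5 very slow ones, in milliseconds (made-up numbers).
latencies = [200] * 95 + [9000] * 5

mean = statistics.fmean(latencies)
p95 = statistics.quantiles(latencies, n=100)[94]  # 95th percentile cut point

print(f"mean={mean:.0f}ms p95={p95:.0f}ms")
```

Here the mean stays in the hundreds of milliseconds while p95 lands in the multi-second range, which is exactly the signal averages hide.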
High-cardinality labels
Labels like run_id or user_id sharply increase load on metric backends.
It is better to segment by release, model, or tool.
No stop_reason metrics
Without stop_reason distribution, it is hard to understand why runs end with max_steps or tool_error.
This often hides tool failures and early signs of budget explosion.
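The distribution itself is cheap to derive from a RUN_TOTAL-style counter broken down by stop_reason. A sketch over made-up run terminations:

```python
from collections import Counter

# Made-up run terminations from the last hour.
stop_reasons = ["completed"] * 70 + ["max_steps"] * 20 + ["tool_error"] * 10

counts = Counter(stop_reasons)
total = sum(counts.values())
distribution = {reason: count / total for reason, count in counts.items()}
print(distribution)  # {'completed': 0.7, 'max_steps': 0.2, 'tool_error': 0.1}
```

A growing max_steps or tool_error share in this distribution is usually the first visible sign of looping agents or failing tools.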
No alerts for key anomalies
Metrics without alerts turn into passive charts. Without alerts, it is easy to miss tool spam or a sharp success-rate drop after a release.
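In Prometheus-style stacks this logic lives in alert rules evaluated over the metrics; the same idea expressed in plain Python, with an illustrative threshold:

```python
def success_rate_alert(ok_runs, total_runs, previous_rate, max_drop_pp=3.0):
    """Fire when success_rate drops by more than max_drop_pp percentage points."""
    if total_runs == 0:
        return False
    current = 100.0 * ok_runs / total_runs
    return previous_rate - current > max_drop_pp

# Made-up snapshots: 96.7% an hour ago, 91.4% now, a 5.3pp drop fires the alert.
print(success_rate_alert(ok_runs=914, total_runs=1000, previous_rate=96.7))  # True
```

The threshold (3 percentage points here) is an assumption; in practice it should come from the team's SLO.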
Self-Check
Before release, run a short self-check against the baseline metrics above: run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution. If none of them are in place, baseline observability is missing: the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.
FAQ
Q: How are metrics different from logs and tracing?
A: Metrics show trends and system state over time. Logs explain events, and tracing shows the path of a specific run.
Q: What is the minimum metric set for a first production release?
A: Start with run_count, success_rate, latency_p95, tool_error_rate, token_usage_per_run, and stop_reason_distribution.
Q: Why is average latency not enough?
A: Averages hide long slow runs. p95 reveals user-facing degradation much faster.
Q: Which labels most often break metric storage?
A: Anything high-cardinality: run_id, request_id, user_id, full prompts, or raw args.
Related Pages
Next on this topic:
- Observability for AI Agents – overall model of tracing, logging, and metrics.
- Agent Logging – which events to capture in runtime.
- Agent Tracing – how to see one run path step by step.
- Semantic Logging for Agents – how to standardize events for analytics.
- AI Agent Cost Monitoring – how to control production cost.