Idea In 30 Seconds
Tool usage metrics show not just whether an agent works, but how it uses tools and where the tool layer breaks.
They help you see which tools are called most often, where latency grows, and where repeated or failed calls begin.
Without these metrics, it is hard to catch tool-layer overload and cost growth in time.
Core Problem
General run metrics do not show what exactly happens at tool level.
Two runs can have similar overall latency, but in one case the issue is a slow search, and in another it is repeated fetch calls.
Without tool metrics, this is hard to see before an incident.
Next, we break down how to read these signals and find problems.
In production, this usually looks like:
- one tool quietly becomes a hot spot;
- retries grow, but the reason is unclear;
- some runs spend too many steps specifically on tools;
- the team sees the issue only when error rate or budget spikes.
That is why the tool layer should be monitored separately, not only through overall run metrics.
How It Works
Tool usage metrics are built around tool_call and tool_result events.
Tool metrics are split into:
- infra metrics (tool_latency_p95, tool_error_rate);
- behavior metrics (repeated_tool_calls, tool_calls_per_run, unique_tools_per_run).
These metrics answer "how the tool layer behaves over time". Logs and tracing are still needed to explain one specific problematic run.
Retries usually happen at runtime level, not code level: the agent receives a tool error as observation and tries again. Retries are not just repeated calls, but a signal that the agent is trying to adapt to tool failures.
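The retry-as-observation pattern above can be sketched in a few lines. This is a minimal illustration, not a specific framework's API; `call_tool` and the shape of the observation dict are assumptions made for this example.

```python
def call_tool(tool, args, max_attempts=3):
    """Run a tool, feeding errors back as observations instead of crashing.

    Hypothetical runtime-level retry loop: each failure becomes an
    observation the agent can react to (retry, change args, switch tools).
    """
    observations = []
    for attempt in range(1, max_attempts + 1):
        try:
            return observations, tool(**args)
        except Exception as error:
            # The error is recorded as an observation, not raised:
            # this is what makes retries visible as repeated calls.
            observations.append(
                {"attempt": attempt, "error_class": type(error).__name__}
            )
    return observations, None
```

A tool that fails twice and then succeeds would produce two error observations followed by a normal result, which is exactly the pattern that repeated_tool_calls and tool_error_rate surface in aggregate.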
Typical Production Tool Metrics
| Metric | What it shows | Why it matters |
|---|---|---|
| tool_calls_total | total number of tool calls | load control for the tool layer |
| tool_calls_per_run | how many tool calls happen in one run | detection of excessive or cyclic calls |
| unique_tools_per_run | how many distinct tools a run uses | workflow complexity assessment |
| tool_error_rate | share of failed tool calls | early detection of unstable tools |
| tool_latency_p50 / p95 | typical and tail latency for tools | localization of slow dependencies |
| repeated_tool_calls | calls that repeat the same tool with the same args | detection of tool spam |
| tool_cost_per_run | estimated tool cost within one run | budget control and expensive-tool detection |
To make metrics practical, they are usually segmented by tool, release, and, when needed, model.
Important: do not add high-cardinality fields (run_id, request_id, args_hash) to labels, or metric storage will overload quickly.
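One way to keep this discipline is to split telemetry at the point of emission: bounded fields go to metric labels, unbounded fields go to a structured log event. The sketch below is illustrative; the function name and event shape are assumptions, not a standard.

```python
import hashlib
import json

def split_telemetry(tool, release, run_id, args):
    """Split one tool call into metric labels and a structured log event.

    Labels hold only bounded value sets (tool, release); unbounded
    identifiers (run_id, args hash) go to the log event instead.
    """
    labels = {"tool": tool, "release": release}
    log_event = {
        "event": "tool_call",
        "tool": tool,
        "release": release,
        "run_id": run_id,  # unbounded: logs, never labels
        "args_hash": hashlib.sha256(
            json.dumps(args, sort_keys=True).encode("utf-8")
        ).hexdigest(),
    }
    return labels, json.dumps(log_event, sort_keys=True)
```

The metric backend then stays at a few series per tool per release, while per-run detail remains queryable in logs.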
How To Read The Tool Layer
What is called -> how the agent behaves -> what changes over time. These are three levels you should always read together.
Always focus on time trends and release-to-release differences, not one-off values.
Now look at signal combinations:
- tool_error_rate↑ + repeated_tool_calls↑ -> the tool is unstable, the agent retries;
- tool_latency_p95↑ + tool_cost_per_run↑ -> degradation in an expensive tool;
- tool_calls_per_run↑ + unique_tools_per_run↑ -> excessive workflow complexity.
When To Use
A full set of tool metrics is not always required.
For a simple agent with 1-2 tools, sometimes tool_calls_total and tool_error_rate are enough.
But detailed tool usage metrics become critical when:
- the agent heavily uses external APIs or DBs;
- retries happen often;
- tool costs must be controlled;
- you need to detect tool spam before users are impacted.
Implementation Example
Below is a simplified Prometheus-style instrumentation example for tool usage metrics. The example covers baseline control: call volume, latency, error classes, repeats, and run-level tool load.
```python
import hashlib
import json
import time

from prometheus_client import Counter, Histogram

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_CALLS_PER_RUN = Histogram(
    "agent_tool_calls_per_run",
    "Number of tool calls per run",
    ["release"],
    buckets=(0, 1, 2, 4, 8, 12, 16, 24, 32),
)

UNIQUE_TOOLS_PER_RUN = Histogram(
    "agent_unique_tools_per_run",
    "Number of unique tools used in a run",
    ["release"],
    buckets=(0, 1, 2, 3, 4, 6, 8, 12),
)

REPEATED_TOOL_CALL_TOTAL = Counter(
    "agent_repeated_tool_call_total",
    "Repeated tool calls with the same tool+args signature",
    ["tool", "release"],
)

TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Estimated total tool cost in USD",
    ["tool", "release"],
)

STEP_ERROR_TOTAL = Counter(
    "agent_step_error_total",
    "Total non-tool step errors by type and class",
    ["step_type", "error_class", "release"],
)


def stable_hash(value):
    # default=str keeps non-JSON types hashable at baseline;
    # critical systems should serialize them explicitly (for example as ISO 8601)
    payload = json.dumps(
        value, sort_keys=True, ensure_ascii=False, default=str
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_agent(agent, task, release="2026-03-21"):
    tool_calls = 0
    unique_tools = set()
    seen_signatures = set()
    try:
        for step in agent.iter(task):
            step_type = step.type
            result = None

            if step_type != "tool_call":
                try:
                    result = step.execute()
                except Exception as error:
                    STEP_ERROR_TOTAL.labels(
                        step_type=step_type,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    raise
                if result and result.is_final:
                    break
                continue

            tool_name = getattr(step, "tool_name", "unknown")
            args = getattr(step, "args", {})
            tool_calls += 1
            unique_tools.add(tool_name)

            signature = (tool_name, stable_hash(args))
            if signature in seen_signatures:
                REPEATED_TOOL_CALL_TOTAL.labels(tool=tool_name, release=release).inc()
            else:
                seen_signatures.add(signature)

            started_at = time.time()
            try:
                result = step.execute()
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                cost_usd = getattr(result, "cost_usd", None)
                if cost_usd:
                    TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(cost_usd)
            except Exception as error:
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                TOOL_ERROR_TOTAL.labels(
                    tool=tool_name,
                    error_class=type(error).__name__,
                    release=release,
                ).inc()
                # This example re-raises; in real agents the error is often
                # passed back to the LLM as an observation for retry.
                raise
            finally:
                TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                    (time.time() - started_at) * 1000
                )

            if result and result.is_final:
                break
    finally:
        TOOL_CALLS_PER_RUN.labels(release=release).observe(tool_calls)
        UNIQUE_TOOLS_PER_RUN.labels(release=release).observe(len(unique_tools))
        # tool_cost_per_run is usually computed at dashboard level:
        # sum(agent_tool_cost_usd_total) / run_count
```
Here is how these metrics can look together on a real dashboard:
| Tool | calls/min | error_rate | p95 latency | Status |
|---|---|---|---|---|
| search_docs | 320 | 6.8% | 1.9s | critical: alert |
| fetch_url | 180 | 1.4% | 680ms | warning: p95 growing |
| db_lookup | 95 | 0.3% | 120ms | ok |
For error_class, use a normalized value dictionary to avoid unnecessary cardinality.
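A normalized error_class dictionary can be as simple as a fixed mapping from raw exception names to a small vocabulary, with everything unknown collapsed into one bucket. The mapping below is illustrative, not a standard.

```python
# Illustrative mapping: raw exception class names -> a small fixed
# vocabulary, so the error_class label stays at bounded cardinality.
ERROR_CLASS_MAP = {
    "TimeoutError": "timeout",
    "ConnectionError": "network",
    "ConnectionResetError": "network",
    "PermissionError": "auth",
    "ValueError": "bad_input",
}

def normalize_error_class(error):
    """Map any exception to one of a few known error classes."""
    # Unknown exception types collapse into "other" instead of
    # creating a new label value per exception class.
    return ERROR_CLASS_MAP.get(type(error).__name__, "other")
```

The key property is the fallback: no matter what a tool raises, the label set cannot grow beyond the dictionary plus "other".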
Investigation
When an alert fires:
- find the anomalous tool in metrics;
- inspect concrete runs in tracing;
- check arguments and responses in logs;
- find root cause (tool, agent logic, or external API).
Common Mistakes
Even when tool metrics are already in place, they often fail because of the common mistakes below.
Total calls exist, but no per-tool breakdown
tool_calls_total without per-tool split is almost useless in incidents.
In this case it is hard to quickly find the source of tool failure.
Repeated calls are not tracked
Without repeated_tool_calls, it is hard to see that the agent calls the same tool with the same args.
This often hides the early phase of tool spam.
No p95 latency by tool
The system can look stable while some users already wait 5+ seconds.
For tool layer, minimum baseline is p50 and p95.
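In production, p50 and p95 come from the bucketed histogram (for example via Prometheus `histogram_quantile`), but the underlying idea is just percentiles over latency samples. A minimal sketch over raw samples:

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) from raw latency samples in milliseconds.

    Sketch only: production systems aggregate via histogram buckets
    rather than keeping raw samples.
    """
    # quantiles(..., n=100) returns the 99 cut points p1..p99
    q = statistics.quantiles(samples_ms, n=100)
    return q[49], q[94]
```

The gap between the two values is the signal: a flat p50 with a growing p95 means most users are fine while the tail already waits too long.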
High-cardinality labels
Adding run_id, request_id, or args_hash to labels quickly overloads metric backends.
Keep these in logs, not in labels.
No tool-layer alerts
Without alerts, metrics remain passive telemetry. This makes it easy to miss early signals of budget explosion caused by excessive external API calls.
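In real deployments this check lives in Prometheus alerting rules evaluated against the counters above; the sketch below only illustrates the shape of such a rule in plain Python, with an assumed per-window stats structure and an illustrative threshold.

```python
def check_tool_alerts(window_stats, error_rate_threshold=0.05):
    """Fire an alert for each tool whose windowed error rate is too high.

    window_stats: {tool: {"calls": int, "errors": int}} for one time window.
    The 5% threshold is illustrative, not a recommendation.
    """
    alerts = []
    for tool, stats in sorted(window_stats.items()):
        calls = stats["calls"]
        if calls == 0:
            continue  # no traffic, nothing to judge
        rate = stats["errors"] / calls
        if rate > error_rate_threshold:
            alerts.append({"tool": tool, "error_rate": round(rate, 3)})
    return alerts
```

The same pattern extends to p95 latency and repeated-call thresholds; the point is that each signal has an owner-facing trigger rather than living only on a dashboard.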
Self-Check
Below is a short checklist of baseline tool usage metrics before release:
- tool_calls_total with a per-tool breakdown;
- tool_calls_per_run;
- unique_tools_per_run;
- tool_error_rate;
- tool_latency_p50 / p95;
- repeated_tool_calls;
- tool_cost_per_run;
- labels segmented by tool and release, with no high-cardinality fields;
- tool-layer alerts configured.
If most of these are missing, baseline observability is missing and the system will be hard to debug in production. Start with run_id, structured logs, and tracing tool calls.
FAQ
Q: How are tool usage metrics different from general agent metrics?
A: General metrics show overall system state. Tool usage metrics show what is happening specifically in the tool layer.
Q: What is the minimum tool-metric set to start with?
A: Start with tool_calls_total, tool_error_rate, tool_latency_p95, and tool_calls_per_run.
Q: Should args_hash be added to labels?
A: No. It almost always creates high cardinality. For this kind of data, use structured logs.
Q: How do you separate a one-off failure from a system-level tool-layer issue?
A: Check whether the issue repeats for a specific tool across multiple runs and releases. If the same signals (error_class, latency, repeated_tool_calls) repeat, it is systemic.
Related Pages
Next on this topic:
- Agent Metrics β overall metrics model for agent systems.
- Agent Logging β events needed for incident analysis.
- Agent Tracing β one run path step by step.
- Semantic Logging for Agents β stable event vocabulary for analytics.
- AI Agent Cost Monitoring β cost control in production.