Tool Usage Metrics

Tool usage metrics for agents: call frequency, error rates, retries, and hotspots to optimize reliability, cost, and policy enforcement.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Tool Metrics
  5. How To Read The Tool Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
     • Total calls exist, but no per-tool breakdown
     • Repeated calls are not tracked
     • No p95 latency by tool
     • High-cardinality labels
     • No tool-layer alerts
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Tool usage metrics show not just whether an agent works, but how it uses tools and where the tool layer breaks.

They help you see which tools are called most often, where latency grows, and where repeated or failed calls begin.

Without these metrics, it is hard to catch tool-layer overload and cost growth in time.

Core Problem

General run metrics do not show what is actually happening at the tool level.

Two runs can have similar overall latency, but in one case the issue is a slow search, and in another it is repeated fetch calls. Without tool metrics, this is hard to see before an incident.

Next, we break down how to read these signals and find problems.

In production, this usually looks like:

  • one tool quietly becomes a hot spot;
  • retries grow, but the reason is unclear;
  • some runs spend too many steps specifically on tools;
  • the team sees the issue only when the error rate or the budget spikes.

That is why the tool layer should be monitored separately, not only through overall run metrics.

How It Works

Tool usage metrics are built around tool_call and tool_result events.

Tool metrics are split into:

  • infra metrics (tool_latency_p95, tool_error_rate);
  • behavior metrics (repeated_tool_calls, tool_calls_per_run, unique_tools_per_run).

These metrics answer the question of how the tool layer behaves over time. Logs and tracing are still needed to explain a single problematic run.

Retries usually happen at the runtime level, not in code: the agent receives the tool error as an observation and tries again. Retries are not just repeated calls; they are a signal that the agent is trying to adapt to tool failures.
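A minimal sketch of this runtime-level loop, assuming a hypothetical flaky_search tool and a plain observation list (neither comes from a specific framework):

```python
# Hedged sketch: runtime-level retries, where a tool error becomes an
# observation for the agent instead of crashing the run.
def flaky_search(query, _state={"calls": 0}):
    # Illustrative flaky tool: fails twice, then succeeds.
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise TimeoutError("search backend timed out")
    return f"results for {query!r}"

def run_with_observed_errors(tool, args, max_steps=5):
    observations = []
    for _ in range(max_steps):
        try:
            result = tool(**args)
            observations.append(f"tool_result: {result}")
            return result, observations
        except Exception as error:
            # The failure is recorded as an observation; in a real agent the
            # LLM sees it and may change arguments or switch tools. Each
            # repeat also shows up in repeated_tool_calls metrics.
            observations.append(f"tool_error: {type(error).__name__}: {error}")
    return None, observations

result, obs = run_with_observed_errors(flaky_search, {"query": "pricing"})
# result holds the third, successful call; obs starts with two tool_error lines
```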

Typical Production Tool Metrics

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| tool_calls_total | total number of tool calls | load control for the tool layer |
| tool_calls_per_run | how many tool calls happen in one run | detection of excessive or cyclic calls |
| unique_tools_per_run | how many distinct tools a run uses | workflow complexity assessment |
| tool_error_rate | share of failed tool calls | early detection of unstable tools |
| tool_latency_p50 / p95 | typical and tail latency for tools | localization of slow dependencies |
| repeated_tool_calls | calls that repeat the same tool with the same args | detection of tool spam |
| tool_cost_per_run | estimated tool cost within one run | budget control and expensive-tool detection |
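For intuition, the behavior metrics in the table can be derived from raw tool_call events. The event shape below (dicts with run_id, tool, status) is an illustrative assumption, not a standard format:

```python
# Hedged sketch: deriving a few of the table's metrics from raw tool_call events.
from collections import Counter, defaultdict

events = [
    {"run_id": "r1", "tool": "search_docs", "status": "ok"},
    {"run_id": "r1", "tool": "fetch_url", "status": "error"},
    {"run_id": "r2", "tool": "search_docs", "status": "ok"},
]

tool_calls_total = len(events)
tool_error_rate = sum(e["status"] == "error" for e in events) / tool_calls_total

# Calls per run: how many tool calls each run made.
tool_calls_per_run = dict(Counter(e["run_id"] for e in events))

# Unique tools per run: how many distinct tools each run touched.
unique_tools = defaultdict(set)
for e in events:
    unique_tools[e["run_id"]].add(e["tool"])
unique_tools_per_run = {run: len(tools) for run, tools in unique_tools.items()}
```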

To make metrics practical, they are usually segmented by tool, release, and, when needed, model.

Important: do not add high-cardinality fields (run_id, request_id, args_hash) as labels, or metric storage will quickly be overwhelmed.
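One way to enforce this split, sketched with assumed field names: only low-cardinality fields become metric labels, while high-cardinality identifiers stay in the structured log line:

```python
# Hedged sketch: route event fields either to metric labels (low cardinality)
# or to a structured log payload (high cardinality). Field names are
# illustrative assumptions.
def split_tool_event(event):
    labels = {
        "tool": event["tool"],
        "status": event["status"],
        "release": event["release"],
    }
    # run_id, request_id, args_hash belong in logs, never in labels.
    log_fields = {
        k: v for k, v in event.items()
        if k in ("run_id", "request_id", "args_hash")
    }
    return labels, log_fields

labels, log_fields = split_tool_event({
    "tool": "search_docs", "status": "ok", "release": "2026-03-21",
    "run_id": "r-123", "request_id": "q-9", "args_hash": "ab12",
})
```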

How To Read The Tool Layer

Read the tool layer on three levels, always together: what is called -> how the agent behaves -> what changes over time.

Always focus on time trends and release-to-release differences, not one-off values.

Now look at signal combinations:

  • tool_error_rate ↑ + repeated_tool_calls ↑ -> tool is unstable, agent retries
  • tool_latency_p95 ↑ + tool_cost_per_run ↑ -> degradation in an expensive tool
  • tool_calls_per_run ↑ + unique_tools_per_run ↑ -> excessive workflow complexity
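These combinations can be turned into an automatic triage hint. The 20% growth threshold and the deltas format (relative release-to-release change per metric) are illustrative assumptions:

```python
# Hedged sketch: map co-occurring metric growth to the readings above.
def triage(deltas, growth=0.2):
    # Metrics whose relative change exceeds the growth threshold.
    up = {name for name, change in deltas.items() if change > growth}
    if {"tool_error_rate", "repeated_tool_calls"} <= up:
        return "unstable tool: agent retries"
    if {"tool_latency_p95", "tool_cost_per_run"} <= up:
        return "degradation in an expensive tool"
    if {"tool_calls_per_run", "unique_tools_per_run"} <= up:
        return "excessive workflow complexity"
    return "no known combination"

triage({"tool_error_rate": 0.5, "repeated_tool_calls": 0.4})
# -> "unstable tool: agent retries"
```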

When To Use

A full set of tool metrics is not always required.

For a simple agent with 1-2 tools, sometimes tool_calls_total and tool_error_rate are enough.

But detailed tool usage metrics become critical when:

  • the agent heavily uses external APIs or DBs;
  • retries happen often;
  • tool costs must be controlled;
  • you need to detect tool spam before users are impacted.

Implementation Example

Below is a simplified Prometheus-style instrumentation example for tool usage metrics. The example covers baseline control: call volume, latency, error classes, repeats, and run-level tool load.

PYTHON
import hashlib
import json
import time
from prometheus_client import Counter, Histogram

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_CALLS_PER_RUN = Histogram(
    "agent_tool_calls_per_run",
    "Number of tool calls per run",
    ["release"],
    buckets=(0, 1, 2, 4, 8, 12, 16, 24, 32),
)

UNIQUE_TOOLS_PER_RUN = Histogram(
    "agent_unique_tools_per_run",
    "Number of unique tools used in run",
    ["release"],
    buckets=(0, 1, 2, 3, 4, 6, 8, 12),
)

REPEATED_TOOL_CALL_TOTAL = Counter(
    "agent_repeated_tool_call_total",
    "Repeated tool calls with same tool+args signature",
    ["tool", "release"],
)

TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Estimated total tool cost in USD",
    ["tool", "release"],
)

STEP_ERROR_TOTAL = Counter(
    "agent_step_error_total",
    "Total non-tool step errors by type and class",
    ["step_type", "error_class", "release"],
)


def stable_hash(value):
    # default=str gives baseline compatibility;
    # in critical systems explicit serialization is better (for example ISO 8601)
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_agent(agent, task, release="2026-03-21"):
    tool_calls = 0
    unique_tools = set()
    seen_signatures = set()

    try:
        for step in agent.iter(task):
            step_type = step.type
            result = None

            if step_type != "tool_call":
                try:
                    result = step.execute()
                except Exception as error:
                    STEP_ERROR_TOTAL.labels(
                        step_type=step_type,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    raise

                if result and result.is_final:
                    break
                continue

            tool_name = getattr(step, "tool_name", "unknown")
            args = getattr(step, "args", {})

            tool_calls += 1
            unique_tools.add(tool_name)

            signature = (tool_name, stable_hash(args))
            if signature in seen_signatures:
                REPEATED_TOOL_CALL_TOTAL.labels(tool=tool_name, release=release).inc()
            else:
                seen_signatures.add(signature)

            started_at = time.perf_counter()  # monotonic clock for latency measurement
            try:
                result = step.execute()
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                cost_usd = getattr(result, "cost_usd", None)
                if cost_usd:
                    TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(cost_usd)
            except Exception as error:
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                TOOL_ERROR_TOTAL.labels(
                    tool=tool_name,
                    error_class=type(error).__name__,
                    release=release,
                ).inc()
                # This example raises.
                # In real agents, the error is often passed to the LLM as observation for retry.
                raise
            finally:
                TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                    (time.perf_counter() - started_at) * 1000
                )

            if result and result.is_final:
                break
    finally:
        TOOL_CALLS_PER_RUN.labels(release=release).observe(tool_calls)
        UNIQUE_TOOLS_PER_RUN.labels(release=release).observe(len(unique_tools))

# tool_cost_per_run is usually computed on dashboard level:
# sum(agent_tool_cost_usd_total) / run_count

Here is how these metrics can look together on a real dashboard:

| Tool | calls/min | error_rate | p95 latency | Status |
| --- | --- | --- | --- | --- |
| search_docs | 320 | 6.8% | 1.9s | critical: alert |
| fetch_url | 180 | 1.4% | 680ms | warning: p95 growing |
| db_lookup | 95 | 0.3% | 120ms | ok |

For error_class, use a normalized value dictionary to avoid unnecessary cardinality.
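A minimal sketch of such normalization, with an assumed mapping (extend it to the exceptions your tools actually raise):

```python
# Hedged sketch: collapse raw exception names into a small fixed dictionary
# so the error_class label stays low-cardinality.
KNOWN_ERROR_CLASSES = {
    "TimeoutError": "timeout",
    "ConnectionError": "network",
    "HTTPError": "http",
    "ValueError": "bad_args",
}

def normalize_error_class(error):
    # Anything unknown falls into a single "other" bucket.
    return KNOWN_ERROR_CLASSES.get(type(error).__name__, "other")

normalize_error_class(TimeoutError("upstream timed out"))  # -> "timeout"
```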

Investigation

When an alert fires:

  1. find the anomalous tool in metrics;
  2. inspect concrete runs in tracing;
  3. check arguments and responses in logs;
  4. find root cause (tool, agent logic, or external API).

Common Mistakes

Even when tool metrics are already in place, they often fail because of the common mistakes below.

Total calls exist, but no per-tool breakdown

tool_calls_total without a per-tool split is almost useless during incidents: it is hard to quickly find which tool is failing.

Repeated calls are not tracked

Without repeated_tool_calls, it is hard to see that the agent calls the same tool with the same args. This often hides the early phase of tool spam.

No p95 latency by tool

The system can look stable while some users already wait 5+ seconds. For the tool layer, the minimum baseline is p50 and p95.

High-cardinality labels

Adding run_id, request_id, or args_hash to labels quickly overloads metric backends. Keep these in logs, not in labels.

No tool-layer alerts

Without alerts, metrics remain passive telemetry. This makes it easy to miss early signals of budget explosion caused by excessive external API calls.
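As a sketch, alert rules against the metric names from the example above could look like this in Prometheus; the thresholds (5% error rate, 2 s p95) and durations are illustrative assumptions, not recommendations:

```yaml
# Hedged sketch of Prometheus alerting rules for the tool layer.
groups:
  - name: agent-tool-layer
    rules:
      - alert: ToolErrorRateHigh
        expr: |
          sum by (tool) (rate(agent_tool_call_total{status="error"}[5m]))
            / sum by (tool) (rate(agent_tool_call_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} error rate above 5%"
      - alert: ToolLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (tool, le) (rate(agent_tool_latency_ms_bucket[5m]))) > 2000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} p95 latency above 2s"
```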

Self-Check

Before release, run through a short checklist of baseline tool usage metrics.

If baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: How are tool usage metrics different from general agent metrics?
A: General metrics show overall system state. Tool usage metrics show what is happening specifically in the tool layer.

Q: What is the minimum tool-metric set to start with?
A: Start with tool_calls_total, tool_error_rate, tool_latency_p95, and tool_calls_per_run.

Q: Should args_hash be added to labels?
A: No. It almost always creates high cardinality. For this kind of data, use structured logs.

Q: How do you separate a one-off failure from a system-level tool-layer issue?
A: Check whether the issue repeats for a specific tool across multiple runs and releases. If the same signals (error_class, latency, repeated_tool_calls) repeat, it is systemic.

Next on this topic:

⏱️ 8 min read • Updated March 22, 2026 • Difficulty: ★★★
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.