Tool Usage Metrics

Tool usage metrics for agents: call frequency, error rates, retries, and hotspots to optimize reliability, cost, and policy enforcement.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Tool Metrics
  5. How To Read The Tool Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
     • Total calls exist, but no per-tool breakdown
     • Repeated calls are not tracked
     • No p95 latency by tool
     • High-cardinality labels
     • No tool-layer alerts
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Tool usage metrics show not just whether an agent works, but how it uses tools and where the tool layer breaks.

They help you see which tools are called most often, where latency grows, and where repeated or failed calls begin.

Without these metrics, it is hard to catch tool-layer overload and cost growth in time.

Core Problem

General run metrics do not show what is actually happening at the tool level.

Two runs can have similar overall latency, but in one case the issue is a slow search, and in another it is repeated fetch calls. Without tool metrics, this is hard to see before an incident.

Next, we break down how to read these signals and find problems.

In production, this usually looks like:

  • one tool quietly becomes a hot spot;
  • retries grow, but the reason is unclear;
  • some runs spend too many steps specifically on tools;
  • the team sees the issue only when the error rate or the budget spikes.

That is why the tool layer should be monitored separately, not only through overall run metrics.

How It Works

Tool usage metrics are built around tool_call and tool_result events.

Tool metrics are split into:

  • infra metrics (tool_latency_p95, tool_error_rate);
  • behavior metrics (repeated_tool_calls, tool_calls_per_run, unique_tools_per_run).

These metrics answer the question of how the tool layer behaves over time. Logs and tracing are still needed to explain a single problematic run.

Retries usually happen at the runtime level, not in code: the agent receives the tool error as an observation and tries again. Retries are not just repeated calls; they are a signal that the agent is trying to adapt to tool failures.
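A minimal sketch of this runtime-level loop, assuming a hypothetical flaky_search tool and a plain observation list (neither comes from a specific framework):

```python
# Hedged sketch: runtime-level retries, where a tool error becomes an
# observation for the agent instead of crashing the run.
def flaky_search(query, _state={"calls": 0}):
    # Illustrative flaky tool: fails twice, then succeeds.
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise TimeoutError("search backend timed out")
    return f"results for {query!r}"

def run_with_observed_errors(tool, args, max_steps=5):
    observations = []
    for _ in range(max_steps):
        try:
            result = tool(**args)
            observations.append(f"tool_result: {result}")
            return result, observations
        except Exception as error:
            # The failure is recorded as an observation; in a real agent the
            # LLM sees it and may change arguments or switch tools. Each
            # repeat also shows up in repeated_tool_calls metrics.
            observations.append(f"tool_error: {type(error).__name__}: {error}")
    return None, observations

result, obs = run_with_observed_errors(flaky_search, {"query": "pricing"})
# result holds the third, successful call; obs starts with two tool_error lines
```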

Typical Production Tool Metrics

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| tool_calls_total | total number of tool calls | load control for the tool layer |
| tool_calls_per_run | how many tool calls happen in one run | detection of excessive or cyclic calls |
| unique_tools_per_run | how many distinct tools a run uses | workflow complexity assessment |
| tool_error_rate | share of failed tool calls | early detection of unstable tools |
| tool_latency_p50 / p95 | typical and tail latency for tools | localization of slow dependencies |
| repeated_tool_calls | calls that repeat the same tool with the same args | detection of tool spam |
| tool_cost_per_run | estimated tool cost within one run | budget control and expensive-tool detection |
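For intuition, the behavior metrics in the table can be derived from raw tool_call events. The event shape below (dicts with run_id, tool, status) is an illustrative assumption, not a standard format:

```python
# Hedged sketch: deriving a few of the table's metrics from raw tool_call events.
from collections import Counter, defaultdict

events = [
    {"run_id": "r1", "tool": "search_docs", "status": "ok"},
    {"run_id": "r1", "tool": "fetch_url", "status": "error"},
    {"run_id": "r2", "tool": "search_docs", "status": "ok"},
]

tool_calls_total = len(events)
tool_error_rate = sum(e["status"] == "error" for e in events) / tool_calls_total

# Calls per run: how many tool calls each run made.
tool_calls_per_run = dict(Counter(e["run_id"] for e in events))

# Unique tools per run: how many distinct tools each run touched.
unique_tools = defaultdict(set)
for e in events:
    unique_tools[e["run_id"]].add(e["tool"])
unique_tools_per_run = {run: len(tools) for run, tools in unique_tools.items()}
```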

To make metrics practical, they are usually segmented by tool, release, and, when needed, model.

Important: do not add high-cardinality fields (run_id, request_id, args_hash) as labels, or metric storage will quickly be overwhelmed.
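One way to enforce this split, sketched with assumed field names: only low-cardinality fields become metric labels, while high-cardinality identifiers stay in the structured log line:

```python
# Hedged sketch: route event fields either to metric labels (low cardinality)
# or to a structured log payload (high cardinality). Field names are
# illustrative assumptions.
def split_tool_event(event):
    labels = {
        "tool": event["tool"],
        "status": event["status"],
        "release": event["release"],
    }
    # run_id, request_id, args_hash belong in logs, never in labels.
    log_fields = {
        k: v for k, v in event.items()
        if k in ("run_id", "request_id", "args_hash")
    }
    return labels, log_fields

labels, log_fields = split_tool_event({
    "tool": "search_docs", "status": "ok", "release": "2026-03-21",
    "run_id": "r-123", "request_id": "q-9", "args_hash": "ab12",
})
```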

How To Read The Tool Layer

Read the tool layer on three levels, always together: what is called -> how the agent behaves -> what changes over time.

Always focus on time trends and release-to-release differences, not one-off values.

Now look at signal combinations:

  • tool_error_rate ↑ + repeated_tool_calls ↑ -> tool is unstable, agent retries
  • tool_latency_p95 ↑ + tool_cost_per_run ↑ -> degradation in an expensive tool
  • tool_calls_per_run ↑ + unique_tools_per_run ↑ -> excessive workflow complexity
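These combinations can be turned into an automatic triage hint. The 20% growth threshold and the deltas format (relative release-to-release change per metric) are illustrative assumptions:

```python
# Hedged sketch: map co-occurring metric growth to the readings above.
def triage(deltas, growth=0.2):
    # Metrics whose relative change exceeds the growth threshold.
    up = {name for name, change in deltas.items() if change > growth}
    if {"tool_error_rate", "repeated_tool_calls"} <= up:
        return "unstable tool: agent retries"
    if {"tool_latency_p95", "tool_cost_per_run"} <= up:
        return "degradation in an expensive tool"
    if {"tool_calls_per_run", "unique_tools_per_run"} <= up:
        return "excessive workflow complexity"
    return "no known combination"

triage({"tool_error_rate": 0.5, "repeated_tool_calls": 0.4})
# -> "unstable tool: agent retries"
```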

When To Use

A full set of tool metrics is not always required.

For a simple agent with 1-2 tools, sometimes tool_calls_total and tool_error_rate are enough.

But detailed tool usage metrics become critical when:

  • the agent heavily uses external APIs or DBs;
  • retries happen often;
  • tool costs must be controlled;
  • you need to detect tool spam before users are impacted.

Implementation Example

Below is a simplified Prometheus-style instrumentation example for tool usage metrics. The example covers baseline control: call volume, latency, error classes, repeats, and run-level tool load.

PYTHON
import hashlib
import json
import time
from prometheus_client import Counter, Histogram

TOOL_CALL_TOTAL = Counter(
    "agent_tool_call_total",
    "Total tool calls",
    ["tool", "status", "release"],
)

TOOL_ERROR_TOTAL = Counter(
    "agent_tool_error_total",
    "Total tool errors by class",
    ["tool", "error_class", "release"],
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_CALLS_PER_RUN = Histogram(
    "agent_tool_calls_per_run",
    "Number of tool calls per run",
    ["release"],
    buckets=(0, 1, 2, 4, 8, 12, 16, 24, 32),
)

UNIQUE_TOOLS_PER_RUN = Histogram(
    "agent_unique_tools_per_run",
    "Number of unique tools used in run",
    ["release"],
    buckets=(0, 1, 2, 3, 4, 6, 8, 12),
)

REPEATED_TOOL_CALL_TOTAL = Counter(
    "agent_repeated_tool_call_total",
    "Repeated tool calls with same tool+args signature",
    ["tool", "release"],
)

TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Estimated total tool cost in USD",
    ["tool", "release"],
)

STEP_ERROR_TOTAL = Counter(
    "agent_step_error_total",
    "Total non-tool step errors by type and class",
    ["step_type", "error_class", "release"],
)


def stable_hash(value):
    # default=str gives baseline compatibility;
    # in critical systems explicit serialization is better (for example ISO 8601)
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_agent(agent, task, release="2026-03-21"):
    tool_calls = 0
    unique_tools = set()
    seen_signatures = set()

    try:
        for step in agent.iter(task):
            step_type = step.type
            result = None

            if step_type != "tool_call":
                try:
                    result = step.execute()
                except Exception as error:
                    STEP_ERROR_TOTAL.labels(
                        step_type=step_type,
                        error_class=type(error).__name__,
                        release=release,
                    ).inc()
                    raise

                if result and result.is_final:
                    break
                continue

            tool_name = getattr(step, "tool_name", "unknown")
            args = getattr(step, "args", {})

            tool_calls += 1
            unique_tools.add(tool_name)

            signature = (tool_name, stable_hash(args))
            if signature in seen_signatures:
                REPEATED_TOOL_CALL_TOTAL.labels(tool=tool_name, release=release).inc()
            else:
                seen_signatures.add(signature)

            started_at = time.perf_counter()  # monotonic clock for latency measurement
            try:
                result = step.execute()
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="ok", release=release).inc()
                cost_usd = getattr(result, "cost_usd", None)
                if cost_usd:
                    TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(cost_usd)
            except Exception as error:
                TOOL_CALL_TOTAL.labels(tool=tool_name, status="error", release=release).inc()
                TOOL_ERROR_TOTAL.labels(
                    tool=tool_name,
                    error_class=type(error).__name__,
                    release=release,
                ).inc()
                # This example raises.
                # In real agents, the error is often passed to the LLM as observation for retry.
                raise
            finally:
                TOOL_LATENCY_MS.labels(tool=tool_name, release=release).observe(
                    (time.perf_counter() - started_at) * 1000
                )

            if result and result.is_final:
                break
    finally:
        TOOL_CALLS_PER_RUN.labels(release=release).observe(tool_calls)
        UNIQUE_TOOLS_PER_RUN.labels(release=release).observe(len(unique_tools))

# tool_cost_per_run is usually computed on dashboard level:
# sum(agent_tool_cost_usd_total) / run_count

Here is how these metrics can look together on a real dashboard:

| Tool | calls/min | error_rate | p95 latency | Status |
| --- | --- | --- | --- | --- |
| search_docs | 320 | 6.8% | 1.9s | critical: alert |
| fetch_url | 180 | 1.4% | 680ms | warning: p95 growing |
| db_lookup | 95 | 0.3% | 120ms | ok |

For error_class, use a normalized value dictionary to avoid unnecessary cardinality.
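A minimal sketch of such normalization, with an assumed mapping (extend it to the exceptions your tools actually raise):

```python
# Hedged sketch: collapse raw exception names into a small fixed dictionary
# so the error_class label stays low-cardinality.
KNOWN_ERROR_CLASSES = {
    "TimeoutError": "timeout",
    "ConnectionError": "network",
    "HTTPError": "http",
    "ValueError": "bad_args",
}

def normalize_error_class(error):
    # Anything unknown falls into a single "other" bucket.
    return KNOWN_ERROR_CLASSES.get(type(error).__name__, "other")

normalize_error_class(TimeoutError("upstream timed out"))  # -> "timeout"
```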

Investigation

When an alert fires:

  1. find the anomalous tool in metrics;
  2. inspect concrete runs in tracing;
  3. check arguments and responses in logs;
  4. find root cause (tool, agent logic, or external API).

Common Mistakes

Even when tool metrics are already in place, they often fail because of the common mistakes below.

Total calls exist, but no per-tool breakdown

tool_calls_total without a per-tool split is almost useless during incidents: it is hard to quickly find which tool is failing.

Repeated calls are not tracked

Without repeated_tool_calls, it is hard to see that the agent calls the same tool with the same args. This often hides the early phase of tool spam.

No p95 latency by tool

The system can look stable while some users already wait 5+ seconds. For the tool layer, the minimum baseline is p50 and p95.

High-cardinality labels

Adding run_id, request_id, or args_hash to labels quickly overloads metric backends. Keep these in logs, not in labels.

No tool-layer alerts

Without alerts, metrics remain passive telemetry. This makes it easy to miss early signals of budget explosion caused by excessive external API calls.
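As a sketch, alert rules against the metric names from the example above could look like this in Prometheus; the thresholds (5% error rate, 2 s p95) and durations are illustrative assumptions, not recommendations:

```yaml
# Hedged sketch of Prometheus alerting rules for the tool layer.
groups:
  - name: agent-tool-layer
    rules:
      - alert: ToolErrorRateHigh
        expr: |
          sum by (tool) (rate(agent_tool_call_total{status="error"}[5m]))
            / sum by (tool) (rate(agent_tool_call_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} error rate above 5%"
      - alert: ToolLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (tool, le) (rate(agent_tool_latency_ms_bucket[5m]))) > 2000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tool {{ $labels.tool }} p95 latency above 2s"
```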

Self-Check

Before release, run through a short checklist of baseline tool usage metrics.

If baseline observability is missing, the system will be hard to debug in production. Start with run_id, structured logs, and tracing of tool calls.

FAQ

Q: How are tool usage metrics different from general agent metrics?
A: General metrics show overall system state. Tool usage metrics show what is happening specifically in the tool layer.

Q: What is the minimum tool-metric set to start with?
A: Start with tool_calls_total, tool_error_rate, tool_latency_p95, and tool_calls_per_run.

Q: Should args_hash be added to labels?
A: No. It almost always creates high cardinality. For this kind of data, use structured logs.

Q: How do you separate a one-off failure from a system-level tool-layer issue?
A: Check whether the issue repeats for a specific tool across multiple runs and releases. If the same signals (error_class, latency, repeated_tool_calls) repeat, it is systemic.

Next on this topic:

⏱️ 8 min read • Updated March 22, 2026 • Difficulty: ★★★
Integrated: production control (OnceOnly)
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, an engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.