Agent Cost Monitoring

Cost monitoring for AI agents: track model and tool spend per run, set budgets, and catch budget explosions early.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Cost Metrics
  5. How To Read The Cost Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
      • Only one daily total spend metric exists
      • Only token cost is tracked, tool cost is ignored
      • No release and model breakdown
      • High-cardinality labels
      • No burn-rate and cost_p95 alerts
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Cost monitoring for AI agents shows not only total spend, but also where it grows: LLM tokens, external APIs, and repeated agent steps.

Without it, it is easy to miss the point when the system still "works" but is already too expensive for production.

Core Problem

One run can finish successfully but cost 2-3x more than usual.

Two requests with the same final answer can have different unit cost because of extra reasoning steps, retries, or unnecessary tool calls. Without cost monitoring, this is usually visible only after budget overrun.

Next, we break down how to read these signals and find what makes a run expensive.

In production this often looks like:

  • tokens grow in waves after a release;
  • one tool quietly starts consuming most of the budget;
  • retries increase spend even without traffic growth;
  • the team sees the issue only after budget explosion.

That is why the cost layer should be monitored separately, not only through general run metrics.

How It Works

Cost monitoring is built around two signal groups:

  • usage signals (prompt_tokens, completion_tokens, tool_calls, retries);
  • cost signals (llm_cost_usd, tool_cost_usd, total_cost_per_run).

These metrics answer "where and why the system gets more expensive over time". Logs and tracing are needed to explain one concrete expensive run.

Costs grow not only because of traffic volume, but also because of agent behavior. Usage != cost. An agent can solve the same task with the same result but cost much more because of retries, longer reasoning chains, or expensive tools.
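A minimal numeric illustration of this gap, with hypothetical per-token prices and retry counts (not real provider pricing):

```python
# Two runs that produce the same final answer.
# Hypothetical pricing: $2 per 1M prompt tokens, $8 per 1M completion tokens.
PROMPT_PRICE = 2.0 / 1_000_000
COMPLETION_PRICE = 8.0 / 1_000_000

def run_cost(prompt_tokens, completion_tokens, retries=0, tool_cost_usd=0.0):
    """Cost of one run: every retry re-sends the prompt and regenerates output."""
    attempts = 1 + retries
    llm_cost = attempts * (prompt_tokens * PROMPT_PRICE
                           + completion_tokens * COMPLETION_PRICE)
    return llm_cost + tool_cost_usd

# Clean run: one attempt, no tools.
clean = run_cost(prompt_tokens=2_000, completion_tokens=500)

# Same answer, but 2 retries and one paid API call.
noisy = run_cost(prompt_tokens=2_000, completion_tokens=500,
                 retries=2, tool_cost_usd=0.01)

print(round(clean, 6))  # 0.008
print(round(noisy, 6))  # 0.034
```

Same usage per attempt, same answer, roughly 4x the unit cost: exactly the kind of regression a total-spend graph hides.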

Typical Production Cost Metrics

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| token_usage_per_run | how many tokens one run consumes | baseline LLM usage control |
| llm_cost_per_run | LLM cost per run | comparison of models and prompt strategies |
| tool_cost_per_run | external API/tool cost per run | detection of expensive tools |
| total_cost_per_run | total run cost | per-answer unit-cost control |
| cost_p95 | tail of expensive runs | early detection of expensive anomalies |
| budget_burn_rate | budget consumption speed | budget-overrun forecasting |
| cost_per_1k_runs | cost of 1000 runs | release-to-release stability comparison |

budget_burn_rate is usually computed on dashboard level (cost per unit time), not as a separate runtime counter.

To keep metrics practical, they are usually segmented by release, model, tool, and request type.

Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metrics storage will overload quickly.
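To see why, estimate the worst-case number of time series a label set produces; the cardinalities below are assumed example figures:

```python
def series_count(label_cardinalities):
    """Worst-case number of time series: product of distinct values per label."""
    total = 1
    for values in label_cardinalities.values():
        total *= values
    return total

# Safe label set: every label has a small, bounded set of values.
safe = series_count({"model": 4, "tool": 10, "release": 20})

# Adding run_id (say, 1M runs/day) multiplies every existing series.
unsafe = series_count({"model": 4, "tool": 10, "release": 20,
                       "run_id": 1_000_000})

print(safe)    # 800
print(unsafe)  # 800000000
```

Eight hundred series is trivial; eight hundred million will take down most metric backends. Keep run-level identifiers in logs and traces instead.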

How To Read The Cost Layer

What is consumed -> how the agent behaves -> how much it actually costs: these are three levels you should always read together.

Always track time trends and release-to-release differences, not single values.

Now look at signal combinations:

  • token_usage_per_run ↑ + llm_cost_per_run ↑ -> agent spends more tokens per run;
  • tool_cost_per_run ↑ + total_cost_per_run ↑ -> over-tooling or an expensive tool path;
  • llm_cost_per_run ↑ + tool_cost_per_run ~= stable -> issue in prompt or reasoning, not tools;
  • budget_burn_rate ↑ + run_count ~= stable -> unit-cost regression;
  • cost_p95 ↑ + budget_burn_rate ↑ -> expensive anomalous runs are accelerating spend.
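The signal combinations above can be sketched as a small triage helper; the threshold and the rule wording are assumptions, not standard values:

```python
def triage(deltas, rel_threshold=0.2):
    """Map relative metric changes (0.3 == +30%) to the likely cost driver.

    Mirrors the signal combinations above; returns the first matching rule.
    """
    up = lambda m: deltas.get(m, 0.0) > rel_threshold
    flat = lambda m: abs(deltas.get(m, 0.0)) <= rel_threshold

    if up("token_usage_per_run") and up("llm_cost_per_run"):
        return "agent spends more tokens per run"
    if up("tool_cost_per_run") and up("total_cost_per_run"):
        return "over-tooling or expensive tool path"
    if up("llm_cost_per_run") and flat("tool_cost_per_run"):
        return "prompt or reasoning issue, not tools"
    if up("budget_burn_rate") and flat("run_count"):
        return "unit-cost regression"
    if up("cost_p95") and up("budget_burn_rate"):
        return "expensive anomalous runs accelerating spend"
    return "no clear single driver"

print(triage({"budget_burn_rate": 0.5, "run_count": 0.02}))
# unit-cost regression
```

In practice these rules live in dashboards and alert annotations rather than code, but encoding them once keeps the whole team reading the same signals the same way.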

When To Use

Full cost monitoring is not always required.

For a simple prototype, baseline token_usage and a daily spend limit can be enough.

But detailed cost monitoring becomes critical when:

  • the system is already in production with budget constraints;
  • the agent uses multiple tools with paid APIs;
  • releases are frequent and cost regressions must be visible;
  • traffic must scale without losing budget control.

Implementation Example

Below is a simplified example of Prometheus-style cost-metric instrumentation. It shows baseline tracking of LLM cost, tool cost, and total run price.

PYTHON
from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)

LLM_COST_USD_TOTAL = Counter(
    "agent_llm_cost_usd_total",
    "Total LLM cost in USD",
    ["model", "release"],
)

TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Total tool/API cost in USD",
    ["tool", "release"],
)

TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)

RUN_COST_USD = Histogram(
    "agent_run_cost_usd",
    "Cost per run in USD",
    ["release"],
    buckets=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3),
)

BUDGET_BREACH_TOTAL = Counter(
    "agent_budget_breach_total",
    "Total runs that crossed cost budget",
    ["release"],
)

LLM_PRICING = {
    # WARNING: example pricing (can be outdated).
    # In production, load these values from config or provider API.
    "gpt-4.1": {
        "prompt": 0.0000015,
        "completion": 0.0000020,
    }
}


def estimate_llm_cost_usd(model, prompt_tokens, completion_tokens):
    # WARNING: replace with actual provider pricing
    pricing = LLM_PRICING.get(model)
    if not pricing:
        # WARNING: unknown model - cost will be reported as 0 (underestimated)
        return 0.0
    prompt_cost = prompt_tokens * pricing.get("prompt", 0)
    completion_cost = completion_tokens * pricing.get("completion", 0)
    return prompt_cost + completion_cost


def run_agent(agent, task, budget_limit_usd=0.25, release="2026-03-21"):
    run_status = "ok"
    stop_reason = "max_steps"
    run_cost_usd = 0.0

    try:
        for step in agent.iter(task):
            step_type = step.type
            try:
                result = step.execute()
            except Exception as error:
                run_status = "error"

                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"

                raise

            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                usage = getattr(result, "token_usage", {}) or {}
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)

                TOKEN_USAGE_TOTAL.labels(model=model, token_type="prompt", release=release).inc(
                    prompt_tokens
                )
                TOKEN_USAGE_TOTAL.labels(
                    model=model, token_type="completion", release=release
                ).inc(completion_tokens)

                llm_cost = estimate_llm_cost_usd(model, prompt_tokens, completion_tokens)
                run_cost_usd += llm_cost
                LLM_COST_USD_TOTAL.labels(model=model, release=release).inc(llm_cost)

            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_cost = float(getattr(result, "cost_usd", 0.0) or 0.0)
                run_cost_usd += tool_cost
                TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(tool_cost)

            if result is not None and getattr(result, "is_final", False):
                stop_reason = "completed"
                break
    finally:
        RUN_COST_USD.labels(release=release).observe(run_cost_usd)
        if run_cost_usd > budget_limit_usd:
            BUDGET_BREACH_TOTAL.labels(release=release).inc()
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()

# cost_per_1k_runs is usually computed on dashboard level:
# (sum(agent_run_cost_usd) / run_count) * 1000
# budget_burn_rate is usually computed on dashboard level:
# cost per unit time (for example USD/hour), not as a separate counter in code.
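Those two dashboard-level formulas can be sketched over raw run records; the record shape here is an assumption, not part of the instrumentation above:

```python
def cost_per_1k_runs(run_costs_usd):
    """Average run cost scaled to 1000 runs."""
    if not run_costs_usd:
        return 0.0
    return sum(run_costs_usd) / len(run_costs_usd) * 1000

def budget_burn_rate_usd_per_hour(runs):
    """runs: list of (timestamp_seconds, cost_usd) inside the query window."""
    if len(runs) < 2:
        return 0.0
    timestamps = [t for t, _ in runs]
    window_hours = (max(timestamps) - min(timestamps)) / 3600
    if window_hours == 0:
        return 0.0
    return sum(cost for _, cost in runs) / window_hours

costs = [0.02, 0.03, 0.05, 0.02]
print(round(cost_per_1k_runs(costs), 6))  # 30.0

runs = [(0, 0.02), (1800, 0.03), (3600, 0.05)]
print(round(budget_burn_rate_usd_per_hour(runs), 6))  # 0.1
```

In Prometheus itself these would typically be rate() and sum() expressions over the counters defined above; the Python version just makes the arithmetic explicit.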

Here is how these metrics can look together on a real dashboard:

| Segment | cost_per_run | cost_p95 | burn_rate (hour) | Status |
| --- | --- | --- | --- | --- |
| gpt-4.1 + tools | $0.084 | $0.29 | $42/h | critical: budget risk |
| mini-model + cache | $0.021 | $0.07 | $11/h | ok |
| research workflow | $0.136 | $0.41 | $58/h | warning: p95 growing |

Investigation

When a cost alert fires:

  1. find the anomalous segment (model, tool, release);
  2. inspect expensive runs in tracing;
  3. check retries, stop_reason, and tool path in logs;
  4. find root cause (prompt, agent logic, expensive API, wrong routing).
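Steps 1-2 can be approximated over structured run logs; the field names (tool, cost_usd) are assumed, not a standard schema:

```python
from collections import defaultdict

def top_expensive_segments(run_logs, key="tool", top_n=3):
    """Group run cost by a segment field and return the most expensive segments."""
    totals = defaultdict(float)
    for run in run_logs:
        totals[run.get(key, "unknown")] += run.get("cost_usd", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

run_logs = [
    {"tool": "search_api", "cost_usd": 0.04},
    {"tool": "search_api", "cost_usd": 0.05},
    {"tool": "vector_db", "cost_usd": 0.01},
]
for segment, cost in top_expensive_segments(run_logs):
    print(segment, round(cost, 6))
# search_api 0.09
# vector_db 0.01
```

The same grouping works for key="model" or key="release"; once the anomalous segment is found, individual expensive runs go to tracing, not metrics.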

Common Mistakes

Even when cost metrics exist, they often fail because of the common mistakes below.

Only one daily total spend metric exists

A daily total does not show which run or segment became more expensive. Without cost_per_run and cost_p95, issues are usually detected too late.

Only token cost is tracked, tool cost is ignored

In many agent systems, external API calls are the expensive part. Without tool_cost_per_run, it is easy to miss a budget explosion driven by tools.

No release and model breakdown

Without segmentation, it is hard to prove that a new release or model increased unit cost.

High-cardinality labels

Adding run_id, request_id, or session_id to labels quickly overloads metric backends. Keep this data in logs and tracing.

No burn-rate and cost_p95 alerts

Without alerts, problems accumulate silently until they hit the budget. This often appears together with token overuse.

Self-Check

Below is a short checklist for baseline cost monitoring before release:

  • token_usage_per_run is tracked per model and release;
  • llm_cost_per_run and tool_cost_per_run are tracked separately;
  • total_cost_per_run sums LLM and tool spend in one currency;
  • cost_p95 is visible on the dashboard;
  • budget_burn_rate is computed per unit time;
  • metrics are segmented by release, model, tool, and request type;
  • labels contain no high-cardinality fields (run_id, request_id, user_id);
  • alerts exist for budget_burn_rate and cost_p95;
  • run_id, structured logs, and tool-call tracing are available for investigating expensive runs.

FAQ

Q: How is cost monitoring different from token monitoring?
A: Tokens are only one part of spend. Cost monitoring includes LLM tokens, paid tool/API calls, and total run unit cost.

Q: What is the minimum cost-metric set to start with?
A: Start with token_usage_per_run, total_cost_per_run, cost_p95, and budget_burn_rate.

Q: How to compute cost_per_run with multiple providers?
A: Normalize all step costs to one currency (usually USD) and sum LLM + tool costs inside one run.
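A minimal sketch of that normalization, assuming per-step cost records with a currency field and a static FX table (in production, rates would come from config or a rates service):

```python
# Assumed FX rates to USD; example values only.
FX_TO_USD = {"USD": 1.0, "EUR": 1.08}

def total_cost_usd(steps):
    """Sum LLM and tool step costs inside one run, normalized to USD."""
    total = 0.0
    for step in steps:
        rate = FX_TO_USD.get(step.get("currency", "USD"))
        if rate is None:
            raise ValueError(f"unknown currency: {step['currency']}")
        total += step["cost"] * rate
    return total

steps = [
    {"cost": 0.01, "currency": "USD"},  # LLM call
    {"cost": 0.02, "currency": "EUR"},  # paid tool API
]
print(round(total_cost_usd(steps), 6))  # 0.0316
```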

Q: How to separate traffic growth from unit-cost regression?
A: Check run_count and cost_per_run together. If traffic is stable but cost_per_run grows, that is unit-cost regression.

Next on this topic:

⏱️ 8 min read • Updated March 22, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.