Agent Cost Monitoring

Cost monitoring for AI agents: track model and tool spend per run, set budgets, and catch budget explosions early.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Cost Metrics
  5. How To Read The Cost Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
      • Only one daily total spend metric exists
      • Only token cost is tracked, tool cost is ignored
      • No release and model breakdown
      • High-cardinality labels
      • No burn-rate and cost_p95 alerts
  10. Self-Check
  11. FAQ
  12. Related Pages

Idea In 30 Seconds

Cost monitoring for AI agents shows not only total spend, but also where it grows: LLM tokens, external APIs, and repeated agent steps.

Without it, it is easy to miss the point when the system still "works" but is already too expensive for production.

Core Problem

One run can finish successfully but cost 2-3x more than usual.

Two requests with the same final answer can have different unit cost because of extra reasoning steps, retries, or unnecessary tool calls. Without cost monitoring, this is usually visible only after budget overrun.

Next, we break down how to read these signals and find what makes a run expensive.

In production this often looks like:

  • tokens grow in waves after a release;
  • one tool quietly starts consuming most of the budget;
  • retries increase spend even without traffic growth;
  • the team sees the issue only after budget explosion.

That is why the cost layer should be monitored separately, not only through general run metrics.

How It Works

Cost monitoring is built around two signal groups:

  • usage signals (prompt_tokens, completion_tokens, tool_calls, retries);
  • cost signals (llm_cost_usd, tool_cost_usd, total_cost_per_run).

These metrics answer "where and why the system gets more expensive over time". Logs and tracing are needed to explain one concrete expensive run.

Costs grow not only because of traffic volume, but also because of agent behavior. Usage != cost. An agent can solve the same task with the same result but cost much more because of retries, longer reasoning chains, or expensive tools.
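A minimal numeric illustration of this gap, with hypothetical per-token prices and retry counts (not real provider pricing):

```python
# Two runs that produce the same final answer.
# Hypothetical pricing: $2 per 1M prompt tokens, $8 per 1M completion tokens.
PROMPT_PRICE = 2.0 / 1_000_000
COMPLETION_PRICE = 8.0 / 1_000_000

def run_cost(prompt_tokens, completion_tokens, retries=0, tool_cost_usd=0.0):
    """Cost of one run: every retry re-sends the prompt and regenerates output."""
    attempts = 1 + retries
    llm_cost = attempts * (prompt_tokens * PROMPT_PRICE
                           + completion_tokens * COMPLETION_PRICE)
    return llm_cost + tool_cost_usd

# Clean run: one attempt, no tools.
clean = run_cost(prompt_tokens=2_000, completion_tokens=500)

# Same answer, but 2 retries and one paid API call.
noisy = run_cost(prompt_tokens=2_000, completion_tokens=500,
                 retries=2, tool_cost_usd=0.01)

print(round(clean, 6))  # 0.008
print(round(noisy, 6))  # 0.034
```

Same usage per attempt, same answer, roughly 4x the unit cost: exactly the kind of regression a total-spend graph hides.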

Typical Production Cost Metrics

| Metric | What it shows | Why it matters |
| --- | --- | --- |
| token_usage_per_run | how many tokens one run consumes | baseline LLM usage control |
| llm_cost_per_run | LLM cost per run | comparison of models and prompt strategies |
| tool_cost_per_run | external API/tool cost per run | detection of expensive tools |
| total_cost_per_run | total run cost | per-answer unit-cost control |
| cost_p95 | tail of expensive runs | early detection of expensive anomalies |
| budget_burn_rate | budget consumption speed | budget-overrun forecasting |
| cost_per_1k_runs | cost of 1000 runs | release-to-release stability comparison |

budget_burn_rate is usually computed on dashboard level (cost per unit time), not as a separate runtime counter.

To keep metrics practical, they are usually segmented by release, model, tool, and request type.

Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metrics storage will overload quickly.
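To see why, estimate the worst-case number of time series a label set produces; the cardinalities below are assumed example figures:

```python
def series_count(label_cardinalities):
    """Worst-case number of time series: product of distinct values per label."""
    total = 1
    for values in label_cardinalities.values():
        total *= values
    return total

# Safe label set: every label has a small, bounded set of values.
safe = series_count({"model": 4, "tool": 10, "release": 20})

# Adding run_id (say, 1M runs/day) multiplies every existing series.
unsafe = series_count({"model": 4, "tool": 10, "release": 20,
                       "run_id": 1_000_000})

print(safe)    # 800
print(unsafe)  # 800000000
```

Eight hundred series is trivial; eight hundred million will take down most metric backends. Keep run-level identifiers in logs and traces instead.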

How To Read The Cost Layer

What is consumed -> how the agent behaves -> how much it actually costs: these are three levels you should always read together.

Always track time trends and release-to-release differences, not single values.

Now look at signal combinations:

  • token_usage_per_run ↑ + llm_cost_per_run ↑ -> agent spends more tokens per run;
  • tool_cost_per_run ↑ + total_cost_per_run ↑ -> over-tooling or an expensive tool path;
  • llm_cost_per_run ↑ + tool_cost_per_run ~= stable -> issue in prompt or reasoning, not tools;
  • budget_burn_rate ↑ + run_count ~= stable -> unit-cost regression;
  • cost_p95 ↑ + budget_burn_rate ↑ -> expensive anomalous runs are accelerating spend.
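The signal combinations above can be sketched as a small triage helper; the threshold and the rule wording are assumptions, not standard values:

```python
def triage(deltas, rel_threshold=0.2):
    """Map relative metric changes (0.3 == +30%) to the likely cost driver.

    Mirrors the signal combinations above; returns the first matching rule.
    """
    up = lambda m: deltas.get(m, 0.0) > rel_threshold
    flat = lambda m: abs(deltas.get(m, 0.0)) <= rel_threshold

    if up("token_usage_per_run") and up("llm_cost_per_run"):
        return "agent spends more tokens per run"
    if up("tool_cost_per_run") and up("total_cost_per_run"):
        return "over-tooling or expensive tool path"
    if up("llm_cost_per_run") and flat("tool_cost_per_run"):
        return "prompt or reasoning issue, not tools"
    if up("budget_burn_rate") and flat("run_count"):
        return "unit-cost regression"
    if up("cost_p95") and up("budget_burn_rate"):
        return "expensive anomalous runs accelerating spend"
    return "no clear single driver"

print(triage({"budget_burn_rate": 0.5, "run_count": 0.02}))
# unit-cost regression
```

In practice these rules live in dashboards and alert annotations rather than code, but encoding them once keeps the whole team reading the same signals the same way.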

When To Use

Full cost monitoring is not always required.

For a simple prototype, baseline token_usage and a daily spend limit can be enough.

But detailed cost monitoring becomes critical when:

  • the system is already in production with budget constraints;
  • the agent uses multiple tools with paid APIs;
  • releases are frequent and cost regressions must be visible;
  • traffic must scale without losing budget control.

Implementation Example

Below is a simplified example of Prometheus-style cost-metric instrumentation. It shows baseline tracking of LLM cost, tool cost, and total run price.

PYTHON
from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)

LLM_COST_USD_TOTAL = Counter(
    "agent_llm_cost_usd_total",
    "Total LLM cost in USD",
    ["model", "release"],
)

TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Total tool/API cost in USD",
    ["tool", "release"],
)

TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)

RUN_COST_USD = Histogram(
    "agent_run_cost_usd",
    "Cost per run in USD",
    ["release"],
    buckets=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3),
)

BUDGET_BREACH_TOTAL = Counter(
    "agent_budget_breach_total",
    "Total runs that crossed cost budget",
    ["release"],
)

LLM_PRICING = {
    # WARNING: example pricing (can be outdated).
    # In production, load these values from config or provider API.
    "gpt-4.1": {
        "prompt": 0.0000015,
        "completion": 0.0000020,
    }
}


def estimate_llm_cost_usd(model, prompt_tokens, completion_tokens):
    # WARNING: replace with actual provider pricing
    pricing = LLM_PRICING.get(model)
    if not pricing:
        # WARNING: unknown model - cost will be reported as 0 (underestimated)
        return 0.0
    prompt_cost = prompt_tokens * pricing.get("prompt", 0)
    completion_cost = completion_tokens * pricing.get("completion", 0)
    return prompt_cost + completion_cost


def run_agent(agent, task, budget_limit_usd=0.25, release="2026-03-21"):
    run_status = "ok"
    stop_reason = "max_steps"
    run_cost_usd = 0.0

    try:
        for step in agent.iter(task):
            step_type = step.type
            try:
                result = step.execute()
            except Exception as error:
                run_status = "error"

                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"

                raise

            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                usage = getattr(result, "token_usage", {}) or {}
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)

                TOKEN_USAGE_TOTAL.labels(model=model, token_type="prompt", release=release).inc(
                    prompt_tokens
                )
                TOKEN_USAGE_TOTAL.labels(
                    model=model, token_type="completion", release=release
                ).inc(completion_tokens)

                llm_cost = estimate_llm_cost_usd(model, prompt_tokens, completion_tokens)
                run_cost_usd += llm_cost
                LLM_COST_USD_TOTAL.labels(model=model, release=release).inc(llm_cost)

            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_cost = float(getattr(result, "cost_usd", 0.0) or 0.0)
                run_cost_usd += tool_cost
                TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(tool_cost)

            if result is not None and getattr(result, "is_final", False):
                stop_reason = "completed"
                break
    finally:
        RUN_COST_USD.labels(release=release).observe(run_cost_usd)
        if run_cost_usd > budget_limit_usd:
            BUDGET_BREACH_TOTAL.labels(release=release).inc()
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()

# cost_per_1k_runs is usually computed on dashboard level:
# (sum(agent_run_cost_usd) / run_count) * 1000
# budget_burn_rate is usually computed on dashboard level:
# cost per unit time (for example USD/hour), not as a separate counter in code.
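Those two dashboard-level formulas can be sketched over raw run records; the record shape here is an assumption, not part of the instrumentation above:

```python
def cost_per_1k_runs(run_costs_usd):
    """Average run cost scaled to 1000 runs."""
    if not run_costs_usd:
        return 0.0
    return sum(run_costs_usd) / len(run_costs_usd) * 1000

def budget_burn_rate_usd_per_hour(runs):
    """runs: list of (timestamp_seconds, cost_usd) inside the query window."""
    if len(runs) < 2:
        return 0.0
    timestamps = [t for t, _ in runs]
    window_hours = (max(timestamps) - min(timestamps)) / 3600
    if window_hours == 0:
        return 0.0
    return sum(cost for _, cost in runs) / window_hours

costs = [0.02, 0.03, 0.05, 0.02]
print(round(cost_per_1k_runs(costs), 6))  # 30.0

runs = [(0, 0.02), (1800, 0.03), (3600, 0.05)]
print(round(budget_burn_rate_usd_per_hour(runs), 6))  # 0.1
```

In Prometheus itself these would typically be rate() and sum() expressions over the counters defined above; the Python version just makes the arithmetic explicit.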

Here is how these metrics can look together on a real dashboard:

| Segment | cost_per_run | cost_p95 | burn_rate (hour) | Status |
| --- | --- | --- | --- | --- |
| gpt-4.1 + tools | $0.084 | $0.29 | $42/h | critical: budget risk |
| mini-model + cache | $0.021 | $0.07 | $11/h | ok |
| research workflow | $0.136 | $0.41 | $58/h | warning: p95 growing |

Investigation

When a cost alert fires:

  1. find the anomalous segment (model, tool, release);
  2. inspect expensive runs in tracing;
  3. check retries, stop_reason, and tool path in logs;
  4. find root cause (prompt, agent logic, expensive API, wrong routing).
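Steps 1-2 can be approximated over structured run logs; the field names (tool, cost_usd) are assumed, not a standard schema:

```python
from collections import defaultdict

def top_expensive_segments(run_logs, key="tool", top_n=3):
    """Group run cost by a segment field and return the most expensive segments."""
    totals = defaultdict(float)
    for run in run_logs:
        totals[run.get(key, "unknown")] += run.get("cost_usd", 0.0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

run_logs = [
    {"tool": "search_api", "cost_usd": 0.04},
    {"tool": "search_api", "cost_usd": 0.05},
    {"tool": "vector_db", "cost_usd": 0.01},
]
for segment, cost in top_expensive_segments(run_logs):
    print(segment, round(cost, 6))
# search_api 0.09
# vector_db 0.01
```

The same grouping works for key="model" or key="release"; once the anomalous segment is found, individual expensive runs go to tracing, not metrics.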

Common Mistakes

Even when cost metrics exist, they often fail because of the common mistakes below.

Only one daily total spend metric exists

A daily total does not show which run or segment became more expensive. Without cost_per_run and cost_p95, issues are usually detected too late.

Only token cost is tracked, tool cost is ignored

In many agent systems, external API calls are the expensive part. Without tool_cost_per_run, it is easy to miss a budget explosion driven by tools.

No release and model breakdown

Without segmentation, it is hard to prove that a new release or model increased unit cost.

High-cardinality labels

Adding run_id, request_id, or session_id to labels quickly overloads metric backends. Keep this data in logs and tracing.

No burn-rate and cost_p95 alerts

Without alerts, problems accumulate silently until they hit the budget. This often appears together with token overuse.

Self-Check

Below is a short checklist for baseline cost monitoring before release:

  • token_usage_per_run is tracked per model and release;
  • llm_cost_per_run and tool_cost_per_run are tracked separately;
  • total_cost_per_run sums LLM and tool spend in one currency;
  • cost_p95 is visible on the dashboard;
  • budget_burn_rate is computed per unit time;
  • metrics are segmented by release, model, tool, and request type;
  • labels contain no high-cardinality fields (run_id, request_id, user_id);
  • alerts exist for budget_burn_rate and cost_p95;
  • run_id, structured logs, and tool-call tracing are available for investigating expensive runs.

FAQ

Q: How is cost monitoring different from token monitoring?
A: Tokens are only one part of spend. Cost monitoring includes LLM tokens, paid tool/API calls, and total run unit cost.

Q: What is the minimum cost-metric set to start with?
A: Start with token_usage_per_run, total_cost_per_run, cost_p95, and budget_burn_rate.

Q: How to compute cost_per_run with multiple providers?
A: Normalize all step costs to one currency (usually USD) and sum LLM + tool costs inside one run.
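A minimal sketch of that normalization, assuming per-step cost records with a currency field and a static FX table (in production, rates would come from config or a rates service):

```python
# Assumed FX rates to USD; example values only.
FX_TO_USD = {"USD": 1.0, "EUR": 1.08}

def total_cost_usd(steps):
    """Sum LLM and tool step costs inside one run, normalized to USD."""
    total = 0.0
    for step in steps:
        rate = FX_TO_USD.get(step.get("currency", "USD"))
        if rate is None:
            raise ValueError(f"unknown currency: {step['currency']}")
        total += step["cost"] * rate
    return total

steps = [
    {"cost": 0.01, "currency": "USD"},  # LLM call
    {"cost": 0.02, "currency": "EUR"},  # paid tool API
]
print(round(total_cost_usd(steps), 6))  # 0.0316
```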

Q: How to separate traffic growth from unit-cost regression?
A: Check run_count and cost_per_run together. If traffic is stable but cost_per_run grows, that is unit-cost regression.

Next on this topic:

⏱️ 8 min read • Updated March 22, 2026 • Difficulty: ★★★
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick, engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.