Idea In 30 Seconds
Cost monitoring for AI agents shows not only total spend, but also where it grows: LLM tokens, external APIs, and repeated agent steps.
Without it, it is easy to miss the moment when the system still "works" but is already too expensive for production.
Core Problem
One run can finish successfully but cost 2-3x more than usual.
Two requests with the same final answer can have different unit cost because of extra reasoning steps, retries, or unnecessary tool calls. Without cost monitoring, this is usually visible only after budget overrun.
Next, we break down how to read these signals and find what makes a run expensive.
In production this often looks like:
- tokens grow in waves after a release;
- one tool quietly starts consuming most of the budget;
- retries increase spend even without traffic growth;
- the team sees the issue only after budget explosion.
That is why the cost layer should be monitored separately, not only through general run metrics.
How It Works
Cost monitoring is built around two signal groups:
- usage signals (prompt_tokens, completion_tokens, tool_calls, retries);
- cost signals (llm_cost_usd, tool_cost_usd, total_cost_per_run).
These metrics answer "where and why the system gets more expensive over time". Logs and tracing are needed to explain one concrete expensive run.
Costs grow not only because of traffic volume, but also because of agent behavior. Usage != cost. An agent can solve the same task with the same result but cost much more because of retries, longer reasoning chains, or expensive tools.
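To make "usage != cost" concrete, here is a hedged sketch: the token counts and per-token prices below are invented for illustration, not real provider rates. Two runs produce the same final answer, but retries and a longer reasoning chain make one several times more expensive:

```python
# Hypothetical per-token prices in USD; real provider pricing differs.
PRICE = {"prompt": 0.0000015, "completion": 0.0000020}

def run_cost(llm_calls):
    """Sum cost over all LLM calls in a run, including retried ones."""
    return sum(
        c["prompt_tokens"] * PRICE["prompt"]
        + c["completion_tokens"] * PRICE["completion"]
        for c in llm_calls
    )

# Run A: answers in one call.
run_a = [{"prompt_tokens": 1200, "completion_tokens": 300}]

# Run B: same final answer, but a longer reasoning chain plus two retries.
run_b = [
    {"prompt_tokens": 1200, "completion_tokens": 900},  # long reasoning
    {"prompt_tokens": 1400, "completion_tokens": 900},  # retry 1
    {"prompt_tokens": 1600, "completion_tokens": 900},  # retry 2
]

print(f"run A: ${run_cost(run_a):.4f}")
print(f"run B: ${run_cost(run_b):.4f}")
```

Both runs would look identical in a success-rate dashboard; only a per-run cost metric exposes the difference.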
Typical Production Cost Metrics
| Metric | What it shows | Why it matters |
|---|---|---|
| token_usage_per_run | how many tokens one run consumes | baseline LLM usage control |
| llm_cost_per_run | LLM cost per run | comparison of models and prompt strategies |
| tool_cost_per_run | external API/tool cost per run | detection of expensive tools |
| total_cost_per_run | total run cost | end-to-end unit-cost control |
| cost_p95 | tail of expensive runs | early detection of expensive anomalies |
| budget_burn_rate | budget consumption speed | budget-overrun forecasting |
| cost_per_1k_runs | cost of 1000 runs | release-to-release stability comparison |
budget_burn_rate is usually computed on dashboard level (cost per unit time), not as a separate runtime counter.
To keep metrics practical, they are usually segmented by release, model, tool, and request type.
Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metrics storage will overload quickly.
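A back-of-the-envelope check makes the cardinality warning concrete: the number of time series one metric produces is roughly the product of the distinct values per label. The counts below are illustrative assumptions:

```python
from math import prod

def series_estimate(label_value_counts):
    """Rough upper bound on time series for one metric:
    the product of the cardinalities of its labels."""
    return prod(label_value_counts.values())

# Bounded label sets: a few models, tools, and releases.
good = {"model": 5, "tool": 20, "release": 30}

# Adding run_id creates one new series per run.
bad = {**good, "run_id": 1_000_000}

print(series_estimate(good))  # thousands of series: fine
print(series_estimate(bad))   # billions of series: backend overload
```

This is why run_id, request_id, and user_id belong in logs and traces, where per-event cardinality is expected, not in metric labels.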
How To Read The Cost Layer
What is consumed -> how the agent behaves -> how much it actually costs. These are three levels you should always read together.
Always track time trends and release-to-release differences, not single values.
Now look at signal combinations:
- token_usage_per_run ↑ + llm_cost_per_run ↑ -> agent spends more tokens per run;
- tool_cost_per_run ↑ + total_cost_per_run ↑ -> over-tooling or an expensive tool path;
- llm_cost_per_run ↑ + tool_cost_per_run ~= stable -> issue in prompt or reasoning, not tools;
- budget_burn_rate ↑ + run_count ~= stable -> unit-cost regression;
- cost_p95 ↑ + budget_burn_rate ↑ -> expensive anomalous runs are accelerating spend.
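These combination rules can be encoded as a small triage helper. The function, thresholds, and finding strings below are assumptions for illustration, not a standard:

```python
def diagnose(delta, grow=0.10, stable=0.05):
    """Map relative metric changes (e.g. release-to-release) to likely causes.
    `delta` holds fractional changes: 0.25 means +25%.
    The 10%/5% thresholds are illustrative; tune them per system."""
    up = lambda m: delta.get(m, 0.0) > grow
    flat = lambda m: abs(delta.get(m, 0.0)) <= stable
    findings = []
    if up("token_usage_per_run") and up("llm_cost_per_run"):
        findings.append("agent spends more tokens per run")
    if up("tool_cost_per_run") and up("total_cost_per_run"):
        findings.append("over-tooling or an expensive tool path")
    if up("llm_cost_per_run") and flat("tool_cost_per_run"):
        findings.append("issue in prompt or reasoning, not tools")
    if up("budget_burn_rate") and flat("run_count"):
        findings.append("unit-cost regression")
    if up("cost_p95") and up("budget_burn_rate"):
        findings.append("expensive anomalous runs are accelerating spend")
    return findings

# Burn rate +40% while traffic is flat: classic unit-cost regression.
print(diagnose({"budget_burn_rate": 0.4, "run_count": 0.01}))
```

In practice such rules live in an alerting or reporting layer rather than application code, but the logic is the same.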
When To Use
Full cost monitoring is not always required.
For a simple prototype, baseline token_usage and a daily spend limit can be enough.
But detailed cost monitoring becomes critical when:
- the system is already in production with budget constraints;
- the agent uses multiple tools with paid APIs;
- releases are frequent and cost regressions must be visible;
- traffic must scale without losing budget control.
Implementation Example
Below is a simplified Prometheus-style cost metrics instrumentation example. It shows baseline control of LLM cost, tool cost, and total run price.
```python
from prometheus_client import Counter, Histogram

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)
LLM_COST_USD_TOTAL = Counter(
    "agent_llm_cost_usd_total",
    "Total LLM cost in USD",
    ["model", "release"],
)
TOOL_COST_USD_TOTAL = Counter(
    "agent_tool_cost_usd_total",
    "Total tool/API cost in USD",
    ["tool", "release"],
)
TOKEN_USAGE_TOTAL = Counter(
    "agent_token_usage_total",
    "Total LLM tokens",
    ["model", "token_type", "release"],
)
RUN_COST_USD = Histogram(
    "agent_run_cost_usd",
    "Cost per run in USD",
    ["release"],
    buckets=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3),
)
BUDGET_BREACH_TOTAL = Counter(
    "agent_budget_breach_total",
    "Total runs that crossed cost budget",
    ["release"],
)

LLM_PRICING = {
    # WARNING: example pricing (can be outdated).
    # In production, load these values from config or provider API.
    "gpt-4.1": {
        "prompt": 0.0000015,
        "completion": 0.0000020,
    }
}


def estimate_llm_cost_usd(model, prompt_tokens, completion_tokens):
    # WARNING: replace with actual provider pricing
    pricing = LLM_PRICING.get(model)
    if not pricing:
        # WARNING: unknown model -> cost will be reported as 0 (underestimated)
        return 0.0
    prompt_cost = prompt_tokens * pricing.get("prompt", 0)
    completion_cost = completion_tokens * pricing.get("completion", 0)
    return prompt_cost + completion_cost


def run_agent(agent, task, budget_limit_usd=0.25, release="2026-03-21"):
    run_status = "ok"
    stop_reason = "max_steps"
    run_cost_usd = 0.0
    try:
        for step in agent.iter(task):
            step_type = step.type
            try:
                result = step.execute()
            except Exception:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"
                raise
            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                usage = getattr(result, "token_usage", {}) or {}
                prompt_tokens = usage.get("prompt_tokens", 0)
                completion_tokens = usage.get("completion_tokens", 0)
                TOKEN_USAGE_TOTAL.labels(
                    model=model, token_type="prompt", release=release
                ).inc(prompt_tokens)
                TOKEN_USAGE_TOTAL.labels(
                    model=model, token_type="completion", release=release
                ).inc(completion_tokens)
                llm_cost = estimate_llm_cost_usd(model, prompt_tokens, completion_tokens)
                run_cost_usd += llm_cost
                LLM_COST_USD_TOTAL.labels(model=model, release=release).inc(llm_cost)
            if step_type == "tool_call":
                tool_name = getattr(step, "tool_name", "unknown")
                tool_cost = float(getattr(result, "cost_usd", 0.0) or 0.0)
                run_cost_usd += tool_cost
                TOOL_COST_USD_TOTAL.labels(tool=tool_name, release=release).inc(tool_cost)
            if result and result.is_final:
                stop_reason = "completed"
                break
    finally:
        RUN_COST_USD.labels(release=release).observe(run_cost_usd)
        if run_cost_usd > budget_limit_usd:
            BUDGET_BREACH_TOTAL.labels(release=release).inc()
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()

# cost_per_1k_runs is usually computed on dashboard level:
#   (sum(agent_run_cost_usd) / run_count) * 1000
# budget_burn_rate is also a dashboard-level value:
#   cost per unit time (for example USD/hour), not a separate counter in code.
```
Here is how these metrics can look together on a real dashboard:
| Segment | cost_per_run | cost_p95 | burn_rate (hour) | Status |
|---|---|---|---|---|
| gpt-4.1 + tools | $0.084 | $0.29 | $42/h | critical: budget risk |
| mini-model + cache | $0.021 | $0.07 | $11/h | ok |
| research workflow | $0.136 | $0.41 | $58/h | warning: p95 growing |
Investigation
When a cost alert fires:
- find the anomalous segment (model, tool, release);
- inspect expensive runs in tracing;
- check retries, stop_reason, and tool path in logs;
- find root cause (prompt, agent logic, expensive API, wrong routing).
Common Mistakes
Even when cost metrics exist, they often fail because of the common mistakes below.
Only one daily total spend metric exists
A daily total does not show which run or segment became more expensive.
Without cost_per_run and cost_p95, issues are usually detected too late.
Only token cost is tracked, tool cost is ignored
In many agent systems, external API calls are the expensive part.
Without tool_cost_per_run, it is easy to miss the early signs of a budget explosion.
No release and model breakdown
Without segmentation, it is hard to prove that a new release or model increased unit cost.
High-cardinality labels
Adding run_id, request_id, or session_id to labels quickly overloads metric backends.
Keep this data in logs and tracing.
No burn-rate and cost_p95 alerts
Without alerts, problems accumulate silently until they hit the budget. This often appears together with token overuse.
Self-Check
Below is a short checklist for baseline cost monitoring before release:
- token_usage_per_run is tracked per model;
- llm_cost_per_run is tracked;
- tool_cost_per_run is tracked for every paid tool;
- total_cost_per_run is recorded as a histogram;
- cost_p95 is visible on the dashboard;
- budget_burn_rate is computed on dashboard level;
- metrics are segmented by release, model, tool, and request type;
- labels contain no high-cardinality fields (run_id, request_id, user_id);
- alerts exist for budget_burn_rate and cost_p95.
FAQ
Q: How is cost monitoring different from token monitoring?
A: Tokens are only one part of spend. Cost monitoring includes LLM tokens, paid tool/API calls, and total run unit cost.
Q: What is the minimum cost-metric set to start with?
A: Start with token_usage_per_run, total_cost_per_run, cost_p95, and budget_burn_rate.
Q: How to compute cost_per_run with multiple providers?
A: Normalize all step costs to one currency (usually USD) and sum LLM + tool costs inside one run.
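A minimal sketch of that normalization (the exchange rates here are placeholders; in production they would come from a billing export or FX source):

```python
# Placeholder USD exchange rates; load real rates from a billing/FX source.
TO_USD = {"USD": 1.0, "EUR": 1.08, "JPY": 0.0067}

def total_cost_usd(step_costs):
    """Normalize per-step (amount, currency) costs to USD and sum them per run."""
    return sum(amount * TO_USD[currency] for amount, currency in step_costs)

# One run mixing an LLM call billed in USD, a tool billed in EUR, one in JPY.
steps = [(0.012, "USD"), (0.005, "EUR"), (2.0, "JPY")]
print(round(total_cost_usd(steps), 6))
```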
Q: How to separate traffic growth from unit-cost regression?
A: Check run_count and cost_per_run together. If traffic is stable but cost_per_run grows, that is unit-cost regression.
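That check can be automated with a simple comparison between two time windows. The function name and the 10% threshold below are illustrative assumptions:

```python
def classify_growth(prev_runs, prev_total_cost, cur_runs, cur_total_cost, threshold=0.10):
    """Distinguish traffic growth from unit-cost regression between two windows."""
    traffic_delta = cur_runs / prev_runs - 1
    unit_prev = prev_total_cost / prev_runs
    unit_cur = cur_total_cost / cur_runs
    unit_delta = unit_cur / unit_prev - 1
    if unit_delta > threshold:
        return "unit-cost regression"
    if traffic_delta > threshold:
        return "traffic growth"
    return "stable"

# Spend grew ~50% while traffic stayed flat: the unit cost regressed.
print(classify_growth(10_000, 250.0, 10_100, 380.0))
```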
Related Pages
Next on this topic:
- Agent Metrics β overall metrics model for agent systems.
- Tool Usage Metrics β cost and reliability control on tool level.
- Agent Tracing β how to find expensive steps inside one run.
- Agent Logging β events for cost-incident analysis.
- Agent Latency Monitoring β relation between latency and cost.