Idea In 30 Seconds
Latency monitoring for AI agents shows where the system slows down: in LLM steps, tools, queues, or repeated iterations.
Without it, it is hard to understand why users wait longer, even when runs formally complete successfully.
Core Problem
One run can complete correctly but still be too slow for production.
Two requests with the same answer can have different latency because of a longer reasoning chain, a slow tool, or retries. Without latency monitoring, this is usually visible only after user complaints.
Next, we break down how to read these signals and pinpoint exactly what slows a run down.
In production this often looks like:
- average latency looks fine, but p95 is already growing;
- one tool quietly becomes a bottleneck;
- retries add seconds without visible traffic growth;
- the team sees the issue only after a partial outage.
That is why the latency layer should be monitored separately, not only through general run metrics.
How It Works
Latency monitoring is built around two types of signals:
- runtime signals (queue_time, step_latency, tool_latency, ttft);
- service signals (run_latency_p50/p95/p99, timeout_rate, retry_overhead_ms).
These metrics answer "where and why the system slows down over time". Logs and tracing are needed to explain a concrete slow run.
Latency != user experience. Users do not feel the average; they feel p95/p99. The slowest requests define how the system is perceived.
Latency is also often directly linked to cost: longer runs mean more tokens, more tool calls, and more retries, all of which increase spend.
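To see why the average misleads, here is a small self-contained sketch with made-up numbers: ninety fast runs plus a slow tail keep the average looking acceptable while p95 sits entirely on the tail. The percentile helper uses a simplified rounded-rank method for illustration; production systems compute quantiles from histogram buckets instead.

```python
# Hypothetical latency sample (ms): 90 fast runs plus a slow tail of 10.
latencies_ms = [120] * 90 + [4000] * 10

def percentile(values, p):
    # Simplified rounded-rank percentile, enough to show the average-vs-tail gap.
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)   # 508.0 ms -- looks "fine"
p50 = percentile(latencies_ms, 50)            # 120 ms
p95 = percentile(latencies_ms, 95)            # 4000 ms -- what slow users feel
```

The average (508 ms) sits far below what every user in the tail actually experiences.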
Typical Production Latency Metrics
| Metric | What it shows | Why it matters |
|---|---|---|
| run_latency_p50 | typical run execution time | baseline speed control |
| run_latency_p95 / p99 | long-tail and slowest runs | early degradation detection |
| step_latency_p95 | which agent steps slow down | localization of the problematic stage |
| tool_latency_p95 | latency of concrete tools | finding external bottlenecks |
| ttft_p95 | time-to-first-token for LLM | control of first-response speed |
| queue_time_p95 | how long a run waits before start | load and capacity control |
| timeout_rate | share of steps ending with timeout | early instability signal |
| retry_overhead_ms | how much time retries add | impact of recovery on latency |
run_latency_p95 and run_latency_p99 are usually calculated on dashboard/query level, not as separate counters in code.
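For example, in Prometheus/Grafana this is typically a histogram_quantile query over the histogram's buckets. The query below assumes a histogram named agent_run_latency_ms with a release label, as in the instrumentation example later on this page:

```promql
# p95 run latency per release over a 5-minute window
histogram_quantile(
  0.95,
  sum by (release, le) (rate(agent_run_latency_ms_bucket[5m]))
)
```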
To keep metrics practical, they are usually segmented by release, model, tool, and workflow type.
Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metric storage will overload quickly.
How To Read The Latency Layer
Read the latency layer on three levels together: where the delay appears -> how the agent behaves -> what exactly slows a run.
Focus on time trends and release-to-release differences, not isolated values.
Now look at signal combinations:
- run_latency_p95 up + tool_latency_p95 up -> bottleneck in external tools;
- run_latency_p95 up + step_count up -> agent is doing extra iterations;
- ttft_p95 up + tool_latency_p95 stable -> issue in the LLM layer, not tools;
- timeout_rate up + retry_overhead_ms up -> retries mask instability and add latency;
- queue_time_p95 up + run_count up -> system lacks capacity.
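These combinations can be sketched as a simple triage helper. The metric names match this page; the triage function itself and the 20% growth threshold are illustrative assumptions, not part of any standard API:

```python
def triage(deltas):
    """Map week-over-week metric changes (fractions, e.g. 0.3 = +30%)
    to a likely bottleneck, following the signal combinations above."""
    def up(name):
        return deltas.get(name, 0.0) > 0.2  # illustrative growth threshold

    if up("run_latency_p95") and up("tool_latency_p95"):
        return "external tools"
    if up("run_latency_p95") and up("step_count"):
        return "extra agent iterations"
    if up("ttft_p95") and not up("tool_latency_p95"):
        return "LLM layer"
    if up("timeout_rate") and up("retry_overhead_ms"):
        return "retries masking instability"
    if up("queue_time_p95") and up("run_count"):
        return "insufficient capacity"
    return "no clear pattern"
```

In practice such rules belong in a runbook or dashboard annotations rather than code, but the mapping is the same.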
When To Use
Full latency monitoring is not always required.
For a simple prototype, basic response time can be enough.
But detailed latency monitoring becomes critical when:
- the system is already in production and has speed SLO/SLA;
- the agent uses several external tools and dependencies;
- releases are frequent and latency regressions must be visible;
- workflows include queues, retries, or multi-step reasoning loops.
Implementation Example
Below is a simplified Prometheus-style latency metrics instrumentation example. The example shows baseline control: run latency, steps, tools, timeouts, and retry overhead.
```python
import time

from prometheus_client import Counter, Histogram

# perf_counter() is used instead of time.time()
# to get precise monotonic latency measurements

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000, 20000),
)

STEP_LATENCY_MS = Histogram(
    "agent_step_latency_ms",
    "Latency by step type in milliseconds",
    ["step_type", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

QUEUE_TIME_MS = Histogram(
    "agent_queue_time_ms",
    "Queue wait time before run start",
    ["release"],
    buckets=(0, 20, 50, 100, 250, 500, 1000, 2000),
)

TTFT_MS = Histogram(
    "agent_ttft_ms",
    "Time to first token in milliseconds",
    ["model", "release"],
    buckets=(50, 100, 200, 400, 800, 1500, 3000),
)

TIMEOUT_TOTAL = Counter(
    "agent_timeout_total",
    "Total timeout errors by layer",
    ["layer", "release"],
)

RETRY_OVERHEAD_MS = Histogram(
    "agent_retry_overhead_ms",
    "Added latency from retries",
    ["release"],
    buckets=(0, 50, 100, 250, 500, 1000, 2000, 5000),
)


def run_agent(agent, task, queue_time_ms=0, release="2026-03-22"):
    run_status = "ok"
    stop_reason = "max_steps"
    started_at = time.perf_counter()

    if queue_time_ms > 0:
        QUEUE_TIME_MS.labels(release=release).observe(queue_time_ms)

    try:
        for step in agent.iter(task):
            step_type = step.type
            step_started_at = time.perf_counter()
            try:
                result = step.execute()
            except TimeoutError:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_timeout"
                elif step_type == "llm_generate":
                    stop_reason = "llm_timeout"
                else:
                    stop_reason = "step_timeout"
                layer = (
                    "tool"
                    if step_type == "tool_call"
                    else "llm"
                    if step_type == "llm_generate"
                    else "runtime"
                )
                TIMEOUT_TOTAL.labels(layer=layer, release=release).inc()
                raise
            except Exception:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"
                raise
            finally:
                step_latency_ms = (time.perf_counter() - step_started_at) * 1000
                STEP_LATENCY_MS.labels(
                    step_type=step_type, release=release
                ).observe(step_latency_ms)
                if step_type == "tool_call":
                    TOOL_LATENCY_MS.labels(
                        tool=getattr(step, "tool_name", "unknown"),
                        release=release,
                    ).observe(step_latency_ms)
                # retry overhead can exist even if the step failed
                retry_overhead_ms = float(getattr(step, "retry_overhead_ms", 0) or 0)
                if retry_overhead_ms > 0:
                    RETRY_OVERHEAD_MS.labels(release=release).observe(retry_overhead_ms)

            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                ttft_ms = float(getattr(result, "ttft_ms", 0) or 0)
                if ttft_ms > 0:
                    TTFT_MS.labels(model=model, release=release).observe(ttft_ms)

            if result and result.is_final:
                stop_reason = "completed"
                break
    finally:
        run_latency_ms = (time.perf_counter() - started_at) * 1000
        RUN_LATENCY_MS.labels(release=release).observe(run_latency_ms)
        RUN_TOTAL.labels(
            status=run_status, stop_reason=stop_reason, release=release
        ).inc()

# run_latency_p95 and run_latency_p99 are usually computed on dashboard level:
# histogram_quantile(...) over agent_run_latency_ms bucket metrics.
```
Here is how these metrics can look together on a real dashboard:
| Segment | p50 latency | p95 latency | timeout_rate | Status |
|---|---|---|---|---|
| gpt-4.1 + tools | 1.1s | 4.8s | 2.9% | critical: SLO risk |
| mini-model + cache | 420ms | 1.2s | 0.4% | ok |
| research workflow | 1.7s | 6.1s | 1.8% | warning: p95 growing |
Investigation
When a latency alert fires:
- find the anomalous segment (release, tool, model);
- inspect slow runs in tracing;
- check retries, timeout, and stop_reason in logs;
- find root cause (tool, LLM, queue, agent logic, external service).
Common Mistakes
Even when latency metrics are in place, they often fail to help because of the common mistakes below.
Only average latency is tracked
Average latency often hides degradation.
For production, the minimum is p50 and p95; for critical flows, also track p99.
No latency breakdown by tools and step type
Without this breakdown, it is hard to tell what is slow: the LLM, a tool, or the agent loop itself. Tool failures also take much longer to localize.
Queue time is ignored
A run can be slow before execution even starts.
Without queue_time_p95, capacity issues are easy to miss.
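One minimal way to capture queue time is to stamp tasks when they are enqueued and measure the wait at dequeue, then pass it to the run instrumentation. The helpers below are an illustrative sketch over a plain list, not part of any queue library:

```python
import time

def enqueue(task_queue, task):
    # Record the enqueue time so queue wait can be measured at run start.
    task_queue.append((task, time.perf_counter()))

def dequeue(task_queue):
    # Returns the task plus how long it waited, ready to feed into
    # queue_time_ms instrumentation (e.g. run_agent's queue_time_ms argument).
    task, enqueued_at = task_queue.pop(0)
    queue_time_ms = (time.perf_counter() - enqueued_at) * 1000
    return task, queue_time_ms
```

Real systems usually carry the enqueue timestamp in the task payload or message headers, but the measurement principle is the same.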
No timeout-rate and retry-overhead metrics
Retries can mask instability and artificially inflate latency. This often comes together with tool spam.
No alerts for p95/p99 and timeout spikes
Without alerts, the team learns about issues too late, when SLO is already violated.
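As a sketch, alerting on these signals in Prometheus might look like the rules below. Metric names follow the instrumentation example above; the thresholds and durations are illustrative placeholders that should be tuned to your SLO:

```yaml
groups:
  - name: agent-latency
    rules:
      - alert: AgentRunLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (release, le) (rate(agent_run_latency_ms_bucket[5m]))
          ) > 5000
        for: 10m
        labels:
          severity: warning
      - alert: AgentTimeoutSpike
        expr: |
          sum(rate(agent_timeout_total[5m]))
            / sum(rate(agent_run_total[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
```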
Self-Check
Below is a short checklist for baseline latency monitoring before release:
- run_latency_p50/p95 are tracked (p99 for critical flows);
- step_latency_p95 is broken down by step type;
- tool_latency_p95 is tracked per tool;
- ttft_p95 is tracked per model;
- queue_time_p95 is tracked;
- timeout_rate is tracked per layer;
- retry_overhead_ms is tracked;
- metrics are segmented by release, model, tool, and workflow type;
- alerts exist for p95/p99 growth and timeout spikes.
FAQ
Q: How is latency monitoring different from regular API speed monitoring?
A: For agents, you must monitor not only total response time but also internal steps: reasoning, tools, retries, and queue time.
Q: What is the minimum latency-metric set to start with?
A: Start with run_latency_p50/p95, tool_latency_p95, timeout_rate, and queue_time_p95.
Q: Why is p95 more important than average latency?
A: Because p95 shows what happens with slow requests, which users notice most.
Q: How to separate tool-latency issues from LLM-latency issues?
A: Compare tool_latency_p95 and ttft_p95: if only tool latency grows, the bottleneck is in the tools; if ttft grows, the issue is in the LLM layer.
Related Pages
Next on this topic:
- Agent Metrics β overall metrics model for agent systems.
- Tool Usage Metrics β how to isolate latency at tool level.
- Agent Cost Monitoring β how latency is connected to cost.
- Agent Tracing β how to find the slow step in a concrete run.
- Alerting In AI Agents β how to build early notifications.