Agent Latency Monitoring

Latency monitoring for AI agents: measure p50/p95/p99 across model, tools, and orchestration steps to control user-facing delays.
On this page
  1. Idea In 30 Seconds
  2. Core Problem
  3. How It Works
  4. Typical Production Latency Metrics
  5. How To Read The Latency Layer
  6. When To Use
  7. Implementation Example
  8. Investigation
  9. Common Mistakes
     • Only average latency is tracked
     • No latency breakdown by tools and step type
     • Queue time is ignored
     • No timeout-rate and retry-overhead metrics
     • No alerts for p95/p99 and timeout spikes
  10. Self-Check
  11. FAQ

Idea In 30 Seconds

Latency monitoring for AI agents shows where the system slows down: in LLM steps, tools, queues, or repeated iterations.

Without it, it is hard to understand why users wait longer, even when the run technically finishes successfully.

Core Problem

One run can complete correctly but still be too slow for production.

Two requests with the same answer can have different latency because of a longer reasoning chain, a slow tool, or retries. Without latency monitoring, this is usually visible only after user complaints.

Next, we break down how to read these signals and find exactly what slows a run.

In production this often looks like:

  • average latency looks fine, but p95 is already growing;
  • one tool quietly becomes a bottleneck;
  • retries add seconds without visible traffic growth;
  • the team sees the issue only after a partial outage.

That is why the latency layer should be monitored separately, not only through general run metrics.
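The average-vs-p95 gap above is easy to reproduce. This standalone sketch (synthetic latency samples, stdlib only) shows how a small slow tail barely moves the mean but dominates p95:

```python
import statistics

def latency_summary(samples_ms):
    """Return mean / p50 / p95 for a list of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    # quantiles(n=100) yields 99 cut points; index 49 -> p50, index 94 -> p95
    cuts = statistics.quantiles(ordered, n=100)
    return {
        "mean": statistics.fmean(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
    }

# 95 fast runs plus 5 tool-bound slow runs: the mean stays under half a
# second, while p95 lands in the multi-second tail users actually feel.
samples = [200] * 95 + [5000] * 5
print(latency_summary(samples))
```

This is why the "average looks fine" state can coexist with a growing p95 for weeks.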

How It Works

Latency monitoring is built around two types of signals:

  • runtime signals (queue_time, step_latency, tool_latency, ttft);
  • service signals (run_latency_p50/p95/p99, timeout_rate, retry_overhead_ms).

These metrics answer "where and why the system slows down over time". Logs and tracing are needed to explain a concrete slow run.

Latency != user experience. Users do not feel the average; they feel p95/p99. The slowest requests define how the system is perceived.

Latency is also tightly linked to cost: longer runs mean more tokens, more tool calls, and more retries.

Typical Production Latency Metrics

Metric                | What it shows                      | Why it matters
run_latency_p50       | typical run execution time         | baseline speed control
run_latency_p95 / p99 | long-tail and slowest runs         | early degradation detection
step_latency_p95      | which agent steps slow down        | localization of the problematic stage
tool_latency_p95      | latency of concrete tools          | finding external bottlenecks
ttft_p95              | time-to-first-token for LLM        | control of first-response speed
queue_time_p95        | how long a run waits before start  | load and capacity control
timeout_rate          | share of steps ending with timeout | early instability signal
retry_overhead_ms     | how much time retries add          | impact of recovery on latency

run_latency_p95 and run_latency_p99 are usually calculated at the dashboard/query level, not as separate counters in code.

To keep metrics practical, they are usually segmented by release, model, tool, and workflow type.

Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metric storage will overload quickly.
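One way to enforce this is a small guard in front of the metrics layer. A minimal sketch, assuming a project-specific label allowlist (ALLOWED_LABELS and safe_labels are hypothetical names, not a Prometheus API):

```python
# Hypothetical guard: reject high-cardinality or unknown labels before
# they reach the metrics backend and explode storage.
ALLOWED_LABELS = {
    "release", "model", "tool", "workflow",
    "step_type", "layer", "status", "stop_reason",
}

def safe_labels(labels: dict) -> dict:
    """Raise on labels outside the allowlist (run_id, user_id, etc.)."""
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"high-cardinality or unknown labels rejected: {sorted(unknown)}")
    return labels

safe_labels({"release": "2026-03-22", "tool": "search"})  # passes
# safe_labels({"run_id": "abc-123"})  # would raise ValueError
```

Per-run identifiers still belong in logs and traces, where cardinality is expected; the guard only keeps them out of metric labels.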

How To Read The Latency Layer

Where delay appears -> how the agent behaves -> what exactly slows a run: these are three levels you should always read together.

Focus on time trends and release-to-release differences, not isolated values.

Now look at signal combinations:

  • run_latency_p95 up + tool_latency_p95 up -> bottleneck in external tools;
  • run_latency_p95 up + step_count up -> agent is doing extra iterations;
  • ttft_p95 up + tool_latency_p95 ~= stable -> issue in LLM layer, not tools;
  • timeout_rate up + retry_overhead_ms up -> retries mask instability and add latency;
  • queue_time_p95 up + run_count up -> system lacks capacity.
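The combination table above can be encoded as a first-pass triage rule. A simplified sketch, assuming boolean "did this metric grow release-over-release?" inputs (the function name and thresholds are illustrative, not a standard API):

```python
# Encode the signal combinations as ordered triage rules.
# Each input answers: "did this metric grow release-over-release?"
def triage(run_p95_up=False, tool_p95_up=False, step_count_up=False,
           ttft_p95_up=False, timeout_rate_up=False,
           retry_overhead_up=False, queue_p95_up=False, run_count_up=False):
    if run_p95_up and tool_p95_up:
        return "bottleneck in external tools"
    if run_p95_up and step_count_up:
        return "agent is doing extra iterations"
    if ttft_p95_up and not tool_p95_up:
        return "issue in LLM layer, not tools"
    if timeout_rate_up and retry_overhead_up:
        return "retries mask instability and add latency"
    if queue_p95_up and run_count_up:
        return "system lacks capacity"
    return "no known pattern; inspect traces"

print(triage(run_p95_up=True, tool_p95_up=True))
```

Such rules do not replace tracing; they only pick which dashboard segment to open first.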

When To Use

Full latency monitoring is not always required.

For a simple prototype, basic response time can be enough.

But detailed latency monitoring becomes critical when:

  • the system is already in production and has speed SLO/SLA;
  • the agent uses several external tools and dependencies;
  • releases are frequent and latency regressions must be visible;
  • workflows include queues, retries, or multi-step reasoning loops.

Implementation Example

Below is a simplified example of Prometheus-style latency instrumentation. It covers the baseline controls: run latency, steps, tools, timeouts, and retry overhead.

PYTHON
import time
from prometheus_client import Counter, Histogram

# perf_counter() is used instead of time.time()
# to get precise monotonic latency measurements
RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000, 20000),
)

STEP_LATENCY_MS = Histogram(
    "agent_step_latency_ms",
    "Latency by step type in milliseconds",
    ["step_type", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

QUEUE_TIME_MS = Histogram(
    "agent_queue_time_ms",
    "Queue wait time before run start",
    ["release"],
    buckets=(0, 20, 50, 100, 250, 500, 1000, 2000),
)

TTFT_MS = Histogram(
    "agent_ttft_ms",
    "Time to first token in milliseconds",
    ["model", "release"],
    buckets=(50, 100, 200, 400, 800, 1500, 3000),
)

TIMEOUT_TOTAL = Counter(
    "agent_timeout_total",
    "Total timeout errors by layer",
    ["layer", "release"],
)

RETRY_OVERHEAD_MS = Histogram(
    "agent_retry_overhead_ms",
    "Added latency from retries",
    ["release"],
    buckets=(0, 50, 100, 250, 500, 1000, 2000, 5000),
)


def run_agent(agent, task, queue_time_ms=0, release="2026-03-22"):
    run_status = "ok"
    stop_reason = "max_steps"
    started_at = time.perf_counter()

    if queue_time_ms > 0:
        QUEUE_TIME_MS.labels(release=release).observe(queue_time_ms)

    try:
        for step in agent.iter(task):
            step_type = step.type
            step_started_at = time.perf_counter()

            try:
                result = step.execute()
            except TimeoutError:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_timeout"
                elif step_type == "llm_generate":
                    stop_reason = "llm_timeout"
                else:
                    stop_reason = "step_timeout"
                layer = (
                    "tool"
                    if step_type == "tool_call"
                    else "llm"
                    if step_type == "llm_generate"
                    else "runtime"
                )
                TIMEOUT_TOTAL.labels(layer=layer, release=release).inc()
                raise
            except Exception:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"
                raise
            finally:
                step_latency_ms = (time.perf_counter() - step_started_at) * 1000
                STEP_LATENCY_MS.labels(step_type=step_type, release=release).observe(step_latency_ms)
                if step_type == "tool_call":
                    TOOL_LATENCY_MS.labels(
                        tool=getattr(step, "tool_name", "unknown"),
                        release=release,
                    ).observe(step_latency_ms)
                # retry overhead can exist even if the step failed
                retry_overhead_ms = float(getattr(step, "retry_overhead_ms", 0) or 0)
                if retry_overhead_ms > 0:
                    RETRY_OVERHEAD_MS.labels(release=release).observe(retry_overhead_ms)

            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                ttft_ms = float(getattr(result, "ttft_ms", 0) or 0)
                if ttft_ms > 0:
                    TTFT_MS.labels(model=model, release=release).observe(ttft_ms)

            if result and result.is_final:
                stop_reason = "completed"
                break
    finally:
        run_latency_ms = (time.perf_counter() - started_at) * 1000
        RUN_LATENCY_MS.labels(release=release).observe(run_latency_ms)
        RUN_TOTAL.labels(status=run_status, stop_reason=stop_reason, release=release).inc()

# run_latency_p95 and run_latency_p99 are usually computed on dashboard level:
# histogram_quantile(...) over agent_run_latency_ms bucket metrics.

Here is how these metrics can look together on a real dashboard:

Segment            | p50 latency | p95 latency | timeout_rate | Status
gpt-4.1 + tools    | 1.1s        | 4.8s        | 2.9%         | critical: SLO risk
mini-model + cache | 420ms       | 1.2s        | 0.4%         | ok
research workflow  | 1.7s        | 6.1s        | 1.8%         | warning: p95 growing

Investigation

When a latency alert fires:

  1. find the anomalous segment (release, tool, model);
  2. inspect slow runs in tracing;
  3. check retries, timeouts, and stop_reason in logs;
  4. find the root cause (tool, LLM, queue, agent logic, external service).
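Step 1 can be sketched in code: compare per-segment p95 between two releases and surface the segment that moved most. The data shapes are assumptions; in practice the samples would come from your metrics backend or traces:

```python
import statistics

def p95(samples_ms):
    """p95 of a list of latency samples in milliseconds."""
    return statistics.quantiles(sorted(samples_ms), n=100)[94]

def worst_segment(before: dict, after: dict) -> str:
    """before/after map segment -> [latency_ms, ...] for two releases.
    Returns the segment with the largest p95 regression."""
    deltas = {
        seg: p95(after[seg]) - p95(before[seg])
        for seg in before
        if seg in after
    }
    return max(deltas, key=deltas.get)

# Hypothetical data: the search tool grew a slow tail, the LLM barely moved.
before = {"search_tool": [300] * 100, "llm": [800] * 100}
after = {"search_tool": [300] * 90 + [4000] * 10, "llm": [820] * 100}
print(worst_segment(before, after))
```

The returned segment is the one to open first in tracing (step 2).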

Common Mistakes

Even when latency metrics are added, they often fail because of the common mistakes below.

Only average latency is tracked

Average latency often hides degradation. For production, the minimum is p50 and p95; for critical flows, also track p99.

No latency breakdown by tools and step type

Without this breakdown, it is hard to know what is slow: the LLM, a tool, or the agent loop itself, and a failing tool is hard to localize quickly.

Queue time is ignored

A run can be slow before execution even starts. Without queue_time_p95, capacity issues are easy to miss.

No timeout-rate and retry-overhead metrics

Retries can mask instability and artificially inflate latency. This often goes hand in hand with tool-call spam.
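Retry overhead is easy to measure at the call site: it is the total wall-clock time minus the successful attempt itself. A minimal sketch, where flaky() is a hypothetical stand-in for a real tool call:

```python
import time

def call_with_retries(fn, attempts=3):
    """Run fn with retries; return (result, retry_overhead_ms).
    Overhead = total wall-clock time minus the successful attempt."""
    started = time.perf_counter()
    for i in range(attempts):
        attempt_start = time.perf_counter()
        try:
            result = fn()
        except Exception:
            if i == attempts - 1:
                raise
            continue
        success_ms = (time.perf_counter() - attempt_start) * 1000
        total_ms = (time.perf_counter() - started) * 1000
        return result, max(total_ms - success_ms, 0.0)

calls = {"n": 0}

def flaky():
    """Hypothetical tool call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        time.sleep(0.01)  # each failed attempt burns ~10 ms
        raise RuntimeError("transient")
    return "ok"

result, overhead_ms = call_with_retries(flaky)
```

Feeding overhead_ms into a histogram like RETRY_OVERHEAD_MS makes this hidden cost visible instead of silently inflating run latency.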

No alerts for p95/p99 and timeout spikes

Without alerts, the team learns about issues too late, when SLO is already violated.

Self-Check

Below is a short pre-release checklist for baseline latency monitoring:

  • run_latency_p50 and p95 are tracked (p99 for critical flows);
  • step latency is broken down by step type;
  • tool latency is tracked per tool;
  • ttft is tracked for LLM steps;
  • queue_time_p95 is tracked;
  • timeout_rate is tracked;
  • retry_overhead_ms is tracked;
  • metrics are segmented by release, model, tool, and workflow type, with no high-cardinality labels (run_id, user_id);
  • alerts exist for p95/p99 growth and timeout spikes.

FAQ

Q: How is latency monitoring different from regular API speed monitoring?
A: For agents, you must monitor not only total response time but also internal steps: reasoning, tools, retries, and queue time.

Q: What is the minimum latency-metric set to start with?
A: Start with run_latency_p50/p95, tool_latency_p95, timeout_rate, and queue_time_p95.

Q: Why is p95 more important than average latency?
A: Because p95 shows what happens with slow requests, which users notice most.

Q: How to separate tool-latency issues from LLM-latency issues?
A: Compare tool_latency_p95 and ttft_p95: if only tool latency grows, the bottleneck is in tools; if ttft grows, the issue is in the LLM layer.


Author

Nick - engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.