Idea In 30 Seconds
Latency monitoring for AI agents shows where the system slows down: in LLM steps, tools, queues, or repeated iterations.
Without it, it is hard to understand why users wait longer, even when runs formally complete successfully.
Core Problem
One run can complete correctly but still be too slow for production.
Two requests with the same answer can have different latency because of a longer reasoning chain, a slow tool, or retries. Without latency monitoring, this is usually visible only after user complaints.
Next, we break down how to read these signals and pinpoint exactly what slows a run down.
In production this often looks like:
- average latency looks fine, but p95 is already growing;
- one tool quietly becomes a bottleneck;
- retries add seconds without visible traffic growth;
- the team sees the issue only after a partial outage.
That is why the latency layer should be monitored separately, not only through general run metrics.
How It Works
Latency monitoring is built around two types of signals:
- runtime signals (queue_time, step_latency, tool_latency, ttft);
- service signals (run_latency_p50/p95/p99, timeout_rate, retry_overhead_ms).
These metrics answer "where and why the system slows down over time". Logs and tracing are needed to explain a concrete slow run.
Latency != user experience. Users do not feel the average; they feel p95/p99. The slowest requests define how the system is perceived.
Latency is also often directly linked to cost: longer runs mean more tokens, more tool calls, and more retries, all of which increase spend.
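To see why the average misleads, here is a small self-contained sketch with made-up numbers: ninety fast runs plus a slow tail keep the average looking acceptable while p95 sits entirely on the tail. The percentile helper uses a simplified rounded-rank method for illustration; production systems compute quantiles from histogram buckets instead.

```python
# Hypothetical latency sample (ms): 90 fast runs plus a slow tail of 10.
latencies_ms = [120] * 90 + [4000] * 10

def percentile(values, p):
    # Simplified rounded-rank percentile, enough to show the average-vs-tail gap.
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)   # 508.0 ms -- looks "fine"
p50 = percentile(latencies_ms, 50)            # 120 ms
p95 = percentile(latencies_ms, 95)            # 4000 ms -- what slow users feel
```

The average (508 ms) sits far below what every user in the tail actually experiences.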
Typical Production Latency Metrics
| Metric | What it shows | Why it matters |
|---|---|---|
| run_latency_p50 | typical run execution time | baseline speed control |
| run_latency_p95 / p99 | long-tail and slowest runs | early degradation detection |
| step_latency_p95 | which agent steps slow down | localization of the problematic stage |
| tool_latency_p95 | latency of concrete tools | finding external bottlenecks |
| ttft_p95 | time-to-first-token for LLM | control of first-response speed |
| queue_time_p95 | how long a run waits before start | load and capacity control |
| timeout_rate | share of steps ending with timeout | early instability signal |
| retry_overhead_ms | how much time retries add | impact of recovery on latency |
run_latency_p95 and run_latency_p99 are usually calculated on dashboard/query level, not as separate counters in code.
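For example, in Prometheus/Grafana this is typically a histogram_quantile query over the histogram's buckets. The query below assumes a histogram named agent_run_latency_ms with a release label, as in the instrumentation example later on this page:

```promql
# p95 run latency per release over a 5-minute window
histogram_quantile(
  0.95,
  sum by (release, le) (rate(agent_run_latency_ms_bucket[5m]))
)
```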
To keep metrics practical, they are usually segmented by release, model, tool, and workflow type.
Important: do not add high-cardinality fields (run_id, request_id, user_id) to labels, or metric storage will overload quickly.
How To Read The Latency Layer
Read the latency layer on three levels together: where the delay appears -> how the agent behaves -> what exactly slows a run.
Focus on time trends and release-to-release differences, not isolated values.
Now look at signal combinations:
- run_latency_p95 up + tool_latency_p95 up -> bottleneck in external tools;
- run_latency_p95 up + step_count up -> agent is doing extra iterations;
- ttft_p95 up + tool_latency_p95 stable -> issue in the LLM layer, not tools;
- timeout_rate up + retry_overhead_ms up -> retries mask instability and add latency;
- queue_time_p95 up + run_count up -> system lacks capacity.
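These combinations can be sketched as a simple triage helper. The metric names match this page; the triage function itself and the 20% growth threshold are illustrative assumptions, not part of any standard API:

```python
def triage(deltas):
    """Map week-over-week metric changes (fractions, e.g. 0.3 = +30%)
    to a likely bottleneck, following the signal combinations above."""
    def up(name):
        return deltas.get(name, 0.0) > 0.2  # illustrative growth threshold

    if up("run_latency_p95") and up("tool_latency_p95"):
        return "external tools"
    if up("run_latency_p95") and up("step_count"):
        return "extra agent iterations"
    if up("ttft_p95") and not up("tool_latency_p95"):
        return "LLM layer"
    if up("timeout_rate") and up("retry_overhead_ms"):
        return "retries masking instability"
    if up("queue_time_p95") and up("run_count"):
        return "insufficient capacity"
    return "no clear pattern"
```

In practice such rules belong in a runbook or dashboard annotations rather than code, but the mapping is the same.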
When To Use
Full latency monitoring is not always required.
For a simple prototype, basic response time can be enough.
But detailed latency monitoring becomes critical when:
- the system is already in production and has speed SLO/SLA;
- the agent uses several external tools and dependencies;
- releases are frequent and latency regressions must be visible;
- workflows include queues, retries, or multi-step reasoning loops.
Implementation Example
Below is a simplified Prometheus-style latency metrics instrumentation example. The example shows baseline control: run latency, steps, tools, timeouts, and retry overhead.
```python
import time

from prometheus_client import Counter, Histogram

# perf_counter() is used instead of time.time()
# to get precise monotonic latency measurements

RUN_TOTAL = Counter(
    "agent_run_total",
    "Total number of agent runs",
    ["status", "stop_reason", "release"],
)

RUN_LATENCY_MS = Histogram(
    "agent_run_latency_ms",
    "Run latency in milliseconds",
    ["release"],
    buckets=(100, 250, 500, 1000, 2000, 5000, 10000, 20000),
)

STEP_LATENCY_MS = Histogram(
    "agent_step_latency_ms",
    "Latency by step type in milliseconds",
    ["step_type", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

TOOL_LATENCY_MS = Histogram(
    "agent_tool_latency_ms",
    "Tool latency in milliseconds",
    ["tool", "release"],
    buckets=(20, 50, 100, 250, 500, 1000, 2000, 5000),
)

QUEUE_TIME_MS = Histogram(
    "agent_queue_time_ms",
    "Queue wait time before run start",
    ["release"],
    buckets=(0, 20, 50, 100, 250, 500, 1000, 2000),
)

TTFT_MS = Histogram(
    "agent_ttft_ms",
    "Time to first token in milliseconds",
    ["model", "release"],
    buckets=(50, 100, 200, 400, 800, 1500, 3000),
)

TIMEOUT_TOTAL = Counter(
    "agent_timeout_total",
    "Total timeout errors by layer",
    ["layer", "release"],
)

RETRY_OVERHEAD_MS = Histogram(
    "agent_retry_overhead_ms",
    "Added latency from retries",
    ["release"],
    buckets=(0, 50, 100, 250, 500, 1000, 2000, 5000),
)


def run_agent(agent, task, queue_time_ms=0, release="2026-03-22"):
    run_status = "ok"
    stop_reason = "max_steps"
    started_at = time.perf_counter()

    if queue_time_ms > 0:
        QUEUE_TIME_MS.labels(release=release).observe(queue_time_ms)

    try:
        for step in agent.iter(task):
            step_type = step.type
            step_started_at = time.perf_counter()
            try:
                result = step.execute()
            except TimeoutError:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_timeout"
                elif step_type == "llm_generate":
                    stop_reason = "llm_timeout"
                else:
                    stop_reason = "step_timeout"
                layer = (
                    "tool"
                    if step_type == "tool_call"
                    else "llm"
                    if step_type == "llm_generate"
                    else "runtime"
                )
                TIMEOUT_TOTAL.labels(layer=layer, release=release).inc()
                raise
            except Exception:
                run_status = "error"
                if step_type == "tool_call":
                    stop_reason = "tool_error"
                elif step_type == "llm_generate":
                    stop_reason = "llm_error"
                else:
                    stop_reason = "step_error"
                raise
            finally:
                step_latency_ms = (time.perf_counter() - step_started_at) * 1000
                STEP_LATENCY_MS.labels(
                    step_type=step_type, release=release
                ).observe(step_latency_ms)
                if step_type == "tool_call":
                    TOOL_LATENCY_MS.labels(
                        tool=getattr(step, "tool_name", "unknown"),
                        release=release,
                    ).observe(step_latency_ms)
                # retry overhead can exist even if the step failed
                retry_overhead_ms = float(getattr(step, "retry_overhead_ms", 0) or 0)
                if retry_overhead_ms > 0:
                    RETRY_OVERHEAD_MS.labels(release=release).observe(retry_overhead_ms)

            if step_type == "llm_generate":
                model = getattr(step, "model", "unknown")
                ttft_ms = float(getattr(result, "ttft_ms", 0) or 0)
                if ttft_ms > 0:
                    TTFT_MS.labels(model=model, release=release).observe(ttft_ms)

            if result and result.is_final:
                stop_reason = "completed"
                break
    finally:
        run_latency_ms = (time.perf_counter() - started_at) * 1000
        RUN_LATENCY_MS.labels(release=release).observe(run_latency_ms)
        RUN_TOTAL.labels(
            status=run_status, stop_reason=stop_reason, release=release
        ).inc()

# run_latency_p95 and run_latency_p99 are usually computed on dashboard level:
# histogram_quantile(...) over agent_run_latency_ms bucket metrics.
```
Here is how these metrics can look together on a real dashboard:
| Segment | p50 latency | p95 latency | timeout_rate | Status |
|---|---|---|---|---|
| gpt-4.1 + tools | 1.1s | 4.8s | 2.9% | critical: SLO risk |
| mini-model + cache | 420ms | 1.2s | 0.4% | ok |
| research workflow | 1.7s | 6.1s | 1.8% | warning: p95 growing |
Investigation
When a latency alert fires:
- find the anomalous segment (release, tool, model);
- inspect slow runs in tracing;
- check retries, timeout, and stop_reason in logs;
- find root cause (tool, LLM, queue, agent logic, external service).
Common Mistakes
Even when latency metrics are in place, they often fail to help because of the common mistakes below.
Only average latency is tracked
Average latency often hides degradation.
For production, the minimum is p50 and p95; for critical flows, also track p99.
No latency breakdown by tools and step type
Without this breakdown, it is hard to tell what is slow: the LLM, a tool, or the agent loop itself. Tool failures also take much longer to localize.
Queue time is ignored
A run can be slow before execution even starts.
Without queue_time_p95, capacity issues are easy to miss.
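One minimal way to capture queue time is to stamp tasks when they are enqueued and measure the wait at dequeue, then pass it to the run instrumentation. The helpers below are an illustrative sketch over a plain list, not part of any queue library:

```python
import time

def enqueue(task_queue, task):
    # Record the enqueue time so queue wait can be measured at run start.
    task_queue.append((task, time.perf_counter()))

def dequeue(task_queue):
    # Returns the task plus how long it waited, ready to feed into
    # queue_time_ms instrumentation (e.g. run_agent's queue_time_ms argument).
    task, enqueued_at = task_queue.pop(0)
    queue_time_ms = (time.perf_counter() - enqueued_at) * 1000
    return task, queue_time_ms
```

Real systems usually carry the enqueue timestamp in the task payload or message headers, but the measurement principle is the same.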
No timeout-rate and retry-overhead metrics
Retries can mask instability and artificially inflate latency. This often comes together with tool spam.
No alerts for p95/p99 and timeout spikes
Without alerts, the team learns about issues too late, when SLO is already violated.
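As a sketch, alerting on these signals in Prometheus might look like the rules below. Metric names follow the instrumentation example above; the thresholds and durations are illustrative placeholders that should be tuned to your SLO:

```yaml
groups:
  - name: agent-latency
    rules:
      - alert: AgentRunLatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (release, le) (rate(agent_run_latency_ms_bucket[5m]))
          ) > 5000
        for: 10m
        labels:
          severity: warning
      - alert: AgentTimeoutSpike
        expr: |
          sum(rate(agent_timeout_total[5m]))
            / sum(rate(agent_run_total[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
```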
Self-Check
Below is a short checklist for baseline latency monitoring before release:
- run_latency_p50/p95 are tracked (p99 for critical flows);
- step_latency_p95 is broken down by step type;
- tool_latency_p95 is tracked per tool;
- ttft_p95 is tracked per model;
- queue_time_p95 is tracked;
- timeout_rate is tracked per layer;
- retry_overhead_ms is tracked;
- metrics are segmented by release, model, tool, and workflow type;
- alerts exist for p95/p99 growth and timeout spikes.
FAQ
Q: How is latency monitoring different from regular API speed monitoring?
A: For agents, you must monitor not only total response time but also internal steps: reasoning, tools, retries, and queue time.
Q: What is the minimum latency-metric set to start with?
A: Start with run_latency_p50/p95, tool_latency_p95, timeout_rate, and queue_time_p95.
Q: Why is p95 more important than average latency?
A: Because p95 shows what happens with slow requests, which users notice most.
Q: How to separate tool-latency issues from LLM-latency issues?
A: Compare tool_latency_p95 and ttft_p95: if only tool latency grows, the bottleneck is in the tools; if ttft grows, the issue is in the LLM layer.
Related Pages
Next on this topic:
- Agent Metrics β overall metrics model for agent systems.
- Tool Usage Metrics β how to isolate latency at tool level.
- Agent Cost Monitoring β how latency is connected to cost.
- Agent Tracing β how to find the slow step in a concrete run.
- Alerting In AI Agents β how to build early notifications.