Quick take: Without observability, every agent failure becomes “the model was weird” — an untestable, unfixable diagnosis. You need: tool call traces, stop reasons, cost tracking, and replay capability. This isn’t optional infrastructure.
You'll learn: Minimum monitoring requirements • One unified event schema • Stop reason taxonomy • Replay basics • A concrete incident you can recognize
Without monitoring: users report issues first • debugging by vibes • no replay
With minimal monitoring: detect drift early • debug via traces + stop reasons • replay last runs
Impact: faster incident response + fewer repeated failures (because you can fix root cause)
Problem-first intro
An agent run goes wrong.
A user reports: “it sent the wrong email.”
You open logs and you have:
- The final answer text (maybe)
- A stack trace (maybe)
- Vibes (definitely)
If you can’t answer these five questions, the system isn’t operable:
- Which tools were called (and in what order)?
- With what arguments (or at least args hashes)?
- What came back (or at least snapshot hashes)?
- Which version of model/prompt/tools was running?
- Why did it stop?
That’s not “missing dashboards”. That’s “this system is not operable”.
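As a sketch, here is one structured record that answers all five questions at once. Field names are illustrative (including the `versions` block, which the schema later on this page does not yet carry):

```json
{
  "run_id": "run_8f3a",
  "kind": "stop",
  "stop_reason": "budget_exceeded",
  "steps": [
    {"step_id": 1, "tool": "db.query", "args_sha": "9b1c24aa…", "result_sha": "77e0b1c2…", "status": "ok"},
    {"step_id": 2, "tool": "http.get", "args_sha": "04fd98e1…", "result_sha": "c2a77d10…", "status": "error"}
  ],
  "versions": {"model": "m-2025-01", "prompt": "v14", "tools": "v3"}
}
```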
The 03:00 moment
This is what “no monitoring” feels like:
03:12 — Support: "Agent emailed the wrong customer. Please stop it."
You grep logs and find… nothing you can join.
```
2026-02-07T03:11:58Z INFO sent email to customer@example.com
2026-02-07T03:11:59Z INFO sent email to customer@example.com
2026-02-07T03:12:01Z WARN http.get 429
2026-02-07T03:12:03Z INFO Agent completed task
```
No run_id. No step trace. No stop reason. No tool args hash. No model/tool version.
So you do the worst kind of debugging: grep by an email address and pray it’s unique.
Why this fails in production
1) Agents are distributed systems with extra steps
Once an agent calls tools, you’ve built:
- multiple dependencies (HTTP, DB, APIs)
- multiple failure modes (timeouts, 502s, rate limits)
- multiple retries (and retry storms)
If you don’t log each step, you’re debugging by storytelling.
2) “Success rate” hides the interesting failures
Drift shows up as:
- higher tool calls per run (looping, not failing)
- higher tokens per request (the model “explaining” errors)
- longer latency (retries, slow tools)
- different stop reasons (budgets, denials, timeouts)
3) You can’t fix what you can’t replay
If you can’t replay or even reconstruct a run from logs, you can’t trust a “fix”. You’ll just be guessing.
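A minimal sketch of what "reconstruct a run from logs" means, assuming events are JSON lines in the unified schema shown later on this page:

```python
import json

def reconstruct_run(log_lines: list[str], run_id: str) -> dict:
    """Group structured events by run_id and order tool calls by step_id."""
    events = [json.loads(line) for line in log_lines]
    mine = [e for e in events if e.get("run_id") == run_id]
    steps = sorted(
        (e for e in mine if e["kind"] == "tool_result"),
        key=lambda e: e["step_id"],
    )
    stop = next((e for e in mine if e["kind"] == "stop"), None)
    return {"steps": steps, "stop_reason": stop["stop_reason"] if stop else None}

log = [
    '{"run_id": "r1", "kind": "tool_result", "step_id": 2, "tool": "http.get"}',
    '{"run_id": "r1", "kind": "tool_result", "step_id": 1, "tool": "db.query"}',
    '{"run_id": "r1", "kind": "stop", "stop_reason": "budget_exceeded"}',
]
run = reconstruct_run(log, "r1")
# tool names in call order, plus why the run ended
print([s["tool"] for s in run["steps"]], run["stop_reason"])
```

If this function can't be written against your logs, neither can a postmortem.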
4) Monitoring is part of governance
Budgets, allowlists, and kill switches are useless if you can’t see when they triggered.
Hard invariants (non-negotiables)
- Every run has a `run_id`.
- Every step has a `step_id`.
- Every tool call logs: tool name, args hash, duration, status, error class.
- Every run ends with a stop event carrying a `stop_reason`.
- If you can’t replay (even partially), you can’t trust a fix.
Implementation example (real code)
The common failure here is having two different log formats:
- tool events are structured
- stop events are “special”
That kills joinability.
This sample uses one unified event schema for tool calls and stop events.
```python
from __future__ import annotations

from dataclasses import dataclass, asdict
import hashlib
import json
import time
from typing import Any, Literal

EventKind = Literal["tool_result", "stop"]


def sha(obj: Any) -> str:
    """Stable short content hash: same args -> same hash, across runs."""
    raw = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:24]


@dataclass(frozen=True)
class Event:
    run_id: str
    kind: EventKind
    ts_ms: int
    # optional fields
    step_id: int | None = None
    tool: str | None = None
    args_sha: str | None = None
    duration_ms: int | None = None
    status: Literal["ok", "error"] | None = None
    error: str | None = None
    stop_reason: str | None = None
    usage: dict[str, Any] | None = None


def log_event(ev: Event) -> None:
    print(json.dumps(asdict(ev), ensure_ascii=False))


def call_tool(run_id: str, step_id: int, tool: str, args: dict[str, Any]) -> Any:
    started = time.time()
    try:
        out = tool_impl(tool, args=args)  # (pseudo)
        dur = int((time.time() - started) * 1000)
        log_event(
            Event(
                run_id=run_id,
                kind="tool_result",
                ts_ms=int(time.time() * 1000),
                step_id=step_id,
                tool=tool,
                args_sha=sha(args),
                duration_ms=dur,
                status="ok",
                error=None,
            )
        )
        return out
    except Exception as e:
        dur = int((time.time() - started) * 1000)
        log_event(
            Event(
                run_id=run_id,
                kind="tool_result",
                ts_ms=int(time.time() * 1000),
                step_id=step_id,
                tool=tool,
                args_sha=sha(args),
                duration_ms=dur,
                status="error",
                error=type(e).__name__,
            )
        )
        raise


def stop(run_id: str, *, reason: str, usage: dict[str, Any]) -> dict[str, Any]:
    log_event(
        Event(
            run_id=run_id,
            kind="stop",
            ts_ms=int(time.time() * 1000),
            stop_reason=reason,
            usage=usage,
        )
    )
    return {"status": "stopped", "stop_reason": reason, "usage": usage}
```

The same schema in JavaScript (Node):

```js
import crypto from "node:crypto";

// Sort keys at every nesting level so the hash is stable regardless of
// insertion order (matches json.dumps(..., sort_keys=True) in the Python
// version; a JSON.stringify replacer array would drop nested keys).
function canonical(obj) {
  if (Array.isArray(obj)) return obj.map(canonical);
  if (obj && typeof obj === "object") {
    return Object.fromEntries(
      Object.keys(obj).sort().map((k) => [k, canonical(obj[k])])
    );
  }
  return obj;
}

export function sha(obj) {
  const raw = JSON.stringify(canonical(obj));
  return crypto.createHash("sha256").update(raw, "utf8").digest("hex").slice(0, 24);
}

export function logEvent(ev) {
  console.log(JSON.stringify(ev));
}

export async function callTool(runId, stepId, tool, args) {
  const started = Date.now();
  try {
    const out = await toolImpl(tool, { args }); // (pseudo)
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "ok",
      error: null,
    });
    return out;
  } catch (e) {
    logEvent({
      run_id: runId,
      kind: "tool_result",
      ts_ms: Date.now(),
      step_id: stepId,
      tool,
      args_sha: sha(args),
      duration_ms: Date.now() - started,
      status: "error",
      error: e?.name || "Error",
    });
    throw e;
  }
}

export function stop(runId, { reason, usage }) {
  logEvent({
    run_id: runId,
    kind: "stop",
    ts_ms: Date.now(),
    stop_reason: reason,
    usage,
  });
  return { status: "stopped", stop_reason: reason, usage };
}
```

Example failure case (concrete)
🚨 Incident: “Everything is slow” (and we didn’t know why)
Date: 2024-10-08
Duration: 3 days unnoticed, ~2 hours to debug once we added visibility
System: Customer support agent
What actually happened
The http.get tool started returning intermittent 429s/503s.
Our tool layer retried up to 8× per call (previously 2×) without jitter. The agent interpreted those failures as “try a different query” and ended up doing more tool calls per run.
Over 3 days (illustrative numbers, but this pattern is common):
- avg tool calls/run: 4.3 → 11.7
- p95 latency: 2.1s → 8.4s
- spend/run: ~2×
Nothing “crashed”. Success rate stayed ~91%, so the drift looked like “users are impatient” until support escalated.
Root cause (the boring version)
- retries + no jitter → thundering herd
- no stop reasons in logs → “success” masked drift
- no tool-call trace → we couldn’t prove where time/spend went
Fix
- Structured event logs (run_id, step_id, tool, args hash, duration, status)
- Stop reasons surfaced to the caller/UI
- Dashboards + alerts on drift signals (tool calls/run, latency P95, stop reasons)
Dashboards + alerts (examples you can steal)
You don’t need perfect observability. You need useful observability.
PromQL examples (Grafana)
```promql
# Tool calls per run (p95)
histogram_quantile(0.95, sum(rate(agent_tool_calls_bucket[5m])) by (le))

# Stop reasons over time
sum(rate(agent_stop_total[10m])) by (stop_reason)

# Latency p95
histogram_quantile(0.95, sum(rate(agent_run_latency_ms_bucket[5m])) by (le))
```
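The metric names above (`agent_tool_calls_bucket`, `agent_stop_total`) assume you already export Prometheus metrics. If you don't yet, the same drift signals fall straight out of the unified event log; a sketch:

```python
from collections import Counter, defaultdict

def drift_signals(events: list[dict]) -> tuple[dict, dict]:
    """Per-run tool-call counts and the stop_reason distribution."""
    calls: dict[str, int] = defaultdict(int)
    reasons: Counter = Counter()
    for e in events:
        if e["kind"] == "tool_result":
            calls[e["run_id"]] += 1
        elif e["kind"] == "stop":
            reasons[e["stop_reason"]] += 1
    return dict(calls), dict(reasons)

events = [
    {"run_id": "r1", "kind": "tool_result"},
    {"run_id": "r1", "kind": "tool_result"},
    {"run_id": "r1", "kind": "stop", "stop_reason": "done"},
    {"run_id": "r2", "kind": "tool_result"},
    {"run_id": "r2", "kind": "stop", "stop_reason": "tool_timeout"},
]
calls, reasons = drift_signals(events)
print(calls)    # tool calls per run
print(reasons)  # stop_reason distribution
```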
SQL example (Postgres/BigQuery-style)
```sql
-- Alert: tool_calls/run spike vs baseline
SELECT
  date_trunc('hour', created_at) AS hour,
  avg(tool_calls) AS avg_tool_calls
FROM agent_runs
WHERE created_at > now() - interval '7 days'
GROUP BY 1
HAVING avg(tool_calls) > 2 * (
  SELECT avg(tool_calls)
  FROM agent_runs
  WHERE created_at BETWEEN now() - interval '14 days' AND now() - interval '7 days'
);
```
Alert rules (plain English)
- If `tool_calls_per_run_p95` is 2× baseline for 10 minutes → investigate (and consider killing writes).
- If `stop_reason=loop_detected` appears above baseline → investigate (tool spam / bad prompt / outage).
- If `stop_reason=tool_timeout` spikes → you have upstream issues, not “model weirdness”.
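The first rule above, as code. This is a sketch with illustrative thresholds; in practice the baseline comes from your metrics store, not a hard-coded number:

```python
def should_alert(
    current_p95: float,
    baseline_p95: float,
    sustained_minutes: float,
    window_minutes: float = 10,
    factor: float = 2.0,
) -> bool:
    """Alert when the metric holds at `factor`x baseline for the whole window."""
    return current_p95 >= factor * baseline_p95 and sustained_minutes >= window_minutes

# The incident above: tool calls/run drifted 4.3 -> 11.7
print(should_alert(current_p95=11.7, baseline_p95=4.3, sustained_minutes=12))
# A mild bump that never crosses 2x baseline stays quiet
print(should_alert(current_p95=5.0, baseline_p95=4.3, sustained_minutes=30))
```

The "sustained for N minutes" clause matters: without it, one slow tool call pages someone at 03:00.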
Trade-offs
- Logging costs money (storage, indexing). Still cheaper than blind incidents.
- You must avoid logging raw PII/secrets. Hash args and redact aggressively.
- Replay requires retention policy + access controls.
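"Hash args and redact aggressively" can be as small as this sketch (the `SENSITIVE` key list is illustrative; extend it for your domain):

```python
import hashlib
import json

SENSITIVE = {"email", "password", "api_key", "token", "ssn"}

def redact(args: dict) -> dict:
    """Drop known-sensitive keys before anything reaches a log line."""
    return {k: v for k, v in args.items() if k not in SENSITIVE}

def args_sha(args: dict) -> str:
    """Hash the full args so runs stay joinable without storing raw values."""
    raw = json.dumps(args, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:24]

args = {"query": "refund status", "email": "customer@example.com"}
print(redact(args))         # safe to log
print(len(args_sha(args)))  # fixed-length hash; same input -> same hash
```

Note the split: the hash covers the raw args (so identical calls are identifiable), while only the redacted view is ever stored as plaintext.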
When NOT to use
- Don’t build a heavy tracing platform before you have structured logs. Start small.
- Don’t log raw tool args if they contain PII/secrets. Ever.
- Don’t ship agents without stop reasons: callers that can’t tell why a run ended will just retry it.
Copy-paste checklist
- [ ] `run_id`/`step_id` for every run
- [ ] Unified event schema (tool results + stop events)
- [ ] Tool-call logs: tool, args_hash, duration, status, error class
- [ ] Stop reason returned to user + logged
- [ ] Tokens/tool calls/spend per run metrics
- [ ] Dashboards: latency P95, tool_calls/run, stop_reason distribution
- [ ] Replay data: snapshot hashes (with retention + access control)
Safe default config
```yaml
logging:
  events:
    enabled: true
    schema: "unified"
    store_args: false
    store_args_hash: true
    include: ["run_id", "step_id", "tool", "duration_ms", "status", "error", "stop_reason"]
metrics:
  track: ["tokens_per_request", "tool_calls_per_run", "latency_p95", "spend_per_run", "stop_reason"]
retention:
  tool_snapshot_days: 14
  logs_days: 30
```
Production takeaway
What breaks without this
- ❌ You can’t explain incidents
- ❌ Drift looks like “model weirdness”
- ❌ Cost overruns show up after the fact
What works with this
- ✅ You can join, replay, and debug runs
- ✅ Drift becomes a graph, not a debate
- ✅ Kill switches trigger based on real signals
Minimum to ship
- Unified structured logs
- Stop reasons
- Basic metrics + dashboards
- Alerts on drift