Anti-Pattern: No Monitoring (Absence of Monitoring)

Why running agents without monitoring is an anti-pattern, and the minimum traces, metrics, and alerts required for production safety.
On this page
  1. Idea In 30 Seconds
  2. Anti-Pattern Example
  3. Why It Happens And What Goes Wrong
  4. Correct Approach
  5. Quick Test
  6. How It Differs From Other Anti-Patterns
  7. No Stop Conditions vs No Monitoring
  8. Overengineering Agents vs No Monitoring
  9. Agents Without Guardrails vs No Monitoring
  10. Self-Check: Do You Have This Anti-Pattern?
  11. FAQ
  12. What Next

Idea In 30 Seconds

No Monitoring is an anti-pattern where an agent system runs almost "blind": without run traces, stop_reason, and baseline metrics.

As a result, every failure looks like "the model behaved weirdly", and the team cannot quickly find the root cause. This increases debugging time, incident cost, and the risk of repeated failures.

Simple rule: every run should leave a clear trace - run_id, key step events, stop_reason, and usage metrics.


Anti-Pattern Example

The team runs a support agent in production, but logs only the final answer.

When a user reports an error, the team cannot see where it happened.

PYTHON
result = agent.run(user_message)
logger.info("answer=%s", result.text)

In this setup, there is no basic run context:

PYTHON
# no run_id
# no step trace
# no tool status/duration
# no stop_reason

For this case, you need a minimal observability layer:

PYTHON
run_id = create_run_id()
log_run_started(run_id, user_message)
...
log_stop(run_id, stop_reason, usage)
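The helper names in the snippet above (`create_run_id`, `log_run_started`, `log_stop`) can be implemented in a few lines. A minimal sketch, assuming JSON-lines logging to stdout; the format and the returned record are assumptions, not a fixed API:

```python
import json
import time
import uuid


def create_run_id() -> str:
    # short unique id so every event can be joined back to one run
    return uuid.uuid4().hex[:12]


def log_event(event: str, **fields) -> dict:
    # one JSON object per line: easy to grep, easy to load into any log store
    record = {"event": event, "ts": time.time(), **fields}
    print(json.dumps(record, sort_keys=True))
    return record


def log_run_started(run_id: str, user_message: str) -> dict:
    return log_event("run_started", run_id=run_id, message=user_message)


def log_stop(run_id: str, stop_reason: str, usage: dict) -> dict:
    return log_event("stop", run_id=run_id, stop_reason=stop_reason, usage=usage)
```

Returning the record from `log_event` keeps the helpers testable without a log backend; in production the `print` call would be replaced by your logger or event pipeline.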

Without that layer, missing monitoring adds:

  • "blind" debugging based on assumptions
  • longer recovery time after incidents
  • repeated failures of the same class

Why It Happens And What Goes Wrong

This anti-pattern often appears when the team focuses on features and postpones monitoring "for later".

Typical causes:

  • logging only final output without run-level events
  • no unified event schema for agent/tool/stop
  • missing baseline route metrics (success rate, latency, cost per request)
  • no clear observability owner in the team
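Among the causes above, the missing unified event schema is usually the cheapest to fix: one small record type shared by agent, tool, and stop events. A hedged sketch; the field names are illustrative, not a standard:

```python
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunEvent:
    # one schema for agent / tool / stop events,
    # so dashboards and queries can treat them uniformly
    event: str                          # "run_started" | "tool_result" | "stop"
    run_id: str
    step_id: Optional[int] = None
    tool: Optional[str] = None
    status: Optional[str] = None        # "ok" | "error"
    duration_ms: Optional[int] = None
    stop_reason: Optional[str] = None
    ts: float = field(default_factory=time.time)

    def to_dict(self) -> dict:
        # drop unset fields to keep log lines compact
        return {k: v for k, v in asdict(self).items() if v is not None}
```

A shared record type like this also makes the "no observability owner" problem smaller: the schema itself documents what every route must log.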

As a result, teams face:

  • long MTTR - hard to quickly localize root cause
  • repeated incidents - fixes are made "blindly" without cause validation
  • hidden degradation - latency and cost grow unnoticed
  • fragile quality - the team cannot see where success rate drops
  • weak operability - impossible to explain why a run ended that way

Unlike No Stop Conditions, the core failure here is lack of visibility: even if stop logic exists, the team cannot see how it worked in a specific run.

Typical production signals that monitoring is insufficient:

  • support learns about issues earlier than monitoring does
  • the team cannot reconstruct the event chain by run_id
  • logs often miss stop_reason or it cannot be matched to a run
  • cost per request or P95 grows, but the team notices too late

Each agent/tool step is part of a single run's execution. Without traces and metrics, the system becomes a black box for the team, and the cause-and-effect link between an action and a failure is lost.

Correct Approach

Start with a minimal observability baseline and make it mandatory for every route. Add new metrics only when they close concrete incidents or blind spots.

Practical framework:

  • record run_id and step_id for every run
  • log tool-call fields: tool, status (ok/error), duration_ms, args_hash
  • log stop_reason and usage metrics for each run
  • track core dashboards (for example, success rate, P95, cost per request) and set alerts on critical deviations
PYTHON
def run_support_agent(user_message: str):
    run_id = create_run_id()
    log_event("run_started", run_id=run_id, message=user_message)
    context = [user_message]  # tool results are appended so the agent sees them

    for step_id in range(MAX_STEPS):
        decision = agent.next_step(context)

        if decision.type == "tool_call":
            started = now_ms()
            try:
                result = run_tool(decision.tool, decision.args)
                log_event(
                    "tool_result",
                    run_id=run_id,
                    step_id=step_id,
                    tool=decision.tool,
                    duration_ms=now_ms() - started,
                    status="ok",
                )
                context.append(result)
            except Exception:
                log_event(
                    "tool_result",
                    run_id=run_id,
                    step_id=step_id,
                    tool=decision.tool,
                    duration_ms=now_ms() - started,
                    status="error",
                )
                raise
            continue

        if decision.type == "final_answer":
            log_event("stop", run_id=run_id, stop_reason="final_answer")
            return decision.output

    log_event("stop", run_id=run_id, stop_reason="max_steps_exceeded")
    return fallback_answer()  # safe default response or escalation

With this setup, every run becomes transparent: the team sees what happened, where failure occurred, and how to verify a fix.
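The dashboards and alerts mentioned in the framework can start as a plain aggregation over logged stop events. A minimal sketch, assuming each stop event carries `duration_ms` and an optional `cost_usd` field (both assumptions; the thresholds are illustrative, not recommendations):

```python
def route_metrics(stop_events: list) -> dict:
    """Compute baseline route metrics from logged stop events."""
    total = len(stop_events)
    ok = sum(1 for e in stop_events if e["stop_reason"] == "final_answer")
    latencies = sorted(e["duration_ms"] for e in stop_events)
    # nearest-rank P95: value below which ~95% of latencies fall
    p95 = latencies[max(0, int(0.95 * total) - 1)] if latencies else 0
    cost = sum(e.get("cost_usd", 0.0) for e in stop_events)
    return {
        "success_rate": ok / total if total else 0.0,
        "p95_ms": p95,
        "cost_per_request": cost / total if total else 0.0,
    }


def check_alerts(metrics: dict, min_success=0.95, max_p95_ms=5000) -> list:
    # alert only on critical deviations; tune thresholds per route
    alerts = []
    if metrics["success_rate"] < min_success:
        alerts.append("success_rate below threshold")
    if metrics["p95_ms"] > max_p95_ms:
        alerts.append("p95 latency above threshold")
    return alerts
```

In production these aggregations would run in your metrics store rather than in application code, but the same three numbers already catch the hidden-degradation failure mode described earlier.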

Quick Test

If the answer to any of these questions is "yes", you are at risk of the No Monitoring anti-pattern:

  • Is it hard to explain in 1-2 minutes why a specific run ended the way it did?
  • Does support often report failures earlier than your alerts?
  • Is it impossible to replay the latest failed run from logs and metrics?
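The replay question assumes you can at least rebuild the event chain for one run_id. A minimal sketch over JSON-lines logs (the log format is an assumption; any structured log store gives you the same query):

```python
import json


def reconstruct_run(log_lines: list, run_id: str) -> list:
    """Rebuild the ordered event chain for one run from JSON-lines logs."""
    events = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON application log lines
        if record.get("run_id") == run_id:
            events.append(record)
    # order by timestamp so the chain reads as it executed
    return sorted(events, key=lambda e: e.get("ts", 0))
```

If this function cannot produce a coherent chain for your last failed run, the run data needed for debugging does not exist.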

How It Differs From Other Anti-Patterns

No Stop Conditions vs No Monitoring

| | No Stop Conditions | No Monitoring |
| --- | --- | --- |
| Main problem | the agent loop has no clear completion conditions | there is no visibility of run/step events, metrics, and stop reasons |
| When it appears | when max_steps, timeout, no_progress are missing | when runs work without traces and baseline operational metrics |

In short: No Stop Conditions is about loop control, while No Monitoring is about not seeing what actually happened.

Overengineering Agents vs No Monitoring

| | Overengineering Agents | No Monitoring |
| --- | --- | --- |
| Main problem | extra architecture layers without measurable benefit | no operational transparency to manage a complex system |
| When it appears | planner/router/policy layers are added to simple cases "just in case" | runs go without traces, and the team cannot see which layer actually caused the failure |

In short: Overengineering Agents increases system complexity, and No Monitoring makes that complexity invisible and unmanaged.

Agents Without Guardrails vs No Monitoring

| | Agents Without Guardrails | No Monitoring |
| --- | --- | --- |
| Main problem | the agent runs without clear policy boundaries and control constraints | there is no operational transparency to manage the agent system |
| When it appears | critical safety and access rules are not enforced in explicit runtime checks | even simple incidents have to be investigated without run-level data |

In short: Agents Without Guardrails is about missing control boundaries, while No Monitoring is about missing visibility into how those boundaries did or did not work in a run.

Self-Check: Do You Have This Anti-Pattern?

Quick check for the anti-pattern No Monitoring. Mark the items that apply to your system:

  • runs have no run_id, so events cannot be joined into one chain
  • only the final answer is logged, without step-level events
  • tool calls have no status or duration in logs
  • stop_reason is missing or cannot be matched to a run
  • there are no baseline route metrics (success rate, P95, cost per request)
  • there are no alerts on critical metric deviations
  • support learns about incidents before monitoring does
  • a failed run cannot be even partially reconstructed from logs

If several items match, treat it as a sign of the No Monitoring anti-pattern.

FAQ

Q: Is it enough to just write application logs?
A: No. Agent systems need run-level monitoring: run_id, step events, stop_reason, tool traces, and quality/cost metrics.

Q: Where to start if monitoring is almost absent?
A: Start with the minimum: run_id, stop_reason, tool status/duration, success rate, P95, cost per request. This already reduces blind spots sharply.

Q: Do we need replay immediately?
A: Full replay is not mandatory on day one, but at least partial run reconstruction from logs must exist. Otherwise fix validation stays guesswork.


What Next

Related anti-patterns:

  • No Stop Conditions
  • Overengineering Agents
  • Agents Without Guardrails

What to build instead:

⏱️ 7 min read • Updated March 17, 2026 • Difficulty: ★★★
Implement in OnceOnly
Safe defaults for tool permissions + write gating.
YAML
# onceonly guardrails (concept)
version: 1
tools:
  default_mode: read_only
  allowlist:
    - search.read
    - kb.read
    - http.get
writes:
  enabled: false
  require_approval: true
  idempotency: true
controls:
  kill_switch: { enabled: true, mode: disable_writes }
audit:
  enabled: true
Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.

Author

Nick - engineer building infrastructure for production AI agents.

Focus: agent patterns, failure modes, runtime control, and system reliability.

πŸ”— GitHub: https://github.com/mykolademyanov


Editorial note

This documentation is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Content is grounded in real-world failures, post-mortems, and operational incidents in deployed AI agent systems.