Idea In 30 Seconds
No Monitoring is an anti-pattern where an agent system runs almost "blind": without run traces, stop_reason, and baseline metrics.
As a result, every failure looks like "the model behaved weirdly", and the team cannot quickly find the root cause. This increases debugging time, incident cost, and the risk of repeated failures.
Simple rule: every run should leave a clear trace - run_id, key step events, stop_reason, and usage metrics.
Anti-Pattern Example
The team runs a support agent in production, but logs only the final answer.
When a user reports an error, the team cannot see where it happened.
```python
result = agent.run(user_message)
logger.info("answer=%s", result.text)
```
In this setup, there is no basic run context:
```python
# no run_id
# no step trace
# no tool status/duration
# no stop_reason
```
For this case, you need a minimal observability layer:
```python
run_id = create_run_id()
log_run_started(run_id, user_message)
...
log_stop(run_id, stop_reason, usage)
```
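The helpers above can be implemented on top of standard structured logging. This is a minimal sketch: the function names come from the snippet, but the implementation details (JSON-lines output, uuid-based ids, the usage field names) are assumptions.

```python
import json
import logging
import uuid

logger = logging.getLogger("agent.runs")

def create_run_id() -> str:
    # A unique id that ties all events of one run together.
    return uuid.uuid4().hex

def log_run_started(run_id: str, user_message: str) -> None:
    # One structured line per event, so runs can be reconstructed later.
    logger.info(json.dumps({
        "event": "run_started",
        "run_id": run_id,
        "message": user_message,
    }))

def log_stop(run_id: str, stop_reason: str, usage: dict) -> None:
    logger.info(json.dumps({
        "event": "stop",
        "run_id": run_id,
        "stop_reason": stop_reason,
        "usage": usage,  # e.g. token counts, cost fields
    }))
```

Emitting each event as one JSON line keeps the schema greppable and makes later aggregation trivial.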
Without this layer, missing monitoring adds:
- "blind" debugging based on assumptions
- longer recovery time after incidents
- repeated failures of the same class
Why It Happens And What Goes Wrong
This anti-pattern often appears when the team focuses on features and postpones monitoring "for later".
Typical causes:
- logging only final output without run-level events
- no unified event schema for agent/tool/stop
- missing baseline route metrics (success rate, latency, cost per request)
- no clear observability owner in the team
As a result, teams face:
- long MTTR - hard to quickly localize root cause
- repeated incidents - fixes are made "blindly" without cause validation
- hidden degradation - latency and cost grow unnoticed
- fragile quality - the team cannot see where success rate drops
- weak operability - impossible to explain why a run ended that way
Unlike No Stop Conditions, the core failure here is lack of visibility: even if stop logic exists, the team cannot see how it worked in a specific run.
Typical production signals that monitoring is insufficient:
- support learns about issues earlier than monitoring does
- the team cannot reconstruct the event chain by run_id
- logs often miss stop_reason, or it cannot be matched to a run
- cost per request or P95 grows, but the team notices too late
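The "cannot reconstruct the event chain by run_id" signal has a direct test: if events are logged as structured records, grouping them back should be trivial. A sketch, assuming JSON-lines events with the field names used in this article:

```python
import json
from collections import defaultdict

def events_by_run(log_lines: list[str]) -> dict[str, list[dict]]:
    # Group structured log events by run_id to rebuild each run's chain.
    runs: dict[str, list[dict]] = defaultdict(list)
    for line in log_lines:
        event = json.loads(line)
        runs[event["run_id"]].append(event)
    return dict(runs)

lines = [
    '{"event": "run_started", "run_id": "r-1"}',
    '{"event": "tool_result", "run_id": "r-1", "tool": "search", "status": "ok"}',
    '{"event": "stop", "run_id": "r-1", "stop_reason": "final_answer"}',
]
chain = events_by_run(lines)
```

If this grouping is impossible on your current logs, the event chain signal applies.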
Each agent/tool step is part of a run's execution. Without traces and metrics, the system becomes a black box for the team, and the cause-effect link between an action and a failure is lost.
Correct Approach
Start with a minimal observability baseline and make it mandatory for every route. Add new metrics only when they close concrete incidents or blind spots.
Practical framework:
- record run_id and step_id for every run
- log tool-call fields: tool, status (ok/error), duration_ms, args_hash
- log stop_reason and usage metrics for each run
- track core dashboards (for example, success rate, P95, cost per request) and set alerts on critical deviations
```python
def run_support_agent(user_message: str):
    run_id = create_run_id()
    log_event("run_started", run_id=run_id, message=user_message)
    for step_id in range(MAX_STEPS):
        decision = agent.next_step(user_message)
        if decision.type == "tool_call":
            started = now_ms()
            try:
                result = run_tool(decision.tool, decision.args)
                log_event(
                    "tool_result",
                    run_id=run_id,
                    step_id=step_id,
                    tool=decision.tool,
                    duration_ms=now_ms() - started,
                    status="ok",
                )
            except Exception:
                log_event(
                    "tool_result",
                    run_id=run_id,
                    step_id=step_id,
                    tool=decision.tool,
                    duration_ms=now_ms() - started,
                    status="error",
                )
                raise
            continue
        if decision.type == "final_answer":
            log_event("stop", run_id=run_id, stop_reason="final_answer")
            return decision.output
    log_event("stop", run_id=run_id, stop_reason="max_steps_exceeded")
    return fallback_answer()  # safe default response or escalation
```
With this setup, every run becomes transparent: the team sees what happened, where failure occurred, and how to verify a fix.
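The baseline dashboard metrics mentioned in the framework can be derived directly from stop events. A minimal sketch, assuming each stop event carries duration_ms and a usage dict with a cost field (these field names are assumptions):

```python
def route_metrics(stop_events: list[dict]) -> dict:
    # Derive success rate, P95 latency, and cost per request from stop events.
    n = len(stop_events)
    ok = sum(1 for e in stop_events if e["stop_reason"] == "final_answer")
    latencies = sorted(e["duration_ms"] for e in stop_events)
    p95 = latencies[min(n - 1, int(0.95 * n))]  # nearest-rank approximation
    cost = sum(e["usage"]["cost_usd"] for e in stop_events) / n
    return {"success_rate": ok / n, "p95_ms": p95, "cost_per_request": cost}

events = [
    {"stop_reason": "final_answer", "duration_ms": 800, "usage": {"cost_usd": 0.01}},
    {"stop_reason": "final_answer", "duration_ms": 1200, "usage": {"cost_usd": 0.02}},
    {"stop_reason": "max_steps_exceeded", "duration_ms": 5000, "usage": {"cost_usd": 0.09}},
]
m = route_metrics(events)
```

Once these three numbers exist per route, alerts on critical deviations become a thresholding problem rather than an investigation.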
Quick Test
If the answer to any of these questions is "yes", you are at risk of the No Monitoring anti-pattern:
- Is it hard to explain in 1-2 minutes why a specific run ended the way it did?
- Does support often report failures earlier than your alerts?
- Is it impossible to replay the latest failed run from logs and metrics?
How It Differs From Other Anti-Patterns
No Stop Conditions vs No Monitoring
| No Stop Conditions | No Monitoring |
|---|---|
| Main problem: the agent loop has no clear completion conditions. | Main problem: there is no visibility of run/step events, metrics, and stop reasons. |
| When it appears: when max_steps, timeout, no_progress are missing. | When it appears: when runs work without traces and baseline operational metrics. |
In short: No Stop Conditions is about loop control, while No Monitoring is about not seeing what actually happened.
Overengineering Agents vs No Monitoring
| Overengineering Agents | No Monitoring |
|---|---|
| Main problem: extra architecture layers without measurable benefit. | Main problem: no operational transparency to manage a complex system. |
| When it appears: planner/router/policy layers are added to simple cases "just in case". | When it appears: runs go without traces, and the team cannot see which layer actually caused the failure. |
In short: Overengineering Agents increases system complexity, and No Monitoring makes that complexity invisible and unmanaged.
Agents Without Guardrails vs No Monitoring
| Agents Without Guardrails | No Monitoring |
|---|---|
| Main problem: the agent runs without clear policy boundaries and control constraints. | Main problem: there is no operational transparency to manage the agent system. |
| When it appears: critical safety and access rules are not enforced in explicit runtime checks. | When it appears: even simple incidents have to be investigated without run-level data. |
In short: Agents Without Guardrails is about missing control boundaries, while No Monitoring is about missing visibility into how those boundaries did or did not work in a run.
Self-Check: Do You Have This Anti-Pattern?
Quick check for the anti-pattern No Monitoring: go through the production signals and Quick Test questions above and mark which ones apply to your system.
If several apply, start with the minimal observability baseline: run_id, step events, stop_reason, and usage metrics for every run.
FAQ
Q: Is it enough to just write application logs?
A: No. Agent systems need run-level monitoring: run_id, step events, stop_reason, tool traces, and quality/cost metrics.
Q: Where to start if monitoring is almost absent?
A: Start with the minimum: run_id, stop_reason, tool status/duration, success rate, P95, cost per request. This already reduces blind spots sharply.
Q: Do we need replay immediately?
A: Full replay is not mandatory on day one, but at least partial run reconstruction from logs must exist. Otherwise fix validation stays guesswork.
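Partial run reconstruction can start very small: pick the most recent run whose stop_reason is not final_answer and return its event chain in order. A sketch under the event shape used earlier in this article:

```python
def last_failed_run(events: list[dict]) -> list[dict]:
    # Return the event chain of the most recent run that did not finish normally.
    failed_ids = [
        e["run_id"] for e in events
        if e["event"] == "stop" and e["stop_reason"] != "final_answer"
    ]
    if not failed_ids:
        return []
    run_id = failed_ids[-1]
    return [e for e in events if e["run_id"] == run_id]

log = [
    {"event": "run_started", "run_id": "a"},
    {"event": "stop", "run_id": "a", "stop_reason": "final_answer"},
    {"event": "run_started", "run_id": "b"},
    {"event": "stop", "run_id": "b", "stop_reason": "max_steps_exceeded"},
]
chain = last_failed_run(log)
```

Even this level of reconstruction turns fix validation from guesswork into a concrete before/after comparison.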
What Next
Related anti-patterns:
- No Stop Conditions - when the agent loop has no clear completion conditions.
- Agents Without Guardrails - when the agent runs without clear policy boundaries.
- Overengineering Agents - when architecture grows unnecessary layers.
What to build instead:
- Stop Conditions - how to enforce controlled run completion.
- Tool Execution Layer - where to centralize logging and control of tool calls.
- Agent Runtime - how to make agent execution transparent at step and decision level.