Idea In 30 Seconds
Replay for AI agents means taking a real problematic trace, reproducing it in controlled conditions, and finding the failure cause step by step.
Its main value is that the team does not guess at the incident cause: it sees the full chain of agent decisions and the exact point where behavior broke.
Problem
Without replay, teams often debug "from memory":
- they look only at the final agent response;
- they do not have full input context;
- they do not see step-level tool-call results.
In this mode, it is hard to separate the symptom from the cause. Fixes become imprecise, and incidents return.
Typical outcomes of this approach:
- the error is fixed locally, but the real production scenario is not reproduced;
- the same failure class appears again after release;
- the team loses time on repeated manual investigations.
When To Use
Replay is used when:
- a production incident happened and the root cause must be found;
- an unexpected behavior diff appears after model or prompt changes;
- a regression test shows a critical-case failure;
- the team must confirm a fix truly closes the incident scenario.
Replay is most useful in systems with multi-step agent behavior and external tools.
Implementation
In practice, replay follows one principle: same trace, same run conditions, step-by-step decision analysis. Examples below are schematic and not tied to a specific framework.
How It Works In One Investigation
Short replay investigation cycle
- Trace - store the input, context, steps, and tool responses.
- Replay - reproduce the same scenario in a controlled environment.
- Step timeline - inspect where the agent made a wrong decision.
- Root cause - capture the technical cause: prompt, model, tool, or runtime.
- Fix and verify - apply the fix and confirm it with a rerun.
1. Store trace with full context
trace = {
    "trace_id": "incident-2026-03-11-42",
    "input": "Refund order #8472",
    "conversation_state": {"user_tier": "pro"},
    "steps": [
        {"tool": "payments_api", "args": {"order_id": "8472"}, "result": {"status": "timeout"}},
        {"tool": "fallback_policy", "args": {}, "result": {"action": "ask_for_retry"}},
    ],
    "final_output": "Please try again later.",
    "stop_reason": "fallback_used",
}
Without a full trace, replay almost never reproduces the real incident cause.
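A simple guard against incomplete traces is to check required fields before starting an investigation. This is a minimal sketch; the field names follow the trace example above and are an assumption, not a fixed schema.

```python
# Required fields are taken from the trace example above (an assumed schema,
# not a standard one).
REQUIRED_TRACE_FIELDS = [
    "trace_id", "input", "conversation_state",
    "steps", "final_output", "stop_reason",
]

def missing_trace_fields(trace):
    """Return the required fields that are absent from a stored trace."""
    return [field for field in REQUIRED_TRACE_FIELDS if field not in trace]
```

Running this check at trace-ingestion time surfaces incomplete logging before an incident happens, rather than during the investigation.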
2. Reproduce trace in same conditions
def replay_trace(agent, trace, runtime_config):
    return agent.replay(
        trace=trace,
        model_version=runtime_config["model_version"],
        tool_mocks=runtime_config["tool_mocks"],
        timeout_sec=runtime_config["timeout_sec"],
    )
If the model, timeouts, or tool conditions differ, the replay can look falsely safe.
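One way to catch such drift is to diff the recorded incident config against the replay config before running. A hedged sketch, assuming runtime config is stored as a flat dict:

```python
def runtime_config_diff(incident_config, replay_config):
    """Return the keys whose values differ between the incident run
    and the planned replay run (incident value, replay value)."""
    keys = set(incident_config) | set(replay_config)
    return {
        key: (incident_config.get(key), replay_config.get(key))
        for key in sorted(keys)
        if incident_config.get(key) != replay_config.get(key)
    }
```

An empty diff does not guarantee identical behavior, but a non-empty one is a clear signal that the replay is not running under incident conditions.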
3. Analyze step timeline
def find_first_bad_step(replayed_steps):
    return next(
        ((idx, step) for idx, step in enumerate(replayed_steps)
         if step["status"] == "unexpected"),
        None,
    )
The core debugging goal is to find the first step where the system leaves the expected scenario.
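A usage sketch of that timeline scan, with illustrative step data (the step shapes here are assumptions for the example):

```python
def find_first_bad_step(replayed_steps):
    """Return (index, step) of the first unexpected step, or None."""
    return next(
        ((idx, step) for idx, step in enumerate(replayed_steps)
         if step["status"] == "unexpected"),
        None,
    )

replayed = [
    {"tool": "payments_api", "status": "ok"},
    {"tool": "retry_policy", "status": "unexpected"},  # timeout skipped retry
    {"tool": "fallback_policy", "status": "ok"},
]
idx, step = find_first_bad_step(replayed)
# idx == 1: the run left the expected scenario at the retry decision,
# even though every later step looked "ok" on its own.
```

Note that the later fallback step looks normal in isolation; only the timeline view shows it was reached for the wrong reason.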
4. Capture root cause in structured form
incident_report = {
    "trace_id": "incident-2026-03-11-42",
    "root_cause": "tool_timeout_not_handled_as_retryable",
    "affected_component": "retry_policy",
    "fix_plan": "treat payments timeout as retryable before fallback",
}
A structured root cause makes fix validation and team knowledge transfer easier.
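The "fix and verify" step from the cycle above can also be sketched in code: rerun the replay after the fix and compare against expected behavior. The expected `stop_reason` and step shapes are assumptions based on the examples in this section.

```python
def verify_fix(replayed_trace, expected_stop_reason="resolved"):
    """A fix is confirmed only when the rerun reaches the expected stop
    reason and no step in the timeline is flagged as unexpected."""
    steps_ok = all(
        step.get("status") != "unexpected"
        for step in replayed_trace["steps"]
    )
    return steps_ok and replayed_trace["stop_reason"] == expected_stop_reason
```

Checking both conditions matters: a run can reach the right final answer while still taking an unexpected path, which usually means the fix masked the symptom rather than removing the cause.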
5. Add incident to regression set
def promote_to_regression_case(trace, report):
    return {
        "id": trace["trace_id"],
        "input": trace["input"],
        "expected_behavior": {"stop_reason": "resolved"},
        "tags": ["incident", "replay", report["affected_component"]],
    }
After a replay investigation, the case should go into a regression or golden dataset; otherwise the incident can repeat.
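Once promoted, the case can be checked on every release. A minimal sketch of that check, where `run_agent` is a hypothetical entry point for the agent under test:

```python
def check_regression_case(run_agent, case):
    """Run a promoted incident case and compare the observed stop reason
    against the expected behavior recorded at promotion time."""
    result = run_agent(case["input"])
    return result["stop_reason"] == case["expected_behavior"]["stop_reason"]
```

In practice this runs inside the team's regression harness; the point is that the incident scenario becomes a permanent, automatically checked case rather than tribal knowledge.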
Typical Mistakes
Incomplete incident trace
Logs contain the final response, but no agent steps and no tool results.
Typical cause: only a summary is stored, without step-level details.
Replay in different runtime conditions
The trace is replayed on a different model or with different timeout/retry settings.
Typical cause: the incident runtime conditions were not recorded.
Debugging only final text
The team analyzes only the last response and misses a failure cause in the middle of the run.
Typical cause: no step-by-step timeline of agent decisions.
Root cause is not captured structurally
After the incident, the team has a verbal conclusion but no clear technical record.
Typical cause: a missing incident-report template.
Case is not added to regression
The incident was fixed but not added to the permanent test set.
Typical cause: the replay investigation is disconnected from the regression workflow.
Summary
- Replay and debugging provide reproducible analysis of production incidents.
- High-quality replay requires the same trace and the same runtime conditions.
- Debugging should follow agent steps, not only final text.
- After fix, incident case should be promoted to regression or golden dataset.
FAQ
Q: How is replay different from regression testing?
A: Regression testing compares system versions on case sets, while replay reproduces one real incident to find its root cause.
Q: What is the minimum required for quality replay?
A: the input, context state, step-by-step tool calls and their results, the stop_reason, and the runtime config.
Q: Can replay be done without production API access?
A: Yes. Teams usually use stored responses or mocks to reproduce the incident logic in a stable way.
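One common pattern is to build the tool mocks directly from the stored incident trace, so the replay returns exactly the responses observed in production. A sketch, assuming the trace schema shown earlier:

```python
def build_tool_mocks(trace):
    """Map each tool name to the response recorded during the incident,
    so replay never touches production APIs. Assumes each tool appears
    once per trace; multi-call tools would need per-call sequencing."""
    return {step["tool"]: step["result"] for step in trace["steps"]}
```

The resulting dict can be passed as the `tool_mocks` entry of the runtime config used in the replay examples above.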
Q: When is a replay case considered closed?
A: When the fix passes a replay rerun and the same scenario consistently passes in the regression set.
What Next
After replay investigations, add incident cases to Golden Datasets and validate them with Regression Testing. Use Eval Harness for standardized runs, and Unit Testing for local logic checks.