Unit Testing AI Agents: Testing Agent Logic

How to write unit tests for agent logic, reasoning steps and tool execution.
On this page
  1. Idea In 30 Seconds
  2. Problem
  3. When To Use
  4. Implementation
  5. How It Works In One Test
    1. Isolate agent decision logic
    2. Replace external tools
    3. Validate more than final response text
    4. Test negative scenarios
    5. Integrate unit tests into CI
  6. Typical Mistakes
    • Dependence on real APIs
    • Testing only final text
    • Too much logic in one test
    • Unstable test environment
    • Trying to cover everything via e2e runs
  7. Summary
  8. FAQ
  9. What Next

Idea In 30 Seconds

Unit tests for AI agents validate local logic: tool selection, response handling, stop reason, and output format.

Their main value is speed, determinism, and isolation, so a failure points immediately at the exact part of the system that broke.

Problem

Without unit tests, teams often validate agents only through manual runs or heavy end-to-end tests.

This creates common problems:

  • errors in local logic are found too late;
  • it is hard to tell whether code broke or an external dependency failed;
  • small regressions accumulate and reach production.

As a result, even a simple change can trigger a chain of opaque failures in production.

When To Use

You should write unit tests whenever you have local and verifiable logic:

  • tool selection by request type;
  • output schema validation;
  • tool error handling;
  • run completion conditions (stop_reason);
  • safety rules at step or function level.

If behavior can be validated without network calls and without full agent runtime, it is a good unit-test candidate.
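For example, output schema validation from the list above can be captured as a small pure function and tested with no runtime at all. This is a schematic sketch; `validate_output` and `REQUIRED_KEYS` are illustrative names, not part of any specific framework:

```python
REQUIRED_KEYS = {"symbol", "price"}

def validate_output(output: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the output is valid."""
    errors = []
    missing = REQUIRED_KEYS - output.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if "price" in output and not isinstance(output["price"], (int, float)):
        errors.append("price must be numeric")
    return errors

# Deterministic checks: no network calls, no agent runtime needed.
assert validate_output({"symbol": "BTC", "price": 65000}) == []
assert validate_output({"symbol": "BTC"}) == ["missing keys: ['price']"]
```

Returning a list of problems instead of raising keeps the function easy to assert against and gives failing tests a readable diff.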

Implementation

In practice, unit testing for agents follows a simple rule: one behavior, one test, controlled conditions. The examples below are schematic and not tied to a specific framework.

The unit level is not suitable for evaluating overall response quality, usefulness of the final output, or the agent's general "smartness". For that, use an eval harness and golden datasets.

How It Works In One Test

Unit test flow for agents: πŸ§ͺ test case (single behavior to verify) → πŸ”§ setup (fake tools and fixed runtime) → ▢️ run (execute one agent step or function) → πŸ“Œ assertions (tool choice, schema, stop reason) → deterministic verdict. βœ… Pass: safe to keep the behavior. πŸ” Fail: fix the logic and rerun.

Unit-test focus: 🎯 local logic and boundaries; 🚫 not end-to-end quality of full answers.
Short unit-test cycle
  • Test case - one behavior to validate.
  • Setup - fakes, mocks, and fixed conditions.
  • Run - execute a specific function or step.
  • Assertions - validate tool choice, schema, stop reason.

1. Isolate agent decision logic

PYTHON
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    """Pick a tool deterministically from the intent and the allowlist."""
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"

The fewer side dependencies a function has, the more stable the test.
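Because `choose_tool` is pure, its tests need no mocks at all. A sketch (the function is repeated so the example runs standalone; test names are illustrative):

```python
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    # Repeated from above so this example is self-contained.
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"

def test_price_intent_uses_price_api_when_allowed():
    assert choose_tool("price_lookup", ["crypto_price_api"]) == "crypto_price_api"

def test_fallback_to_web_search():
    # A disallowed tool and an unrelated intent both fall back deterministically.
    assert choose_tool("price_lookup", []) == "web_search"
    assert choose_tool("small_talk", ["crypto_price_api"]) == "web_search"

test_price_intent_uses_price_api_when_allowed()
test_fallback_to_web_search()
```

Each test locks one branch of the decision, so a failure names the exact rule that regressed.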

2. Replace external tools

PYTHON
class FakeTools:
    """Stub tool layer: returns a fixed payload, never touches the network."""
    def crypto_price_api(self, symbol: str):
        return {"symbol": symbol, "price": 65000}

A unit test should validate agent logic, not availability of external APIs.
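The same idea covers the negative path: a fake that always fails lets you exercise error handling deterministically. A sketch, where `FailingTools` is a hypothetical counterpart to `FakeTools` matching the name used in the negative-scenario test later in this article:

```python
class FailingTools:
    """Fake tool layer whose calls always fail, for negative-path tests."""
    def crypto_price_api(self, symbol: str):
        raise TimeoutError(f"simulated upstream timeout for {symbol}")

# The failure mode is fixed and repeatable, so the test never flakes.
tools = FailingTools()
raised = False
try:
    tools.crypto_price_api("BTC")
except TimeoutError:
    raised = True
assert raised
```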

3. Validate more than final response text

PYTHON
def test_tool_selection_and_schema():
    tools = FakeTools()
    agent = Agent(tools=tools)
    result = agent.run("What is the price of BTC?")

    assert result.selected_tool == "crypto_price_api"
    assert isinstance(result.output, dict)
    assert result.output["symbol"] == "BTC"

It is better to lock structural invariants (selected_tool, schema, stop reason), not only final text.

4. Test negative scenarios

PYTHON
def test_tool_error_is_handled():
    tools = FailingTools()
    agent = Agent(tools=tools)
    result = agent.run("Find BTC price")

    assert result.stop_reason == "tool_error_handled"
    assert result.error is not None

Tool failures should have predictable and testable behavior.
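One way to make that behavior predictable is to convert every tool exception into an explicit stop reason at the step boundary. A minimal sketch under that assumption; `Result` and `run_tool_step` are illustrative names, not the article's runtime:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Result:
    stop_reason: str
    output: Any = None
    error: Optional[str] = None

def run_tool_step(tool_fn: Callable[..., Any], *args: Any) -> Result:
    """Execute one tool call and map any failure into a testable Result."""
    try:
        return Result(stop_reason="tool_done", output=tool_fn(*args))
    except Exception as exc:
        # Every failure lands in the same, assertable shape.
        return Result(stop_reason="tool_error_handled", error=str(exc))

def broken_tool(symbol: str) -> dict:
    raise TimeoutError(f"timeout fetching {symbol}")

result = run_tool_step(broken_tool, "BTC")
assert result.stop_reason == "tool_error_handled"
assert result.error is not None
```

Centralizing the try/except at the step boundary means the negative-scenario test above can assert on `stop_reason` instead of on exception internals.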

5. Integrate unit tests into CI

YAML
name: unit-tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/unit -q

If a test is slow or unstable, move it to eval harness or integration layer.

Typical Mistakes

Dependence on real APIs

The test fails not because of agent logic, but because of network or external service availability.

Typical cause: missing fakes or mocks for tools.

Testing only final text

The test is green, but it does not guarantee correct tool selection or output format.

Typical cause: no checks for selected_tool, schema, and stop reason.

Too much logic in one test

One test validates several scenarios at once, and when it fails it is unclear what exactly broke.

Typical cause: no "one test, one behavior" rule.

Unstable test environment

Even correct unit tests become noisy if dependencies, configuration, or tool replacements drift between runs.

Typical cause: unit tests still partially depend on real runtime or external calls.

Trying to cover everything via e2e runs

The team writes only large scenarios and skips basic local validation.

Typical cause: no clear split between unit, eval, and regression levels.

Summary

Quick take
  • Unit tests for agents validate local and deterministic logic.
  • Replace tools via fakes or mocks to remove network noise.
  • Lock structural checks: tool choice, schema, stop reason.
  • Fast unit tests should run in every PR.

FAQ

Q: Can unit tests replace eval harness?
A: No. Unit tests catch local failures, while eval harness validates full agent behavior on complete scenarios.

Q: Should we connect a real LLM in unit tests?
A: Prefer minimal usage. For unit level, deterministic logic with fakes or mocks and controlled conditions works better.

Q: What must every agent unit test verify?
A: Tool selection, output structure, error handling, and stop reason in negative scenarios.

Q: When should a test move from unit level to eval level?
A: When it depends on full scenario behavior, response-quality metrics, or baseline comparisons.

What Next

After unit level, add scenario validation through Eval Harness, and maintain a stable case set through Golden Datasets.

For version-to-version control, add Regression Testing. For production-incident analysis, use Replay and Debugging. Keep the full picture in Testing Strategy.

⏱️ 5 min read β€’ Updated March 13, 2026 β€’ Difficulty: β˜…β˜…β˜†
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.