Unit Testing AI Agents: Testing Agent Logic

How to write unit tests for agent logic, reasoning steps and tool execution.
On this page
  1. Idea In 30 Seconds
  2. Problem
  3. When To Use
  4. Implementation
  5. How It Works In One Test
    1. Isolate agent decision logic
    2. Replace external tools
    3. Validate more than final response text
    4. Test negative scenarios
    5. Integrate unit tests into CI
  6. Typical Mistakes
    • Dependence on real APIs
    • Testing only final text
    • Too much logic in one test
    • Unstable test environment
    • Trying to cover everything via e2e runs
  7. Summary
  8. FAQ
  9. What Next

Idea In 30 Seconds

Unit tests for AI agents validate local logic: tool selection, response handling, stop reason, and output format.

Their main value is speed, determinism, and isolation, so a failure points immediately at the exact part of the system that broke.

Problem

Without unit tests, teams often validate agents only through manual runs or heavy end-to-end tests.

This creates common problems:

  • errors in local logic are found too late;
  • it is hard to tell whether code broke or an external dependency failed;
  • small regressions accumulate and reach production.

As a result, even a simple change can trigger a chain of opaque failures in production.

When To Use

You should write unit tests whenever you have local and verifiable logic:

  • tool selection by request type;
  • output schema validation;
  • tool error handling;
  • run completion conditions (stop_reason);
  • safety rules at step or function level.

If behavior can be validated without network calls and without full agent runtime, it is a good unit-test candidate.
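For example, output schema validation from the list above can be captured as a small pure function and tested with no runtime at all. This is a schematic sketch; `validate_output` and `REQUIRED_KEYS` are illustrative names, not part of any specific framework:

```python
REQUIRED_KEYS = {"symbol", "price"}

def validate_output(output: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the output is valid."""
    errors = []
    missing = REQUIRED_KEYS - output.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if "price" in output and not isinstance(output["price"], (int, float)):
        errors.append("price must be numeric")
    return errors

# Deterministic checks: no network calls, no agent runtime needed.
assert validate_output({"symbol": "BTC", "price": 65000}) == []
assert validate_output({"symbol": "BTC"}) == ["missing keys: ['price']"]
```

Returning a list of problems instead of raising keeps the function easy to assert against and gives failing tests a readable diff.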

Implementation

In practice, unit testing for agents follows a simple rule: one behavior, one test, controlled conditions. The examples below are schematic and not tied to a specific framework.

The unit level is not suitable for evaluating overall response quality, usefulness of the final output, or the agent's general "smartness". For that, use an eval harness and golden datasets.

How It Works In One Test

Unit test flow for agents: πŸ§ͺ test case (single behavior to verify) → πŸ”§ setup (fake tools and fixed runtime) → ▢️ run (execute one agent step or function) → πŸ“Œ assertions (tool choice, schema, stop reason) → deterministic verdict. βœ… Pass: safe to keep the behavior. πŸ” Fail: fix the logic and rerun.

Unit-test focus: 🎯 local logic and boundaries; 🚫 not end-to-end quality of full answers.
Short unit-test cycle
  • Test case - one behavior to validate.
  • Setup - fakes, mocks, and fixed conditions.
  • Run - execute a specific function or step.
  • Assertions - validate tool choice, schema, stop reason.

1. Isolate agent decision logic

PYTHON
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    """Pick a tool deterministically from the intent and the allowlist."""
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"

The fewer side dependencies a function has, the more stable the test.
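Because `choose_tool` is pure, its tests need no mocks at all. A sketch (the function is repeated so the example runs standalone; test names are illustrative):

```python
def choose_tool(intent: str, tools_allowed: list[str]) -> str:
    # Repeated from above so this example is self-contained.
    if intent == "price_lookup" and "crypto_price_api" in tools_allowed:
        return "crypto_price_api"
    return "web_search"

def test_price_intent_uses_price_api_when_allowed():
    assert choose_tool("price_lookup", ["crypto_price_api"]) == "crypto_price_api"

def test_fallback_to_web_search():
    # A disallowed tool and an unrelated intent both fall back deterministically.
    assert choose_tool("price_lookup", []) == "web_search"
    assert choose_tool("small_talk", ["crypto_price_api"]) == "web_search"

test_price_intent_uses_price_api_when_allowed()
test_fallback_to_web_search()
```

Each test locks one branch of the decision, so a failure names the exact rule that regressed.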

2. Replace external tools

PYTHON
class FakeTools:
    """Stub tool layer: returns a fixed payload, never touches the network."""
    def crypto_price_api(self, symbol: str):
        return {"symbol": symbol, "price": 65000}

A unit test should validate agent logic, not availability of external APIs.
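The same idea covers the negative path: a fake that always fails lets you exercise error handling deterministically. A sketch, where `FailingTools` is a hypothetical counterpart to `FakeTools` matching the name used in the negative-scenario test later in this article:

```python
class FailingTools:
    """Fake tool layer whose calls always fail, for negative-path tests."""
    def crypto_price_api(self, symbol: str):
        raise TimeoutError(f"simulated upstream timeout for {symbol}")

# The failure mode is fixed and repeatable, so the test never flakes.
tools = FailingTools()
raised = False
try:
    tools.crypto_price_api("BTC")
except TimeoutError:
    raised = True
assert raised
```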

3. Validate more than final response text

PYTHON
def test_tool_selection_and_schema():
    tools = FakeTools()
    agent = Agent(tools=tools)
    result = agent.run("What is the price of BTC?")

    assert result.selected_tool == "crypto_price_api"
    assert isinstance(result.output, dict)
    assert result.output["symbol"] == "BTC"

It is better to lock structural invariants (selected_tool, schema, stop reason), not only final text.

4. Test negative scenarios

PYTHON
def test_tool_error_is_handled():
    tools = FailingTools()
    agent = Agent(tools=tools)
    result = agent.run("Find BTC price")

    assert result.stop_reason == "tool_error_handled"
    assert result.error is not None

Tool failures should have predictable and testable behavior.
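One way to make that behavior predictable is to convert every tool exception into an explicit stop reason at the step boundary. A minimal sketch under that assumption; `Result` and `run_tool_step` are illustrative names, not the article's runtime:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Result:
    stop_reason: str
    output: Any = None
    error: Optional[str] = None

def run_tool_step(tool_fn: Callable[..., Any], *args: Any) -> Result:
    """Execute one tool call and map any failure into a testable Result."""
    try:
        return Result(stop_reason="tool_done", output=tool_fn(*args))
    except Exception as exc:
        # Every failure lands in the same, assertable shape.
        return Result(stop_reason="tool_error_handled", error=str(exc))

def broken_tool(symbol: str) -> dict:
    raise TimeoutError(f"timeout fetching {symbol}")

result = run_tool_step(broken_tool, "BTC")
assert result.stop_reason == "tool_error_handled"
assert result.error is not None
```

Centralizing the try/except at the step boundary means the negative-scenario test above can assert on `stop_reason` instead of on exception internals.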

5. Integrate unit tests into CI

YAML
name: unit-tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/unit -q

If a test is slow or unstable, move it to eval harness or integration layer.

Typical Mistakes

Dependence on real APIs

The test fails not because of agent logic, but because of network or external service availability.

Typical cause: missing fakes or mocks for tools.

Testing only final text

The test is green, but it does not guarantee correct tool selection or output format.

Typical cause: no checks for selected_tool, schema, and stop reason.

Too much logic in one test

One test validates several scenarios at once, and when it fails it is unclear what exactly broke.

Typical cause: no "one test, one behavior" rule.

Unstable test environment

Even correct unit tests become noisy if dependencies, configuration, or tool replacements drift between runs.

Typical cause: unit tests still partially depend on real runtime or external calls.

Trying to cover everything via e2e runs

The team writes only large scenarios and skips basic local validation.

Typical cause: no clear split between unit, eval, and regression levels.

Summary

Quick take
  • Unit tests for agents validate local and deterministic logic.
  • Replace tools via fakes or mocks to remove network noise.
  • Lock structural checks: tool choice, schema, stop reason.
  • Fast unit tests should run in every PR.

FAQ

Q: Can unit tests replace eval harness?
A: No. Unit tests catch local failures, while eval harness validates full agent behavior on complete scenarios.

Q: Should we connect a real LLM in unit tests?
A: Prefer minimal usage. For unit level, deterministic logic with fakes or mocks and controlled conditions works better.

Q: What must every agent unit test verify?
A: Tool selection, output structure, error handling, and stop reason in negative scenarios.

Q: When should a test move from unit level to eval level?
A: When it depends on full scenario behavior, response-quality metrics, or baseline comparisons.

What Next

After unit level, add scenario validation through Eval Harness, and maintain a stable case set through Golden Datasets.

For version-to-version control, add Regression Testing. For production-incident analysis, use Replay and Debugging. Keep the full picture in Testing Strategy.

⏱️ 5 min read β€’ Updated March 13, 2026 β€’ Difficulty: β˜…β˜…β˜†
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.