Code-Execution Agent in Python: Full Example

Pattern Essence (Brief)

In the Code-Execution Agent pattern, the agent can propose code, but the code runs only through an execution boundary that enforces policies and limits.

For every proposed action, the system performs, in order:

action contract validation
policy check (language, imports, denied calls)
separate subprocess execution (best-effort, not a security sandbox) with timeout/output limits
output validation against contract


What This Example Demonstrates

the agent forms a proposed_action with code and payload
runtime takes allowed_languages/max_code_chars/timeout from policy_hints (with fallback defaults)
network_access=denied in this demo is a contract-level requirement (real network enforcement requires container/jail)
the policy allowlist and the execution allowlist are separate: policy may permit python and javascript, while the runtime actually executes only python
the static code check blocks unsafe imports and calls, indirection primitives such as getattr, and any dunder reference (__*)
static policy blocks only obvious URL literals (http://, https://) and is not network enforcement
DENIED_GLOBAL_NAMES is applied only to untrusted generated code; host-side gateway code may import os/subprocess/sys to run the boundary
code runs in a separate subprocess boundary (best-effort, not a security sandbox) with timeout
output caps in the demo are checked only after the process completes
output is validated by schema before forming the business response
trace/history provide auditability of the full cycle plan -> policy -> execute -> finalize
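
To make the indirection and dunder rule concrete, here is a minimal sketch of that part of the static check. It is a reduced illustration, not the full gateway.py logic; the function name find_indirection and the set name DENIED_NAMES are hypothetical and chosen only for this example.

```python
import ast

# Hypothetical, reduced deny-list for illustration (gateway.py keeps its own sets).
DENIED_NAMES = {"builtins", "getattr", "setattr", "delattr", "globals", "locals", "vars", "dir"}


def find_indirection(code: str) -> list[str]:
    """Return sorted violations for denied names and any dunder name or attribute."""
    violations: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            # Bare identifiers: getattr, globals, __import__, __builtins__, ...
            if node.id.startswith("__") or node.id in DENIED_NAMES:
                violations.add(f"name_not_allowed:{node.id}")
        elif isinstance(node, ast.Attribute):
            # Attribute access: obj.__class__, obj.__subclasses__, ...
            if node.attr.startswith("__"):
                violations.add(f"attr_not_allowed:{node.attr}")
    return sorted(violations)


if __name__ == "__main__":
    print(find_indirection("fn = getattr(obj, 'sys' + 'tem')"))
    print(find_indirection("().__class__.__bases__"))
    print(find_indirection("import json\nprint(json.dumps({}))"))
```

The check works on the syntax tree, so string tricks such as 'sys' + 'tem' do not need to be decoded: the reference to getattr itself is what gets rejected. This remains a heuristic and does not replace process-level isolation.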

Architecture

agent.py generates an action with Python code to compute incident metrics.
gateway.py validates the action contract and runs the policy check.
If the policy check passes, the execution layer runs the code in a separate subprocess boundary (best-effort, not a security sandbox) with limits.
The result is parsed as JSON and validated against the required schema.
main.py assembles the aggregate and trace/history and returns an operations brief.

What You Will See on Run

step 1: code plan (code_hash, chars)
step 2: policy decision (allow/deny + reason)
step 3: execution metrics (exec_ms, stdout_bytes, stderr_bytes)
step 4: validated metrics -> final brief

Project Structure

TEXT

agent-patterns/
└── code-execution-agent/
    └── python/
        ├── main.py
        ├── gateway.py
        ├── agent.py
        ├── context.py
        ├── README.md
        └── requirements.txt

How to run

BASH

git clone https://github.com/AgentPatterns-tech/agentpatterns.git
cd agentpatterns

cd agent-patterns/code-execution-agent/python
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Python 3.11+ is required.

Option via export:

BASH

export OPENAI_API_KEY="sk-..."
# optional:
# export OPENAI_MODEL="gpt-4.1-mini"
# export OPENAI_TIMEOUT_SECONDS="60"

python main.py

Option via .env (optional)

BASH

cat > .env <<'EOF'
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4.1-mini
OPENAI_TIMEOUT_SECONDS=60
EOF

set -a
source .env
set +a

python main.py

This is the shell variant (macOS/Linux). On Windows, set the variables with set (cmd) or $env: (PowerShell), or use python-dotenv if you prefer.
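
If you would rather not depend on the shell at all, a small stdlib-only loader can do the same job on any platform. The sketch below is an assumption for illustration and not part of the repository: load_env_file is a hypothetical helper, it handles only simple KEY=VALUE lines, and it does not overwrite variables that are already set.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Load simple KEY=VALUE lines into os.environ; return what was added."""
    loaded: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop optional surrounding quotes; no escape handling in this sketch.
        value = value.strip().strip('"').strip("'")
        # Variables already present in the environment win.
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling load_env_file() before the rest of the program reads its configuration would have the same effect as the set -a / source .env sequence above.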

Task

Production case:

"Safely execute code to compute health metrics of a payment incident and return an operations brief."

Code

`context.py` — request envelope

PYTHON

from __future__ import annotations

from typing import Any


def build_request(*, report_date: str, region: str, incident_id: str) -> dict[str, Any]:
    transactions: list[dict[str, Any]] = []
    for idx in range(60):
        is_failed = idx in {7, 34}
        transactions.append(
            {
                "transaction_id": f"txn_{idx + 1:03d}",
                "status": "failed" if is_failed else "paid",
                "chargeback": idx == 34,
                "latency_ms": 150 + (idx % 40) + (25 if is_failed else 0),
            }
        )

    return {
        "request": {
            "report_date": report_date,
            "region": region.upper(),
            "incident_id": incident_id,
            "transactions": transactions,
        },
        "policy_hints": {
            "allowed_languages": ["python", "javascript"],
            "max_code_chars": 2400,
            "exec_timeout_seconds": 2.0,
            "network_access": "denied",
        },
    }

`agent.py` — proposed action + final answer

PYTHON

from __future__ import annotations

from typing import Any


def propose_code_execution_plan(*, goal: str, request: dict[str, Any]) -> dict[str, Any]:
    req = request["request"]
    del goal

    code = """
import json
import statistics

payload = json.loads(input())
rows = payload["transactions"]

total = len(rows)
failed = sum(1 for row in rows if row["status"] != "paid")
chargeback_alerts = sum(1 for row in rows if row.get("chargeback") is True)
failed_rate = (failed / total) if total else 0.0

latencies = [float(row["latency_ms"]) for row in rows]
avg_latency = statistics.fmean(latencies) if latencies else 0.0

if latencies:
    sorted_latencies = sorted(latencies)
    p95_idx = int(round((len(sorted_latencies) - 1) * 0.95))
    p95_latency = sorted_latencies[p95_idx]
else:
    p95_latency = 0.0

severity = "P1" if failed_rate >= 0.03 else "P2"
eta_minutes = 45 if severity == "P1" else 20

print(
    json.dumps(
        {
            "failed_payment_rate": failed_rate,
            "chargeback_alerts": chargeback_alerts,
            "incident_severity": severity,
            "eta_minutes": eta_minutes,
            "affected_checkout_share": failed_rate,
            "avg_latency_ms": avg_latency,
            "p95_latency_ms": p95_latency,
            "sample_size": total,
            "incident_id": payload["incident_id"],
            "region": payload["region"],
        },
        separators=(",", ":"),
    )
)
""".strip()

    return {
        "action": {
            "id": "c1",
            "language": "python",
            "entrypoint": "main.py",
            "code": code,
            "input_payload": {
                "incident_id": req["incident_id"],
                "region": req["region"],
                "transactions": req["transactions"],
            },
        }
    }


def compose_final_answer(
    *,
    request: dict[str, Any],
    aggregate: dict[str, Any],
    execution_summary: dict[str, Any],
) -> str:
    req = request["request"]
    metrics = aggregate["metrics"]

    return (
        f"Code execution brief ({req['region']}, {req['report_date']}): incident {req['incident_id']} is "
        f"{metrics['incident_severity']} with failed payments at {metrics['failed_payment_rate_pct']}% and "
        f"{metrics['chargeback_alerts']} chargeback alerts. Affected checkout share is "
        f"{metrics['affected_checkout_share_pct']}%, average latency is {metrics['avg_latency_ms']} ms "
        f"(p95 {metrics['p95_latency_ms']} ms), and ETA is ~{metrics['eta_minutes']} minutes. "
        f"Executed in a separate subprocess boundary (best-effort, not a security sandbox) "
        f"({execution_summary['exec_ms']} ms, {execution_summary['stdout_bytes']} stdout bytes, "
        f"{execution_summary['stderr_bytes']} stderr bytes)."
    )

`gateway.py` — policy + execution + output validation

PYTHON

from __future__ import annotations

import ast
import hashlib
import json
import os
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any


class StopRun(Exception):
    def __init__(self, reason: str, *, details: dict[str, Any] | None = None):
        super().__init__(reason)
        self.reason = reason
        self.details = details or {}


@dataclass(frozen=True)
class Budget:
    max_seconds: int = 25
    max_code_chars: int = 2400
    exec_timeout_seconds: float = 2.0
    max_stdout_bytes: int = 4096
    max_stderr_bytes: int = 4096


@dataclass(frozen=True)
class Decision:
    kind: str
    reason: str


ALLOWED_IMPORTS = {"json", "statistics"}
# Blocked function calls in generated code.
DENIED_CALL_NAMES = {"exec", "eval", "compile", "__import__", "open"}
# NOTE: heuristic hardening — blocks suspicious attribute names regardless of receiver type.
DENIED_CALL_ATTRS = {"system", "popen", "fork", "spawn", "connect", "request", "urlopen"}
# Blocked global name references in generated code (introspection/indirection primitives).
DENIED_NAME_REFERENCES = {
    "builtins",
    "getattr",
    "setattr",
    "delattr",
    "globals",
    "locals",
    "vars",
    "dir",
}
# Blocked identifier references in generated code (belt-and-suspenders hardening).
DENIED_GLOBAL_NAMES = {"os", "sys", "subprocess", "socket", "pathlib", "importlib"}


def code_hash(code: str) -> str:
    return hashlib.sha256(code.encode("utf-8")).hexdigest()[:16]


def validate_code_action(raw: Any, *, max_code_chars: int) -> dict[str, Any]:
    if not isinstance(raw, dict):
        raise StopRun("invalid_action:not_object")

    action_id = raw.get("id")
    language = raw.get("language")
    entrypoint = raw.get("entrypoint")
    code = raw.get("code")
    input_payload = raw.get("input_payload")

    if not isinstance(action_id, str) or not action_id.strip():
        raise StopRun("invalid_action:id")
    if not isinstance(language, str) or not language.strip():
        raise StopRun("invalid_action:language")
    if not isinstance(entrypoint, str) or not entrypoint.strip():
        raise StopRun("invalid_action:entrypoint")
    if not isinstance(code, str) or not code.strip():
        raise StopRun("invalid_action:code")
    if len(code) > max_code_chars:
        raise StopRun("invalid_action:code_too_long")
    if not isinstance(input_payload, dict):
        raise StopRun("invalid_action:input_payload")

    normalized_entrypoint = entrypoint.strip()
    if "/" in normalized_entrypoint or "\\" in normalized_entrypoint:
        raise StopRun("invalid_action:entrypoint_path")
    if normalized_entrypoint != "main.py":
        raise StopRun("invalid_action:entrypoint_denied")

    return {
        "id": action_id.strip(),
        "language": language.strip().lower(),
        "entrypoint": normalized_entrypoint,
        "code": code,
        "input_payload": input_payload,
    }


def validate_execution_output(raw: Any) -> dict[str, Any]:
    if not isinstance(raw, dict):
        raise StopRun("invalid_code_output:not_object")

    required = {
        "failed_payment_rate",
        "chargeback_alerts",
        "incident_severity",
        "eta_minutes",
        "affected_checkout_share",
        "avg_latency_ms",
        "p95_latency_ms",
        "sample_size",
        "incident_id",
        "region",
    }
    if not required.issubset(set(raw.keys())):
        raise StopRun("invalid_code_output:missing_required")

    failed_rate = raw["failed_payment_rate"]
    if not isinstance(failed_rate, (int, float)) or not (0 <= float(failed_rate) <= 1):
        raise StopRun("invalid_code_output:failed_payment_rate")

    share = raw["affected_checkout_share"]
    if not isinstance(share, (int, float)) or not (0 <= float(share) <= 1):
        raise StopRun("invalid_code_output:affected_checkout_share")

    chargeback_alerts = raw["chargeback_alerts"]
    if not isinstance(chargeback_alerts, int) or chargeback_alerts < 0:
        raise StopRun("invalid_code_output:chargeback_alerts")

    severity = raw["incident_severity"]
    if severity not in {"P1", "P2", "P3"}:
        raise StopRun("invalid_code_output:incident_severity")

    eta = raw["eta_minutes"]
    if not isinstance(eta, int) or eta < 0 or eta > 240:
        raise StopRun("invalid_code_output:eta_minutes")

    sample_size = raw["sample_size"]
    if not isinstance(sample_size, int) or sample_size <= 0:
        raise StopRun("invalid_code_output:sample_size")

    # float() raises OverflowError for very large JSON integers, so catch it too.
    try:
        avg_latency = round(float(raw["avg_latency_ms"]), 2)
    except (TypeError, ValueError, OverflowError) as exc:
        raise StopRun("invalid_code_output:avg_latency_ms") from exc
    try:
        p95_latency = round(float(raw["p95_latency_ms"]), 2)
    except (TypeError, ValueError, OverflowError) as exc:
        raise StopRun("invalid_code_output:p95_latency_ms") from exc

    return {
        "failed_payment_rate": float(failed_rate),
        "chargeback_alerts": chargeback_alerts,
        "incident_severity": severity,
        "eta_minutes": eta,
        "affected_checkout_share": float(share),
        "avg_latency_ms": avg_latency,
        "p95_latency_ms": p95_latency,
        "sample_size": sample_size,
        "incident_id": str(raw["incident_id"]),
        "region": str(raw["region"]).upper(),
    }


def _static_policy_violations(code: str) -> list[str]:
    lower = code.lower()
    violations: list[str] = []
    input_calls = 0

    if "http://" in lower or "https://" in lower:
        violations.append("network_literal_blocked")

    try:
        tree = ast.parse(code)
    except SyntaxError:
        return ["syntax_error"]

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                module = alias.name.split(".")[0]
                if module not in ALLOWED_IMPORTS:
                    violations.append(f"import_not_allowed:{module}")
        elif isinstance(node, ast.ImportFrom):
            module = (node.module or "").split(".")[0]
            if module not in ALLOWED_IMPORTS:
                violations.append(f"import_not_allowed:{module or 'relative'}")
        elif isinstance(node, ast.Name):
            if node.id.startswith("__"):
                violations.append(f"name_not_allowed:{node.id}")
            elif node.id in DENIED_NAME_REFERENCES:
                violations.append(f"name_not_allowed:{node.id}")
            elif node.id in DENIED_GLOBAL_NAMES:
                violations.append(f"name_not_allowed:{node.id}")
        elif isinstance(node, ast.Attribute):
            if node.attr.startswith("__"):
                violations.append(f"attr_not_allowed:{node.attr}")
            elif node.attr in DENIED_CALL_ATTRS:
                violations.append(f"attr_not_allowed:{node.attr}")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in DENIED_CALL_NAMES:
                violations.append(f"call_not_allowed:{node.func.id}")
            elif isinstance(node.func, ast.Name) and node.func.id in {"getattr", "setattr", "delattr"}:
                violations.append(f"call_not_allowed:{node.func.id}")
            elif isinstance(node.func, ast.Name) and node.func.id == "input":
                input_calls += 1
            elif isinstance(node.func, ast.Attribute) and node.func.attr in DENIED_CALL_ATTRS:
                violations.append(f"call_not_allowed:{node.func.attr}")

    if input_calls > 1:
        violations.append("too_many_input_reads")

    return sorted(set(violations))


class CodeExecutionGateway:
    def __init__(
        self,
        *,
        allowed_languages_policy: set[str],
        allowed_languages_execution: set[str],
        budget: Budget,
    ):
        self.allowed_languages_policy = {item.lower() for item in allowed_languages_policy}
        self.allowed_languages_execution = {item.lower() for item in allowed_languages_execution}
        self.budget = budget

    def evaluate(self, *, action: dict[str, Any]) -> Decision:
        language = action["language"]
        if language not in self.allowed_languages_policy:
            return Decision(kind="deny", reason="language_denied_policy")
        if language not in self.allowed_languages_execution:
            return Decision(kind="deny", reason="language_denied_execution")

        violations = _static_policy_violations(action["code"])
        if violations:
            return Decision(kind="deny", reason=f"static_policy_violation:{','.join(violations[:3])}")
        return Decision(kind="allow", reason="policy_pass")

    def execute_python(self, *, code: str, entrypoint: str, input_payload: dict[str, Any]) -> dict[str, Any]:
        if entrypoint != "main.py":
            raise StopRun("invalid_action:entrypoint_denied")

        with tempfile.TemporaryDirectory(prefix="code_exec_agent_") as temp_dir:
            script_path = Path(temp_dir) / entrypoint
            script_path.write_text(code, encoding="utf-8")

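            # Minimal hardening for interpreter behavior in this demo boundary.
            # NOTE: the child inherits the full host environment, including secrets
            # such as OPENAI_API_KEY; a production boundary should pass a minimal env.
            proc_env = os.environ.copy()
            proc_env["PYTHONNOUSERSITE"] = "1"
            proc_env["PYTHONDONTWRITEBYTECODE"] = "1"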

            started = time.monotonic()
            try:
                proc = subprocess.run(
                    [sys.executable, str(script_path)],
                    input=json.dumps(input_payload),
                    text=True,
                    encoding="utf-8",
                    errors="replace",
                    capture_output=True,
                    timeout=self.budget.exec_timeout_seconds,
                    cwd=temp_dir,
                    env=proc_env,
                )
            except subprocess.TimeoutExpired as exc:
                raise StopRun("code_timeout") from exc

            exec_ms = int((time.monotonic() - started) * 1000)
            stdout = proc.stdout or ""
            stderr = proc.stderr or ""
            # Demo limitation: size caps are checked after process completion.
            stdout_bytes = len(stdout.encode("utf-8"))
            stderr_bytes = len(stderr.encode("utf-8"))
            if stdout_bytes > self.budget.max_stdout_bytes:
                raise StopRun("code_output_too_large")
            if stderr_bytes > self.budget.max_stderr_bytes:
                raise StopRun("code_stderr_too_large")
            if proc.returncode != 0:
                stderr_snippet = stderr.strip().replace("\n", " ")[:200]
                stdout_snippet = stdout.strip().replace("\n", " ")[:200]
                details: dict[str, str] = {}
                if stderr_snippet:
                    details["stderr_snippet"] = stderr_snippet
                if stdout_snippet:
                    details["stdout_snippet"] = stdout_snippet
                raise StopRun(
                    f"code_runtime_error:{proc.returncode}",
                    details=details,
                )

            stdout = stdout.strip()
            if not stdout:
                raise StopRun("invalid_code_output:empty_stdout")

            try:
                payload = json.loads(stdout)
            except json.JSONDecodeError as exc:
                raise StopRun("invalid_code_output:non_json") from exc

            if not isinstance(payload, dict):
                raise StopRun("invalid_code_output:not_object")

            return {
                "payload": payload,
                "exec_ms": exec_ms,
                "stdout_bytes": stdout_bytes,
                "stderr_bytes": stderr_bytes,
            }

`main.py` — orchestrate code-execution flow

PYTHON

from __future__ import annotations

import json
import math
import time
import uuid
from typing import Any

from agent import compose_final_answer, propose_code_execution_plan
from context import build_request
from gateway import (
    Budget,
    CodeExecutionGateway,
    StopRun,
    code_hash,
    validate_code_action,
    validate_execution_output,
)

GOAL = (
    "Run a safe code task to compute incident metrics from payment transactions "
    "and return an operations-ready summary."
)
REQUEST = build_request(
    report_date="2026-03-07",
    region="US",
    incident_id="inc_payments_20260307",
)

DEFAULT_BUDGET = Budget(
    max_seconds=25,
    max_code_chars=2400,
    exec_timeout_seconds=2.0,
    max_stdout_bytes=4096,
    max_stderr_bytes=4096,
)

DEFAULT_ALLOWED_LANGUAGES_POLICY = {"python", "javascript"}
ALLOWED_LANGUAGES_EXECUTION = {"python"}


def run_code_execution_agent(*, goal: str, request: dict[str, Any]) -> dict[str, Any]:
    run_id = str(uuid.uuid4())
    started = time.monotonic()
    trace: list[dict[str, Any]] = []
    history: list[dict[str, Any]] = []

    hints_raw = request.get("policy_hints")
    hints: dict[str, Any] = hints_raw if isinstance(hints_raw, dict) else {}
    network_access = str(hints.get("network_access", "denied")).strip().lower()
    if network_access not in {"denied", "none", "off"}:
        return {
            "run_id": run_id,
            "status": "stopped",
            "stop_reason": "invalid_request:network_access_must_be_denied",
            "phase": "plan",
            "trace": trace,
            "history": history,
        }

    allowed_policy_raw = hints.get("allowed_languages")
    if isinstance(allowed_policy_raw, list):
        allowed_policy = {
            str(item).strip().lower()
            for item in allowed_policy_raw
            if isinstance(item, str) and item.strip()
        }
    else:
        allowed_policy = set(DEFAULT_ALLOWED_LANGUAGES_POLICY)
    if not allowed_policy:
        allowed_policy = set(DEFAULT_ALLOWED_LANGUAGES_POLICY)

    max_code_chars_raw = hints.get("max_code_chars", DEFAULT_BUDGET.max_code_chars)
    exec_timeout_raw = hints.get("exec_timeout_seconds", DEFAULT_BUDGET.exec_timeout_seconds)
    try:
        max_code_chars = int(max_code_chars_raw)
    except (TypeError, ValueError):
        max_code_chars = DEFAULT_BUDGET.max_code_chars
    try:
        exec_timeout_seconds = float(exec_timeout_raw)
        if not math.isfinite(exec_timeout_seconds):
            raise ValueError
    except (TypeError, ValueError):
        exec_timeout_seconds = DEFAULT_BUDGET.exec_timeout_seconds

    budget = Budget(
        max_seconds=DEFAULT_BUDGET.max_seconds,
        max_code_chars=max(200, min(8000, max_code_chars)),
        exec_timeout_seconds=max(0.2, min(20.0, exec_timeout_seconds)),
        max_stdout_bytes=DEFAULT_BUDGET.max_stdout_bytes,
        max_stderr_bytes=DEFAULT_BUDGET.max_stderr_bytes,
    )

    gateway = CodeExecutionGateway(
        allowed_languages_policy=allowed_policy,
        allowed_languages_execution=ALLOWED_LANGUAGES_EXECUTION,
        budget=budget,
    )

    def stopped(stop_reason: str, *, phase: str, **extra: Any) -> dict[str, Any]:
        payload = {
            "run_id": run_id,
            "status": "stopped",
            "stop_reason": stop_reason,
            "phase": phase,
            "trace": trace,
            "history": history,
        }
        payload.update(extra)
        return payload

    phase = "plan"
    try:
        if (time.monotonic() - started) > budget.max_seconds:
            return stopped("max_seconds", phase=phase)

        raw_plan = propose_code_execution_plan(goal=goal, request=request)
        action = validate_code_action(raw_plan.get("action"), max_code_chars=budget.max_code_chars)
        generated_code_hash = code_hash(action["code"])

        trace.append(
            {
                "step": 1,
                "phase": "plan_code",
                "action_id": action["id"],
                "language": action["language"],
                "code_hash": generated_code_hash,
                "chars": len(action["code"]),
                "ok": True,
            }
        )
        history.append(
            {
                "step": 1,
                "action": "propose_code_execution_plan",
                "proposed_action": {
                    "id": action["id"],
                    "language": action["language"],
                    "entrypoint": action["entrypoint"],
                    "code_hash": generated_code_hash,
                },
            }
        )

        phase = "policy_check"
        decision = gateway.evaluate(action=action)
        trace.append(
            {
                "step": 2,
                "phase": "policy_check",
                "decision": decision.kind,
                "reason": decision.reason,
                "allowed_languages_policy": sorted(allowed_policy),
                "allowed_languages_execution": sorted(ALLOWED_LANGUAGES_EXECUTION),
                "ok": decision.kind == "allow",
            }
        )
        history.append(
            {
                "step": 2,
                "action": "policy_check",
                "decision": {
                    "kind": decision.kind,
                    "reason": decision.reason,
                },
            }
        )
        if decision.kind != "allow":
            return stopped(f"policy_block:{decision.reason}", phase=phase)

        if (time.monotonic() - started) > budget.max_seconds:
            return stopped("max_seconds", phase="execute")

        phase = "execute"
        execute_trace = {
            "step": 3,
            "phase": "execute_code",
            "language": action["language"],
            "code_hash": generated_code_hash,
            "ok": False,
        }
        trace.append(execute_trace)
        try:
            execution = gateway.execute_python(
                code=action["code"],
                entrypoint=action["entrypoint"],
                input_payload=action["input_payload"],
            )
            validated = validate_execution_output(execution["payload"])
        except StopRun as exc:
            execute_trace["error"] = exc.reason
            details = exc.details if isinstance(exc.details, dict) else {}
            stderr_snippet = str(details.get("stderr_snippet", "")).strip()
            stdout_snippet = str(details.get("stdout_snippet", "")).strip()
            if stderr_snippet:
                execute_trace["stderr_snippet"] = stderr_snippet
            if stdout_snippet:
                execute_trace["stdout_snippet"] = stdout_snippet
            history.append(
                {
                    "step": 3,
                    "action": "execute_code",
                    "status": "error",
                    "reason": exc.reason,
                    **({"stderr_snippet": stderr_snippet} if stderr_snippet else {}),
                    **({"stdout_snippet": stdout_snippet} if stdout_snippet else {}),
                }
            )
            raise

        execute_trace["stdout_bytes"] = execution["stdout_bytes"]
        execute_trace["stderr_bytes"] = execution["stderr_bytes"]
        execute_trace["exec_ms"] = execution["exec_ms"]
        execute_trace["ok"] = True
        history.append(
            {
                "step": 3,
                "action": "execute_code",
                "result": validated,
            }
        )

        aggregate = {
            "report_date": request["request"]["report_date"],
            "region": request["request"]["region"],
            "incident_id": request["request"]["incident_id"],
            "metrics": {
                "incident_severity": validated["incident_severity"],
                "failed_payment_rate": round(validated["failed_payment_rate"], 6),
                "failed_payment_rate_pct": round(validated["failed_payment_rate"] * 100, 2),
                "chargeback_alerts": validated["chargeback_alerts"],
                "eta_minutes": validated["eta_minutes"],
                "affected_checkout_share": round(validated["affected_checkout_share"], 6),
                "affected_checkout_share_pct": round(validated["affected_checkout_share"] * 100, 2),
                "avg_latency_ms": validated["avg_latency_ms"],
                "p95_latency_ms": validated["p95_latency_ms"],
                "sample_size": validated["sample_size"],
            },
        }
        execution_summary = {
            "language": action["language"],
            "code_hash": generated_code_hash,
            "exec_ms": execution["exec_ms"],
            "stdout_bytes": execution["stdout_bytes"],
            "stderr_bytes": execution["stderr_bytes"],
        }
        answer = compose_final_answer(
            request=request,
            aggregate=aggregate,
            execution_summary=execution_summary,
        )

        trace.append(
            {
                "step": 4,
                "phase": "finalize",
                "ok": True,
            }
        )
        history.append(
            {
                "step": 4,
                "action": "finalize",
            }
        )

        return {
            "run_id": run_id,
            "status": "ok",
            "stop_reason": "success",
            "outcome": "code_execution_success",
            "answer": answer,
            "proposed_action": {
                "id": action["id"],
                "language": action["language"],
                "entrypoint": action["entrypoint"],
                "code_hash": generated_code_hash,
            },
            "aggregate": aggregate,
            "execution": execution_summary,
            "trace": trace,
            "history": history,
        }
    except StopRun as exc:
        return stopped(exc.reason, phase=phase)


def main() -> None:
    result = run_code_execution_agent(goal=GOAL, request=REQUEST)
    print(json.dumps(result, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    main()

What matters most here (in plain words)

the agent may propose code, but the execution boundary decides whether it can be run at all
policy check blocks unsafe code before execution
policy also blocks any dunder references (__*) as demo hardening
execution runs in a separate subprocess boundary (best-effort, not a security sandbox) with a timeout and output limits
output is not trusted "as is": it goes through schema validation
business response is formed only from validated output

Example Output

JSON

{
  "run_id": "5506a01d-01a1-47e8-88c9-24eb0d1dff39",
  "status": "ok",
  "stop_reason": "success",
  "outcome": "code_execution_success",
  "answer": "Code execution brief (US, 2026-03-07): incident inc_payments_20260307 is P1 with failed payments at 3.33% and 1 chargeback alerts. Affected checkout share is 3.33%, average latency is 167.0 ms (p95 187.0 ms), and ETA is ~45 minutes. Executed in a separate subprocess boundary (best-effort, not a security sandbox) (25 ms, 269 stdout bytes, 0 stderr bytes).",
  "proposed_action": {
    "id": "c1",
    "language": "python",
    "entrypoint": "main.py",
    "code_hash": "47a5c01dd74664c7"
  },
  "aggregate": {
    "report_date": "2026-03-07",
    "region": "US",
    "incident_id": "inc_payments_20260307",
    "metrics": {
      "incident_severity": "P1",
      "failed_payment_rate": 0.033333,
      "failed_payment_rate_pct": 3.33,
      "chargeback_alerts": 1,
      "eta_minutes": 45,
      "affected_checkout_share": 0.033333,
      "affected_checkout_share_pct": 3.33,
      "avg_latency_ms": 167.0,
      "p95_latency_ms": 187.0,
      "sample_size": 60
    }
  },
  "execution": {
    "language": "python",
    "code_hash": "47a5c01dd74664c7",
    "exec_ms": 25,
    "stdout_bytes": 269,
    "stderr_bytes": 0
  },
  "trace": [
    {
      "step": 1,
      "phase": "plan_code",
      "action_id": "c1",
      "language": "python",
      "code_hash": "47a5c01dd74664c7",
      "chars": 1229,
      "ok": true
    },
    {
      "step": 2,
      "phase": "policy_check",
      "decision": "allow",
      "reason": "policy_pass",
      "allowed_languages_policy": ["javascript", "python"],
      "allowed_languages_execution": ["python"],
      "ok": true
    },
    {
      "step": 3,
      "phase": "execute_code",
      "language": "python",
      "code_hash": "47a5c01dd74664c7",
      "ok": true,
      "stdout_bytes": 269,
      "stderr_bytes": 0,
      "exec_ms": 25
    },
    {
      "step": 4,
      "phase": "finalize",
      "ok": true
    }
  ],
  "history": [{...}]
}

Typical `stop_reason`

success - run completed correctly
max_seconds - total time budget exhausted
invalid_action:* - invalid action contract
invalid_action:entrypoint_path - entrypoint contains / or \ (path traversal attempt)
invalid_action:entrypoint_denied - entrypoint is not in allowlist (only main.py is allowed)
policy_block:language_denied_policy - language is denied by policy allowlist
policy_block:language_denied_execution - language is denied by execution allowlist
policy_block:static_policy_violation:* - static code check found a denied operation
code_timeout - code execution exceeded timeout
code_runtime_error:* - code exited with a runtime error
code_output_too_large - stdout exceeded allowed output budget
code_stderr_too_large - stderr exceeded allowed output budget
invalid_code_output:* - code result failed schema validation
invalid_request:network_access_must_be_denied - this example accepts only network_access=denied
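
Because every stop_reason starts with a stable prefix before the first colon, a caller can group them for alerting or retry decisions. The sketch below is one possible grouping; the category names and the helper classify_stop_reason are assumptions for illustration and do not exist in the example code.

```python
# Hypothetical mapping from stop_reason prefix to a coarse category.
STOP_REASON_CATEGORIES = {
    "success": "success",
    "max_seconds": "budget",
    "invalid_request": "contract",
    "invalid_action": "contract",
    "policy_block": "policy",
    "code_timeout": "execution",
    "code_runtime_error": "execution",
    "code_output_too_large": "execution",
    "code_stderr_too_large": "execution",
    "invalid_code_output": "output",
}


def classify_stop_reason(stop_reason: str) -> str:
    """Map a stop_reason such as 'policy_block:language_denied_policy' to a category."""
    prefix = stop_reason.split(":", 1)[0]
    return STOP_REASON_CATEGORIES.get(prefix, "unknown")


if __name__ == "__main__":
    for reason in ("success", "policy_block:language_denied_execution", "code_runtime_error:1"):
        print(reason, "->", classify_stop_reason(reason))
```

A grouping like this keeps dashboards stable even when new detailed reasons are added under an existing prefix.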

What Is NOT Shown Here

container-level isolation (seccomp/cgroup/jail)
filesystem / cpu / memory isolation (this example is not a security sandbox)
real network isolation at runtime (here network_access=denied is only a contract-level check)
streaming caps with kill-on-limit for stdout/stderr (in the demo, the caps are checked only after the process completes)
artifact storage for generated code/result snapshots
multi-step repair loop (regenerating code after error)
human approval for risky execution plans
full DoS mitigation (file-spam / heavy allocations / algorithmic bombs)
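
As an illustration of one item from this list, a streaming stdout cap with kill-on-limit could look roughly like the sketch below. It is an assumption-laden outline, not the example's implementation: run_with_stdout_cap is a hypothetical helper, it caps only stdout (stderr is discarded), it passes no stdin, and it is still not a security sandbox.

```python
import subprocess
import sys
import threading


def run_with_stdout_cap(
    argv: list[str], *, max_stdout_bytes: int, timeout_seconds: float
) -> tuple[str, bytes]:
    """Run argv; kill the child when stdout exceeds the cap or the timeout fires."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    timed_out = threading.Event()

    def on_timeout() -> None:
        timed_out.set()
        proc.kill()

    timer = threading.Timer(timeout_seconds, on_timeout)
    timer.start()

    chunks: list[bytes] = []
    total = 0
    status = "ok"
    try:
        while True:
            chunk = proc.stdout.read(1024)  # returns b"" at EOF (also after a kill)
            if not chunk:
                break
            total += len(chunk)
            if total > max_stdout_bytes:
                # Stop reading and terminate the child as soon as the cap is crossed.
                status = "code_output_too_large"
                proc.kill()
                break
            chunks.append(chunk)
    finally:
        timer.cancel()
        proc.stdout.close()
        proc.wait()

    if timed_out.is_set():
        status = "code_timeout"
    return status, b"".join(chunks)


if __name__ == "__main__":
    noisy = [sys.executable, "-c", "print('x' * 100000)"]
    print(run_with_stdout_cap(noisy, max_stdout_bytes=2048, timeout_seconds=5.0)[0])
```

Unlike the post-completion check in the demo, this variant never buffers more than the cap plus one chunk, which bounds host memory when generated code floods stdout.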

What to Try Next

Replace language with javascript and observe policy_block:language_denied_execution.
Add import os to code and observe policy_block:static_policy_violation:import_not_allowed:os.
Add while True: pass to code and observe code_timeout.
Return "sample_size": 0 in script result and observe invalid_code_output:sample_size.
Return "incident_severity": "P0" in script result and observe invalid_code_output:incident_severity.
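
Before editing agent.py for the import experiment, you can preview the violation string with a standalone version of the import allowlist check. This is a reduced sketch under stated assumptions: import_violations is a hypothetical name, and ALLOWED_IMPORTS is copied here only to mirror the set in gateway.py.

```python
import ast

ALLOWED_IMPORTS = {"json", "statistics"}  # mirrors the allowlist in gateway.py


def import_violations(code: str) -> list[str]:
    """Return sorted import_not_allowed:* entries for imports outside the allowlist."""
    violations: set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            # "import os.path" is judged by its top-level package, "os".
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # "from . import x" has no module name and is reported as "relative".
            modules = [(node.module or "").split(".")[0] or "relative"]
        else:
            continue
        violations.update(
            f"import_not_allowed:{module}" for module in modules if module not in ALLOWED_IMPORTS
        )
    return sorted(violations)


if __name__ == "__main__":
    print(import_violations("import os\nimport json"))
```

In the full run, the gateway prefixes this entry with policy_block:static_policy_violation: to form the stop_reason.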
