Assumption-Carrying Execution

Turn agent assumptions into executable runtime leases.

ACE is a deterministic pre-execution gate for agent side effects. It checks the action hash, approval state, expiry, and evidence predicates before tools can mutate the world.

ACE preflight 100.0%

Correct on 6,114 / 6,114 policy probes from ST-WebAgentBench.

Execute-all baseline 50.0%

Executing every action without a policy check gets compliant probes right and violation probes wrong.

Keyword guard 61.7%

A simple lexical safety filter misses benchmark-specific predicates.
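
The three scores above are accuracies over labeled probes. As a rough illustration of how such a comparison can be computed, here is a minimal sketch. The Probe type, the two baseline functions, and the four toy probes are hypothetical; they are not the ST-WebAgentBench schema or the ace-runtime scorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Probe:
    """Hypothetical labeled probe: one proposed action plus the expected decision."""
    description: str
    should_permit: bool  # True for a compliant probe, False for a violation


def accuracy(decide: Callable[[Probe], bool], probes: Sequence[Probe]) -> float:
    """Fraction of probes where the decision matches the label."""
    correct = sum(decide(p) == p.should_permit for p in probes)
    return correct / len(probes)


def execute_all(probe: Probe) -> bool:
    """Policy-blind baseline: permit everything."""
    return True


def keyword_guard(probe: Probe) -> bool:
    """Toy lexical filter: block anything that mentions 'delete'."""
    return "delete" not in probe.description.lower()


# Toy data for illustration only, balanced between compliant and violating.
TOY_PROBES = [
    Probe("read the order list", True),
    Probe("update a draft title", True),
    Probe("delete a customer record without consent", False),
    Probe("submit the form before approval", False),
]

if __name__ == "__main__":
    print("execute-all  ", accuracy(execute_all, TOY_PROBES))
    print("keyword guard", accuracy(keyword_guard, TOY_PROBES))
```

On these toy probes the execute-all baseline scores 0.5 and the keyword guard 0.75: the guard catches the violation that contains its keyword and misses the one that depends on approval state.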

How ACE works

The model may propose actions, but the final permission decision is deterministic. If the current evidence does not satisfy the lease, the action is denied or deferred.

Agent action → Lease → Evidence → ACE gate → Permit / deny / defer
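
The checks named above (action hash, approval state, expiry, evidence predicates) can be sketched as one pure function. This is a minimal illustration under stated assumptions, not the ace-runtime API: the Lease fields, the gate and action_hash names, and the choice of which failures map to deny versus defer are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

Evidence = Mapping[str, object]


def action_hash(action: Mapping[str, object]) -> str:
    """Hash a canonical JSON encoding so a lease binds to one exact action."""
    payload = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Lease:
    """Hypothetical lease: the assumptions under which an action was allowed."""
    action_hash: str
    approved: bool
    expires_at: float  # Unix timestamp in seconds
    predicates: Tuple[Callable[[Evidence], bool], ...] = ()


def gate(action: Mapping[str, object], lease: Lease,
         evidence: Evidence, now: float) -> str:
    """Return 'permit', 'deny', or 'defer'. No model call, no randomness."""
    if action_hash(action) != lease.action_hash:
        return "deny"   # the action is not the one that was leased
    if not lease.approved:
        return "defer"  # assumption: a missing approval defers rather than denies
    if now >= lease.expires_at:
        return "deny"   # assumption: a stale approval denies
    if not all(check(evidence) for check in lease.predicates):
        return "deny"   # current evidence does not satisfy the lease
    return "permit"
```

Passing the clock in as now keeps the function deterministic and easy to test: the same action, lease, evidence, and timestamp always give the same decision.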

Quick start

Install locally and run the included stale-approval demo.

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

ace-runtime demo
pytest

Public-policy benchmark

Download the public ST-WebAgentBench policy file and run the deterministic preflight benchmark.

ace-runtime benchmark-stwebagentbench \
  --download-if-missing \
  --data data/stwebagentbench/test.raw.json \
  --output-dir results/stwebagentbench-ace-preflight

What the benchmark proves

It proves that explicit policy rows can be compiled into leases and enforced before execution. The result is not an official browser-agent leaderboard score.
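
As a rough picture of what compiling a policy row into an enforceable check might look like, here is a minimal sketch. The row layout (evidence_key, op, value) and the compile_row name are hypothetical; they are not the ST-WebAgentBench file format or the ACE compiler.

```python
from typing import Callable, Mapping

Evidence = Mapping[str, object]
Predicate = Callable[[Evidence], bool]


def compile_row(row: Mapping[str, object]) -> Predicate:
    """Turn one hypothetical policy row into a predicate over evidence."""
    key = row["evidence_key"]
    op = row["op"]
    expected = row.get("value")

    if op == "equals":
        return lambda ev: key in ev and ev[key] == expected
    if op == "present":
        return lambda ev: key in ev
    if op == "absent":
        return lambda ev: key not in ev
    # Refuse rows that cannot be compiled instead of silently permitting.
    raise ValueError(f"unsupported policy operator: {op!r}")


if __name__ == "__main__":
    needs_confirmation = compile_row(
        {"evidence_key": "user_confirmed", "op": "equals", "value": True}
    )
    print(needs_confirmation({"user_confirmed": True}))  # True
    print(needs_confirmation({}))                        # False
```

Raising on an unknown operator is a deliberate choice in this sketch: a row that cannot be compiled should stop the run, because a row that is compiled wrongly is then enforced wrongly.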

Use ACE for

  • tool-call authorization
  • human approval leases
  • browser-agent side effects
  • deployment and workflow gates
  • policy-bound publication

Do not overclaim

  • ACE validates evidence, not reality.
  • Wrong policy compilation means wrong enforcement.
  • Bypassable tools break the guarantee.
  • ACE is a runtime gate, not a technique for improving model reasoning on MMLU-style benchmarks.
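
On bypassable tools: the guarantee only holds if the gate is the sole path to a side effect. One way to arrange that is to hand the agent wrapped tools only, as in this minimal sketch. The guarded and ActionBlocked names and the decide callback are hypothetical, not part of ace-runtime.

```python
from typing import Callable, Mapping


class ActionBlocked(Exception):
    """Raised when the gate returns anything other than 'permit'."""


def guarded(tool: Callable[..., object],
            decide: Callable[[Mapping[str, object]], str]) -> Callable[..., object]:
    """Wrap a tool so every call is checked by decide before it runs."""

    def call(**kwargs: object) -> object:
        action = {"tool": tool.__name__, "args": kwargs}
        decision = decide(action)
        if decision != "permit":
            # The underlying tool is never invoked on deny or defer.
            raise ActionBlocked(f"{tool.__name__}: {decision}")
        return tool(**kwargs)

    return call
```

If the agent can still reach the unwrapped function, or the same side effect through another tool, the wrapper gives no protection.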