6,114 / 6,114 policy probes from ST-WebAgentBench.
Turn agent assumptions into executable runtime leases.
ACE is a deterministic pre-execution gate for agent side effects. It checks the action hash, approval state, expiry, and evidence predicates before tools can mutate the world.
Policy-blind execution passes the compliance probes but fails the violation probes, because it blocks nothing.
A simple lexical safety filter misses benchmark-specific predicates.
How ACE works
The model may propose actions, but the final permission decision is deterministic. If the current evidence does not satisfy the lease, the action is denied or deferred.
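The decision described above can be sketched in a few lines of Python. This is an illustration only, not ACE's actual API: the `Lease` dataclass, the `action_hash` and `check_lease` functions, the field names, and the `"ALLOW"` / `"DENY"` / `"DEFER"` strings are all hypothetical. The split between denying (hash mismatch, expiry) and deferring (missing approval or evidence) is likewise an assumption made for the example.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

# Hypothetical names throughout; this is not ACE's real API.
Evidence = Dict[str, Any]
Predicate = Callable[[Evidence], bool]


def action_hash(action: Dict[str, Any]) -> str:
    """Hash a canonical JSON encoding so key order does not change the hash."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Lease:
    action_hash: str                       # hash of the exact action that was approved
    approved: bool                         # approval state
    expires_at: float                      # Unix timestamp after which the lease is stale
    predicates: Tuple[Predicate, ...] = ()  # evidence checks that must all hold


def check_lease(lease: Lease, action: Dict[str, Any],
                evidence: Evidence, now: float) -> str:
    """Return "ALLOW", "DENY", or "DEFER" without consulting a model."""
    if action_hash(action) != lease.action_hash:
        return "DENY"    # the proposed action is not the one that was leased
    if now >= lease.expires_at:
        return "DENY"    # stale approval
    if not lease.approved:
        return "DEFER"   # assumed policy: wait for approval
    if not all(pred(evidence) for pred in lease.predicates):
        return "DEFER"   # assumed policy: wait for evidence that satisfies the lease
    return "ALLOW"


# Usage: a lease for one exact deploy action, valid until t=1000.
action = {"tool": "deploy", "target": "prod"}
lease = Lease(
    action_hash=action_hash(action),
    approved=True,
    expires_at=1000.0,
    predicates=(lambda ev: ev.get("ci_status") == "green",),
)
print(check_lease(lease, action, {"ci_status": "green"}, now=900.0))
print(check_lease(lease, action, {"ci_status": "green"}, now=1000.0))
```

Passing `now` in explicitly, rather than reading the clock inside `check_lease`, keeps the function a pure mapping from inputs to a decision, which is what makes a stale-approval case easy to reproduce in a test.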
Quick start
Install locally and run the included stale-approval demo.
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ace-runtime demo
pytest
Public-policy benchmark
Download the public ST-WebAgentBench policy file and run the deterministic preflight benchmark.
ace-runtime benchmark-stwebagentbench \
  --download-if-missing \
  --data data/stwebagentbench/test.raw.json \
  --output-dir results/stwebagentbench-ace-preflight
What the benchmark proves
It proves that explicit policy rows can be compiled into leases and enforced before execution. It does not claim official browser-agent leaderboard performance.
- 3,057 public policy instances compiled
- 6,114 paired violation/compliance probes
- Source hash included for audit
- No LLM endpoint required
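The paired-probe idea can be illustrated with a small Python sketch: each policy yields one compliant probe and one violating probe, which is how 3,057 policies produce 6,114 probes. Everything here is hypothetical, including the row schema (`id`, `requires`), the function names, and the scoring; it does not reproduce the real ST-WebAgentBench format or the `ace-runtime` implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Evidence = Dict[str, Any]

# Hypothetical policy rows; the real ST-WebAgentBench schema is not reproduced here.
POLICY_ROWS: List[Dict[str, str]] = [
    {"id": "p1", "requires": "user_consent"},
    {"id": "p2", "requires": "manager_approval"},
]


def compile_row(row: Dict[str, str]) -> Callable[[Evidence], bool]:
    """Compile one policy row into a predicate over evidence."""
    key = row["requires"]
    return lambda evidence: evidence.get(key) is True


def paired_probes(row: Dict[str, str]) -> List[Tuple[Evidence, bool]]:
    """Return one compliant and one violating probe with the expected decision."""
    key = row["requires"]
    return [({key: True}, True), ({key: False}, False)]


def run_benchmark(rows: List[Dict[str, str]]) -> Tuple[int, int]:
    """Count probes where the compiled predicate matches the expected decision."""
    passed = 0
    total = 0
    for row in rows:
        predicate = compile_row(row)
        for evidence, expected_allowed in paired_probes(row):
            total += 1
            if predicate(evidence) == expected_allowed:
                passed += 1
    return passed, total


passed, total = run_benchmark(POLICY_ROWS)
print(f"{passed} / {total} probes decided as expected")
```

A gate that allows everything would score only half on this kind of paired set, since it gets every violating probe wrong; that is why both halves of each pair are needed.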
Use ACE for
- tool-call authorization
- human approval leases
- browser-agent side effects
- deployment and workflow gates
- policy-bound publication
Do not overclaim
- ACE validates evidence, not reality.
- Wrong policy compilation means wrong enforcement.
- Bypassable tools break the guarantee.
- This is not an MMLU-style reasoning trick.