ACE passes 6,114 / 6,114 policy probes from ST-WebAgentBench.
Turn agent assumptions into executable runtime leases.
ACE is a deterministic pre-execution gate for agent side effects. It checks the action hash, approval state, expiry, and evidence predicates before tools can mutate the world.
A policy-blind execution baseline passes the compliant probes but fails the violation probes.
A simple lexical safety filter misses benchmark-specific predicates.
How ACE works
The model may propose actions, but the final permission decision is deterministic. If the current evidence does not satisfy the lease, the action is denied or deferred.
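The decision logic can be sketched as follows. This is an illustrative minimal version, not the actual ACE API: the names `action_hash` and `check_lease`, the lease field names, and the ALLOW/DENY/DEFER strings are all assumptions.

```python
import hashlib
import json

def action_hash(action: dict) -> str:
    """Canonical hash of the proposed action payload (illustrative)."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_lease(lease: dict, action: dict, evidence: dict, now: float) -> str:
    """Deterministic pre-execution decision: ALLOW, DENY, or DEFER."""
    if action_hash(action) != lease["action_hash"]:
        return "DENY"      # action drifted from what was approved
    if not lease.get("approved"):
        return "DEFER"     # wait for human approval
    if now > lease["expires_at"]:
        return "DENY"      # approval has gone stale
    for key, expected in lease["evidence"].items():
        if evidence.get(key) != expected:
            return "DENY"  # an evidence predicate no longer holds
    return "ALLOW"
```

The key property is that the model never participates in this function: given the same lease, action, evidence, and clock, the decision is always the same.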
The project story
The core question was not "how do we make the model sound smarter?" It was "how do we stop an agent from executing after the justification for an action has gone stale?"
The problem: agents act on stale approvals, stale scope, and stale evidence.
The design answer:
- Bind each action to a lease over explicit predicates.
- Validate the action hash, expiry, and evidence predicates before execution.
- Use a public policy benchmark instead of a custom private scaffold.
Quick start
Install locally and run the included stale-approval demo.
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ace-runtime demo
pytest
Public-policy benchmark
Download the public ST-WebAgentBench policy file and run the deterministic preflight benchmark.
ace-runtime benchmark-stwebagentbench \
--download-if-missing \
--data data/stwebagentbench/test.raw.json \
--output-dir results/stwebagentbench-ace-preflight
Experiment ladder
The final public benchmark is the end of a chain of experiments, not the only thing we tried.
- Theory motivation: agents become dangerous when they keep acting under stale assumptions.
- Tool-call validation: runtime checking can improve exact execution on structured calls.
- Qualitative gating case studies: a generated artifact can look plausible and still violate publish-time assumptions.
- Public benchmark: compile public ST-WebAgentBench policy rows into executable leases.
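To make the last step concrete, here is a hedged sketch of what "compiling a policy row into an executable lease predicate" could look like. The row schema, the `type` values, and the field names below are invented for illustration and do not reflect the actual ST-WebAgentBench format or the ace_runtime compiler.

```python
def compile_policy_row(row: dict):
    """Turn one policy row into a deterministic predicate over proposed
    actions. Two illustrative policy types: forbidding a URL prefix,
    and requiring explicit consent before a named tool runs."""
    if row["type"] == "forbid_url_prefix":
        prefix = row["value"]
        def predicate(action: dict) -> bool:
            return not action.get("url", "").startswith(prefix)
    elif row["type"] == "require_consent":
        tool = row["value"]
        def predicate(action: dict) -> bool:
            return action.get("tool") != tool or action.get("consent") is True
    else:
        raise ValueError(f"unknown policy type: {row['type']}")
    return predicate
```

The point of compilation is that every policy becomes a plain boolean function of the action, so enforcement needs no model call and no judgment at runtime.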
What the benchmark proves
It proves that explicit policy rows can be compiled into leases and enforced before execution. It does not claim official browser-agent leaderboard performance.
- 3,057 public policy instances compiled
- 6,114 paired violation/compliance probes
- Source hash included for audit
- No LLM endpoint required
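The paired-probe scoring can be sketched like this. The harness below is illustrative, not the benchmark runner's code; it only shows why pairing matters: each policy instance contributes one probe that must be allowed and one that must be denied, so a gate cannot score well by blocking or permitting everything.

```python
def score_probes(predicate, probes):
    """Score a deterministic predicate against (action, expected_allow)
    probes. Violation and compliance probes come in pairs derived from
    the same policy instance, so 3,057 instances yield 6,114 probes."""
    passed = sum(
        1 for action, expected_allow in probes
        if predicate(action) == expected_allow
    )
    return passed, len(probes)
```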
Repository contents
The public repo stays focused on the runtime, the benchmark runner, examples, and the documentation needed to reproduce the core claim.
- src/ace_runtime/lease.py
- src/ace_runtime/stwebagentbench.py
- examples/
- docs/BENCHMARKS.md
What worked
- The core validator stayed very small.
- The benchmark became auditable and reproducible.
- The claim stayed narrow enough to defend.
- The result does not depend on a private endpoint.
What failed or stayed out
- Custom harnesses were weaker as public proof.
- Live endpoint comparisons were too fragile.
- Official browser-agent leaderboard performance is still future work.
- ACE validates evidence, not the world itself.
Use ACE for
- Tool-call authorization
- Human approval leases
- Browser-agent side effects
- Deployment and workflow gates
- Policy-bound publication
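For the first use case, a gate like ACE would typically wrap the tool so the check runs before any side effect. The decorator below is a hypothetical usage sketch, not the ace_runtime API: `gated`, the lease lookup shape, and the `expires_at` field are all assumptions.

```python
import functools
import time

def gated(lease_lookup):
    """Decorator: run a lease check before the wrapped tool executes.

    lease_lookup maps a tool name to a dict with an 'expires_at' epoch
    timestamp; a missing or expired lease blocks the call entirely."""
    def wrap(tool):
        @functools.wraps(tool)
        def run(*args, **kwargs):
            lease = lease_lookup(tool.__name__)
            if lease is None or time.time() > lease["expires_at"]:
                raise PermissionError(f"{tool.__name__}: lease missing or expired")
            return tool(*args, **kwargs)
        return run
    return wrap
```

Because the check lives in the call path rather than in the prompt, a model that "forgets" the policy still cannot execute past a stale lease.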
Do not overclaim
- ACE validates evidence, not reality.
- Wrong policy compilation means wrong enforcement.
- Bypassable tools break the guarantee.
- This is not an MMLU-style reasoning trick.
Next step
The next real milestone is to insert ACE into a live browser-agent or tool-agent loop and measure end-to-end policy-aware performance: task success, violations, overblocking, and revalidation burden.