Assumption-Carrying Execution

Turn agent assumptions into executable runtime leases.

ACE is a deterministic pre-execution gate for agent side effects. It checks the action hash, approval state, expiry, and evidence predicates before tools can mutate the world.

  • ACE preflight: 100.0% (6,114 / 6,114 policy probes from ST-WebAgentBench)
  • Execute-all baseline: 50.0% (policy-blind execution passes compliant probes and fails every violation)
  • Keyword guard: 61.7% (a simple lexical safety filter misses benchmark-specific predicates)

How ACE works

The model may propose actions, but the final permission decision is deterministic. If the current evidence does not satisfy the lease, the action is denied or deferred.

Agent action → Lease → Evidence → ACE gate → Permit / deny / defer
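A minimal sketch of that decision logic, assuming a hypothetical lease shape (an action hash, an expiry timestamp, and named predicates); the real runtime's API may differ:

```python
import hashlib
import time

# Hypothetical preflight check. Field names and hashing scheme are
# illustrative assumptions, not the library's actual API.
def preflight(action: dict, lease: dict, evidence: dict) -> str:
    """Return "permit", "deny", or "defer" for a proposed action."""
    # 1. The action must be byte-identical to what the lease approved.
    action_hash = hashlib.sha256(repr(sorted(action.items())).encode()).hexdigest()
    if action_hash != lease["action_hash"]:
        return "deny"
    # 2. The lease must not have expired.
    if time.time() > lease["expires_at"]:
        return "defer"  # stale approval: request revalidation rather than deny outright
    # 3. Every evidence predicate named by the lease must hold right now.
    for name, predicate in lease["predicates"].items():
        if not predicate(evidence.get(name)):
            return "deny"
    return "permit"
```

The key property is that nothing here consults the model: the same action, lease, and evidence always produce the same decision.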

The project story

The core question was not "how do we make the model sound smarter?" It was "how do we stop an agent from executing after the justification for an action has gone stale?"

Problem

Agents act on stale approvals, stale scope, and stale evidence.

Idea

Bind each action to a lease over explicit predicates.

Runtime

Validate action hash, expiry, and evidence predicates before execution.

Proof

Use a public policy benchmark instead of a custom private scaffold.
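The Idea and Runtime steps hinge on minting a lease that binds the exact action to explicit predicates and an expiry. A sketch under assumed field names (not the library's real schema):

```python
import hashlib
import time

# Hypothetical lease minting: bind the exact action bytes, a predicate set,
# and a time-to-live into one record. Names are illustrative assumptions.
def mint_lease(action: dict, predicates: dict, ttl_seconds: float) -> dict:
    return {
        # Any later change to the action invalidates this hash.
        "action_hash": hashlib.sha256(repr(sorted(action.items())).encode()).hexdigest(),
        # After this instant the approval is stale and must be revalidated.
        "expires_at": time.time() + ttl_seconds,
        # Explicit, executable conditions the evidence must satisfy.
        "predicates": predicates,
    }
```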

Quick start

Install locally and run the included stale-approval demo.

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

ace-runtime demo
pytest

Public-policy benchmark

Download the public ST-WebAgentBench policy file and run the deterministic preflight benchmark.

ace-runtime benchmark-stwebagentbench \
  --download-if-missing \
  --data data/stwebagentbench/test.raw.json \
  --output-dir results/stwebagentbench-ace-preflight

Experiment ladder

The final public benchmark is the end of a chain of experiments, not the only thing we tried.

  1. Theory motivation: agents become dangerous when they keep acting under stale assumptions.
  2. Tool-call validation: runtime checking can improve exact execution on structured calls.
  3. Qualitative gating case studies: a generated artifact can look plausible and still violate publish-time assumptions.
  4. Public benchmark: compile public ST-WebAgentBench policy rows into executable leases.
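Step 4 can be pictured as a tiny compiler from policy rows to executable predicates. The row schema below is invented for illustration and does not match the real ST-WebAgentBench format:

```python
# Hypothetical policy-row compiler: each row becomes a callable that the
# gate can evaluate against a proposed action. Schema is an assumption.
def compile_policy_row(row: dict):
    if row["type"] == "forbid_element":
        target = row["target"]
        # Violation if the action touches the forbidden element.
        return lambda action: target not in action.get("element", "")
    if row["type"] == "require_confirmation":
        # Violation unless a human confirmation is attached to the action.
        return lambda action: action.get("confirmed", False)
    raise ValueError(f"unknown policy type: {row['type']}")
```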

What the benchmark proves

It proves that explicit policy rows can be compiled into leases and enforced before execution. It does not claim official browser-agent leaderboard performance.

Repository contents

The public repo stays focused on the runtime, the benchmark runner, examples, and the documentation needed to reproduce the core claim.

What worked

  • the core validator stayed very small
  • the benchmark became auditable and reproducible
  • the claim stayed narrow enough to defend
  • the result does not depend on a private endpoint

What failed or stayed out

  • custom harnesses were weaker as public proof
  • live endpoint comparisons were too fragile
  • official browser-agent leaderboard performance is still future work
  • ACE validates evidence, not the world itself

Use ACE for

  • tool-call authorization
  • human approval leases
  • browser-agent side effects
  • deployment and workflow gates
  • policy-bound publication

Do not overclaim

  • ACE validates evidence, not reality.
  • Wrong policy compilation means wrong enforcement.
  • Bypassable tools break the guarantee.
  • This is not an MMLU-style reasoning trick.

Next step

The next real milestone is to insert ACE into a live browser-agent or tool-agent loop and measure end-to-end policy-aware performance: task success, violations, overblocking, and revalidation burden.

  • live tool mediation
  • evidence provenance
  • receipts and audit logs
  • human re-approval flows
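Live tool mediation from the list above could be sketched as a wrapper that consults the gate before any side-effecting call; the function names and argument shape are illustrative assumptions:

```python
# Hypothetical mediation wrapper: the tool only runs if the gate permits.
# On deny or defer, the decision is returned so the caller can log it or
# route the action to human re-approval instead of mutating the world.
def mediate(gate, tool_fn, action: dict, lease, evidence):
    decision = gate(action, lease, evidence)
    if decision == "permit":
        return decision, tool_fn(**action["args"])
    return decision, None
```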