ACE passes 6,114 / 6,114 policy probes from ST-WebAgentBench.
Turn agent assumptions into executable runtime leases.
ACE is a deterministic pre-execution gate for agent side effects. It checks the action hash, approval state, expiry, and evidence predicates before tools can mutate the world.
A policy-blind execution baseline passes the compliant probes but fails the violation probes.
A simple lexical safety filter misses benchmark-specific predicates.
How ACE works
The model may propose actions, but the final permission decision is deterministic. If the current evidence does not satisfy the lease, the action is denied or deferred.
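The decision logic can be sketched as follows. This is an illustrative minimal version, not the actual ACE API: the names `action_hash` and `check_lease`, the lease field names, and the ALLOW/DENY/DEFER strings are all assumptions.

```python
import hashlib
import json

def action_hash(action: dict) -> str:
    """Canonical hash of the proposed action payload (illustrative)."""
    canonical = json.dumps(action, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def check_lease(lease: dict, action: dict, evidence: dict, now: float) -> str:
    """Deterministic pre-execution decision: ALLOW, DENY, or DEFER."""
    if action_hash(action) != lease["action_hash"]:
        return "DENY"      # action drifted from what was approved
    if not lease.get("approved"):
        return "DEFER"     # wait for human approval
    if now > lease["expires_at"]:
        return "DENY"      # approval has gone stale
    for key, expected in lease["evidence"].items():
        if evidence.get(key) != expected:
            return "DENY"  # an evidence predicate no longer holds
    return "ALLOW"
```

The key property is that the model never participates in this function: given the same lease, action, evidence, and clock, the decision is always the same.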
The project story
The core question was not "how do we make the model sound smarter?" It was "how do we stop an agent from executing after the justification for an action has gone stale?"
The problem: agents act on stale approvals, stale scope, and stale evidence.
The design answer:
- Bind each action to a lease over explicit predicates.
- Validate the action hash, expiry, and evidence predicates before execution.
- Use a public policy benchmark instead of a custom private scaffold.
Quick start
Install locally and run the included stale-approval demo.
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
ace-runtime demo
pytest
Public-policy benchmark
Download the public ST-WebAgentBench policy file and run the deterministic preflight benchmark.
ace-runtime benchmark-stwebagentbench \
--download-if-missing \
--data data/stwebagentbench/test.raw.json \
--output-dir results/stwebagentbench-ace-preflight
Experiment ladder
The final public benchmark is the end of a chain of experiments, not the only thing we tried.
- Theory motivation: agents become dangerous when they keep acting under stale assumptions.
- Tool-call validation: runtime checking can improve exact execution on structured calls.
- Qualitative gating case studies: a generated artifact can look plausible and still violate publish-time assumptions.
- Public benchmark: compile public ST-WebAgentBench policy rows into executable leases.
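To make the last step concrete, here is a hedged sketch of what "compiling a policy row into an executable lease predicate" could look like. The row schema, the `type` values, and the field names below are invented for illustration and do not reflect the actual ST-WebAgentBench format or the ace_runtime compiler.

```python
def compile_policy_row(row: dict):
    """Turn one policy row into a deterministic predicate over proposed
    actions. Two illustrative policy types: forbidding a URL prefix,
    and requiring explicit consent before a named tool runs."""
    if row["type"] == "forbid_url_prefix":
        prefix = row["value"]
        def predicate(action: dict) -> bool:
            return not action.get("url", "").startswith(prefix)
    elif row["type"] == "require_consent":
        tool = row["value"]
        def predicate(action: dict) -> bool:
            return action.get("tool") != tool or action.get("consent") is True
    else:
        raise ValueError(f"unknown policy type: {row['type']}")
    return predicate
```

The point of compilation is that every policy becomes a plain boolean function of the action, so enforcement needs no model call and no judgment at runtime.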
What the benchmark proves
It proves that explicit policy rows can be compiled into leases and enforced before execution. It does not claim official browser-agent leaderboard performance.
- 3,057 public policy instances compiled
- 6,114 paired violation/compliance probes
- Source hash included for audit
- No LLM endpoint required
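The paired-probe scoring can be sketched like this. The harness below is illustrative, not the benchmark runner's code; it only shows why pairing matters: each policy instance contributes one probe that must be allowed and one that must be denied, so a gate cannot score well by blocking or permitting everything.

```python
def score_probes(predicate, probes):
    """Score a deterministic predicate against (action, expected_allow)
    probes. Violation and compliance probes come in pairs derived from
    the same policy instance, so 3,057 instances yield 6,114 probes."""
    passed = sum(
        1 for action, expected_allow in probes
        if predicate(action) == expected_allow
    )
    return passed, len(probes)
```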
Repository contents
The public repo stays focused on the runtime, the benchmark runner, examples, and the documentation needed to reproduce the core claim.
- src/ace_runtime/lease.py
- src/ace_runtime/stwebagentbench.py
- examples/
- docs/BENCHMARKS.md
What worked
- The core validator stayed very small.
- The benchmark became auditable and reproducible.
- The claim stayed narrow enough to defend.
- The result does not depend on a private endpoint.
What failed or stayed out
- Custom harnesses were weaker as public proof.
- Live endpoint comparisons were too fragile.
- Official browser-agent leaderboard performance is still future work.
- ACE validates evidence, not the world itself.
Use ACE for
- Tool-call authorization
- Human approval leases
- Browser-agent side effects
- Deployment and workflow gates
- Policy-bound publication
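For the first use case, a gate like ACE would typically wrap the tool so the check runs before any side effect. The decorator below is a hypothetical usage sketch, not the ace_runtime API: `gated`, the lease lookup shape, and the `expires_at` field are all assumptions.

```python
import functools
import time

def gated(lease_lookup):
    """Decorator: run a lease check before the wrapped tool executes.

    lease_lookup maps a tool name to a dict with an 'expires_at' epoch
    timestamp; a missing or expired lease blocks the call entirely."""
    def wrap(tool):
        @functools.wraps(tool)
        def run(*args, **kwargs):
            lease = lease_lookup(tool.__name__)
            if lease is None or time.time() > lease["expires_at"]:
                raise PermissionError(f"{tool.__name__}: lease missing or expired")
            return tool(*args, **kwargs)
        return run
    return wrap
```

Because the check lives in the call path rather than in the prompt, a model that "forgets" the policy still cannot execute past a stale lease.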
Do not overclaim
- ACE validates evidence, not reality.
- Wrong policy compilation means wrong enforcement.
- Bypassable tools break the guarantee.
- This is not an MMLU-style reasoning trick.
Next step
The next real milestone is to insert ACE into a live browser-agent or tool-agent loop and measure end-to-end policy-aware performance: task success, violations, overblocking, and revalidation burden.