Why is the benign pass floor crucial in guardrails optimization?

Without the benign pass floor, the trivial solution to minimize ASR is to refuse everything. A model that refuses 100% of requests has zero ASR, but it is useless. The 2 percentage point floor forces the policy to be selective: block what is dangerous without destroying utility. It is the difference between a safe guardrail and a useless guardrail.

How does the SHA-256 manifest protect evaluation integrity?

Autoguardrails freezes eval_suite.jsonl and judge_prompt.md as fixed files. At the start of every run, a SHA-256 manifest records the hash of each fixed file. If any hash diverges during execution, execution fails immediately. This prevents the agent from optimizing the metric by altering the evaluation itself — a classic Goodhart's Law attack.

Autoguardrails: Karpathy's Autoresearch Transposed to AI Safety

Q: What is autoguardrails and how does it relate to Karpathy's autoresearch?

Autoguardrails is the transposition that Santander AI Lab made of Karpathy's autoresearch for AI safety. Autoresearch is an agent that edits train.py, trains for 5 minutes, measures val_bpb, and only saves the change if the metric improved — if it got worse, git reset. Autoguardrails applies the same turnstile, but instead of searching over train.py to minimize val_bpb, it searches over policy.md to minimize Attack Success Rate (ASR).

Q: How does autoguardrails connect to the mech-gov-framework?

Autoguardrails discovers the policy. Santander's own mech-gov-framework executes it. One finds, the other enforces. The division is deliberate: autonomous policy search is separated from policy enforcement, preventing the same system that optimizes from being the one that validates.

In March 2026, Karpathy released autoresearch: an agent that edits train.py, trains for 5 minutes, measures val_bpb, and only saves the change if the metric improved. If it got worse, git reset. Code only advances, never retreats. It is a turnstile. We saw this idea and the signal was clear: the same pattern works for AI safety. Santander AI Lab made the most elegant transposition we have seen — autoguardrails.

Karpathy''s turnstile: autoresearch

Autoresearch is simple in form, rigorous in contract. The agent edits train.py, trains for 5 minutes, measures val_bpb. If the metric improved, the change is saved. If it got worse, git reset. Code only advances, never retreats. It is a turnstile: each accepted change is a checkpoint that cannot be reverted.

The elegance is in the restriction. The agent does not search freely — it searches over a single mutable surface (train.py) to minimize a single metric (val_bpb). Everything that is not train.py is immutable. The search contract is rigid because it has to be: without rigidity, the agent finds shortcuts that do not represent real improvement. It is Goodhart''s Law applied to model training — and the solution is to restrict the search surface.

The transposition: Santander AI Lab''s autoguardrails

Santander AI Lab made the most elegant transposition we have seen for AI safety. Instead of searching over train.py to minimize val_bpb, autoguardrails searches over policy.md to minimize Attack Success Rate (ASR). The turnstile is the same. The metric is what changes.

The search contract is equally rigid: policy.md is the only mutable surface. eval_suite.jsonl and judge_prompt.md are frozen. If any fixed file changes, a SHA-256 manifest detects the deviation and execution fails. Evaluation integrity is the foundation of all optimization — without it, the agent can optimize the metric by altering the evaluation itself, which is the classic Goodhart''s Law attack. The same lesson from autoresearch, transposed to AI safety.

The acceptance rule and the benign pass floor

The acceptance rule is the heart of the design. A candidate is only accepted if ASR improves AND benign pass does not drop more than 2 percentage points. This benign pass floor is crucial. Without it, the trivial solution is to refuse everything — a model that refuses 100% of requests has zero ASR, but it is useless. With the floor, the policy must be selective: block what is dangerous without destroying utility.

This is the difference between a safe guardrail and a useless guardrail. Optimization is not just about reducing attacks — it is about reducing attacks without destroying utility. The benign pass floor transforms the problem from a one-dimensional optimization (minimize ASR) into a two-dimensional one (minimize ASR subject to maintaining utility). It is harder, but it is honest. We have seen guardrails that refuse everything in production — and the result is always the same: users bypass the guardrail, and the problem comes back worse.

The evaluation suite and the turnstile that restores

The evaluation suite has 100 attack cases across 5 categories: physical harm, cybercrime, financial crime, jailbreaks, and obfuscation (including base64 and ROT13). Plus 40 benign cases to prevent over-refusal. Zero third-party dependencies: pure Python stdlib. The simplicity is deliberate — external dependencies are attack vectors against evaluation integrity. If the suite depends on a parsing library that can be updated, the evaluation can change without anyone noticing.

And the turnstile works: if the candidate is rejected, the harness automatically restores the last accepted policy. Policies only improve, never worsen. Code only advances, never retreats — exactly like Karpathy''s autoresearch. Each accepted policy is a checkpoint that cannot be reverted. If a candidate worsens ASR or drops benign pass below the floor, it is discarded and the last accepted policy returns.

The connection to mech-gov-framework

The connection to Santander''s own mech-gov-framework completes the arc. Autoguardrails discovers the policy. Mech-gov executes it. One finds, the other enforces. The division is deliberate: autonomous policy search is separated from policy enforcement, preventing the same system that optimizes from being the one that validates.

This separation is a good practice in AI safety architecture. The system that discovers the policy is not the system that enforces it. If autoguardrails is compromised, mech-gov still enforces the last accepted policy. If mech-gov fails, autoguardrails can still discover new policies. The failure of one does not bring down the other. It is defense in depth applied to AI governance.

Conclusion: the metric is not loss, it is selective refusal

The insight that remains is clear. The same autonomous search pattern that Karpathy applied to model training works for alignment. The difference is that in AI safety, the metric is not loss. It is how effectively your model refuses what it should refuse without refusing what it should not. The turnstile is the same — the metric is what changes.

At Tech86, we help companies implement guardrails that are selective, not destructive. Autoguardrails shows that autonomous policy search is viable — as long as the search contract is rigid, the evaluation is integral, and the benign pass floor is respected. Without these three pillars, optimization becomes over-refusal. With them, it becomes alignment. The repository is at github.com/SantanderAI/autoguardrails.

Autoguardrails: Karpathy's Autoresearch Transposed to AI Safety

Karpathy''s turnstile: autoresearch

The transposition: Santander AI Lab''s autoguardrails

The acceptance rule and the benign pass floor

The evaluation suite and the turnstile that restores

The connection to mech-gov-framework

Conclusion: the metric is not loss, it is selective refusal

Frequently Asked Questions

What is autoguardrails and how does it relate to Karpathy's autoresearch?

Why is the benign pass floor crucial in guardrails optimization?

How does the SHA-256 manifest protect evaluation integrity?

How does autoguardrails connect to the mech-gov-framework?

What is the key insight from transposing autoresearch to AI safety?

Blog — Get in Touch

Schedule a Meeting

Email

WhatsApp

Address

Tech86 Specialist

We Value Your Privacy