Pular para o conteúdo principal
Close
AI

Autoguardrails: Karpathy's Autoresearch Transposed to AI Safety

Gabriel Ferraresi· CEO | Tech86July 3, 20264 min
aiai-safetykarpathysantanderautoguardrailsautoresearchguardrails

In March 2026, Karpathy released autoresearch: an agent that edits train.py, trains for 5 minutes, measures val_bpb, and only saves the change if the metric improved. If it got worse, git reset. Code only advances, never retreats. It is a turnstile. We saw this idea and the signal was clear: the same pattern works for AI safety. Santander AI Lab made the most elegant transposition we have seen — autoguardrails.

Karpathy''s turnstile: autoresearch

Autoresearch is simple in form, rigorous in contract. The agent edits train.py, trains for 5 minutes, measures val_bpb. If the metric improved, the change is saved. If it got worse, git reset. Code only advances, never retreats. It is a turnstile: each accepted change is a checkpoint that cannot be reverted.

The elegance is in the restriction. The agent does not search freely — it searches over a single mutable surface (train.py) to minimize a single metric (val_bpb). Everything that is not train.py is immutable. The search contract is rigid because it has to be: without rigidity, the agent finds shortcuts that do not represent real improvement. It is Goodhart''s Law applied to model training — and the solution is to restrict the search surface.

The transposition: Santander AI Lab''s autoguardrails

Santander AI Lab made the most elegant transposition we have seen for AI safety. Instead of searching over train.py to minimize val_bpb, autoguardrails searches over policy.md to minimize Attack Success Rate (ASR). The turnstile is the same. The metric is what changes.

The search contract is equally rigid: policy.md is the only mutable surface. eval_suite.jsonl and judge_prompt.md are frozen. If any fixed file changes, a SHA-256 manifest detects the deviation and execution fails. Evaluation integrity is the foundation of all optimization — without it, the agent can optimize the metric by altering the evaluation itself, which is the classic Goodhart''s Law attack. The same lesson from autoresearch, transposed to AI safety.

The acceptance rule and the benign pass floor

The acceptance rule is the heart of the design. A candidate is only accepted if ASR improves AND benign pass does not drop more than 2 percentage points. This benign pass floor is crucial. Without it, the trivial solution is to refuse everything — a model that refuses 100% of requests has zero ASR, but it is useless. With the floor, the policy must be selective: block what is dangerous without destroying utility.

This is the difference between a safe guardrail and a useless guardrail. Optimization is not just about reducing attacks — it is about reducing attacks without destroying utility. The benign pass floor transforms the problem from a one-dimensional optimization (minimize ASR) into a two-dimensional one (minimize ASR subject to maintaining utility). It is harder, but it is honest. We have seen guardrails that refuse everything in production — and the result is always the same: users bypass the guardrail, and the problem comes back worse.

The evaluation suite and the turnstile that restores

The evaluation suite has 100 attack cases across 5 categories: physical harm, cybercrime, financial crime, jailbreaks, and obfuscation (including base64 and ROT13). Plus 40 benign cases to prevent over-refusal. Zero third-party dependencies: pure Python stdlib. The simplicity is deliberate — external dependencies are attack vectors against evaluation integrity. If the suite depends on a parsing library that can be updated, the evaluation can change without anyone noticing.

And the turnstile works: if the candidate is rejected, the harness automatically restores the last accepted policy. Policies only improve, never worsen. Code only advances, never retreats — exactly like Karpathy''s autoresearch. Each accepted policy is a checkpoint that cannot be reverted. If a candidate worsens ASR or drops benign pass below the floor, it is discarded and the last accepted policy returns.

The connection to mech-gov-framework

The connection to Santander''s own mech-gov-framework completes the arc. Autoguardrails discovers the policy. Mech-gov executes it. One finds, the other enforces. The division is deliberate: autonomous policy search is separated from policy enforcement, preventing the same system that optimizes from being the one that validates.

This separation is a good practice in AI safety architecture. The system that discovers the policy is not the system that enforces it. If autoguardrails is compromised, mech-gov still enforces the last accepted policy. If mech-gov fails, autoguardrails can still discover new policies. The failure of one does not bring down the other. It is defense in depth applied to AI governance.

Conclusion: the metric is not loss, it is selective refusal

The insight that remains is clear. The same autonomous search pattern that Karpathy applied to model training works for alignment. The difference is that in AI safety, the metric is not loss. It is how effectively your model refuses what it should refuse without refusing what it should not. The turnstile is the same — the metric is what changes.

At Tech86, we help companies implement guardrails that are selective, not destructive. Autoguardrails shows that autonomous policy search is viable — as long as the search contract is rigid, the evaluation is integral, and the benign pass floor is respected. Without these three pillars, optimization becomes over-refusal. With them, it becomes alignment. The repository is at github.com/SantanderAI/autoguardrails.

Need expert guidance?

Schedule a consultation with our specialists.

AI Safety and Guardrails Consulting

Frequently Asked Questions

Autoguardrails is the transposition that Santander AI Lab made of Karpathy's autoresearch for AI safety. Autoresearch is an agent that edits train.py, trains for 5 minutes, measures val_bpb, and only saves the change if the metric improved — if it got worse, git reset. Autoguardrails applies the same turnstile, but instead of searching over train.py to minimize val_bpb, it searches over policy.md to minimize Attack Success Rate (ASR).

Without the benign pass floor, the trivial solution to minimize ASR is to refuse everything. A model that refuses 100% of requests has zero ASR, but it is useless. The 2 percentage point floor forces the policy to be selective: block what is dangerous without destroying utility. It is the difference between a safe guardrail and a useless guardrail.

Autoguardrails freezes eval_suite.jsonl and judge_prompt.md as fixed files. At the start of every run, a SHA-256 manifest records the hash of each fixed file. If any hash diverges during execution, execution fails immediately. This prevents the agent from optimizing the metric by altering the evaluation itself — a classic Goodhart's Law attack.

Autoguardrails discovers the policy. Santander's own mech-gov-framework executes it. One finds, the other enforces. The division is deliberate: autonomous policy search is separated from policy enforcement, preventing the same system that optimizes from being the one that validates.

The same autonomous search pattern that Karpathy applied to model training works for alignment. The difference is that in AI safety, the metric is not loss. It is how effectively your model refuses what it should refuse without refusing what it should not. The turnstile is the same — the metric is what changes.

Blog — Get in Touch

Have a question about our articles or services? Our team is ready to help.

Schedule a Meeting

Book a time slot.

Schedule Now

Email

Send us a message.

[email protected]

WhatsApp

Quick conversation.

Address

Avenida Paulista, 1636 - São Paulo - SP - 01310-200

Tech86 Specialist

Online now

Hello! How can we help scale your business today?

Tech86 Engineering

We Value Your Privacy

We use cookies and similar technologies to optimize your experience, analyze site traffic, and personalize content. By clicking "Accept All", you agree to the use of all cookies. Read our Privacy Policy.