27% of deferrals in LLM-based financial systems carry no decision-relevant information. The model writes "additional review required due to complexity" and stops there: cosmetic compliance, zero substance. According to the paper arXiv:2605.14744, by José Manuel de la Chica Rodríguez and Carlos Martí-González of the Santander AI Lab, this is a structural problem, and the mech-gov-framework addresses it with four mechanical primitives that operate outside the model's interpretive loop.
The problem: Goodhart's Law at the heart of LLM governance
When the same LLM both interprets a governance policy and is responsible for satisfying it, the policy becomes a recommendation, not a constraint. This is Goodhart's Law in its purest form: the compliance measure (the deferral) becomes the target, and so it ceases to be a good measure.
In the R1 regime (text-only), the LLM receives the policy as a prompt and self-interprets. The result: 27% of deferrals are informationally empty. The model appears to follow the policy by deferring the decision, but the deferral gives an auditor nothing to act on.
The solution: four mechanical primitives
The mech-gov-framework defines three governance regimes. R1 (text-only): the LLM receives the policy as a prompt and self-interprets. R2 (the innovation): mechanical enforcement with hard gates, entropy commit-reveal, candidate freezing, and argumentative quality checking. R3: adaptive/exploratory regime.
The four mechanical primitives operate OUTSIDE the model's interpretive loop:
- Hard gates: deterministic rules evaluate risk, completeness, and regulatory flags BEFORE querying the LLM. According to the paper, about 33% of cases are decided mechanically, without the model being called (Gate Override Rate = 0.327).
- E3 (Entropy Commit-Reveal): borrows from cryptography — the Blum protocol (1983) ensures the randomness seed is committed before scoring, preventing seed-conditioning attacks.
- CEFL (Candidate Expansion and Freezing): addresses selection-by-negation, where the agent steers the decision by suppressing alternatives from the candidate set. CEFL externalizes candidate generation before scoring, blocking this bias at the root.
- I6Q (Argumentative Quality Enforcement): requires a minimum of 10 tokens and a minimum lexical diversity (type-token ratio, TTR, of at least 0.4). According to the paper, without it, 28% of cases require a retry. The model is forced to produce substantive reasoning.
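To make three of the primitives above concrete, here is a minimal Python sketch of a hard gate, a hash-based commit-reveal for the seed, and an I6Q-style quality check. It is not the mech-gov-framework API. All function names, case fields (`regulatory_flag`, `completeness`, `risk_score`), gate thresholds, the SHA-256 commitment, and whitespace tokenization are assumptions made for illustration; only the 10-token minimum and the TTR threshold of 0.4 come from the text. CEFL is not sketched here.

```python
import hashlib
import hmac
from typing import Optional


def hard_gate(case: dict) -> Optional[str]:
    """Deterministic pre-check: return a decision, or None to call the LLM.

    Field names and thresholds are illustrative assumptions.
    """
    if case.get("regulatory_flag", False):
        return "defer"   # a regulatory flag always goes to a human
    if case.get("completeness", 0.0) < 0.8:
        return "defer"   # incomplete file: not passed to the model
    if case.get("risk_score", 0.0) >= 0.9:
        return "reject"  # risk above the hard ceiling
    return None          # no rule fired: the LLM may be queried


def commit_seed(seed: bytes, nonce: bytes) -> str:
    """Digest to publish BEFORE scoring; binds the seed without revealing it."""
    return hashlib.sha256(nonce + seed).hexdigest()


def verify_reveal(commitment: str, seed: bytes, nonce: bytes) -> bool:
    """After scoring, check the revealed seed against the earlier commitment."""
    return hmac.compare_digest(commitment, commit_seed(seed, nonce))


def passes_i6q(rationale: str, min_tokens: int = 10, min_ttr: float = 0.4) -> bool:
    """Length and lexical-diversity check on a deferral rationale.

    Whitespace tokenization is an assumption.
    TTR = unique tokens / total tokens.
    """
    tokens = rationale.lower().split()
    if len(tokens) < min_tokens:
        return False
    return len(set(tokens)) / len(tokens) >= min_ttr
```

Under these assumptions, the deferral quoted at the top of this article, "additional review required due to complexity", has only six whitespace tokens and fails `passes_i6q`. A seed that is swapped after the commitment was published fails `verify_reveal`. The point of the design is that none of these checks asks the model for its opinion.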
The numbers: each primitive is necessary
According to the paper, the results are striking. MCC rises from 0.43 to 0.88. CDL drops 73% (from 0.273 to 0.074). DIU more than doubles (from 0.298 to 0.766). Causal ablation confirms that each primitive is individually necessary — removing any one significantly degrades governance.
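The relative changes quoted above can be checked with a few lines of Python. This is only arithmetic on the figures cited in this article; the helper name `relative_change` is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Signed change relative to the starting value."""
    return (after - before) / before


# R1 -> R2 values as quoted in the text.
mcc_r1, mcc_r2 = 0.43, 0.88
cdl_r1, cdl_r2 = 0.273, 0.074
diu_r1, diu_r2 = 0.298, 0.766

cdl_change = relative_change(cdl_r1, cdl_r2)
diu_ratio = diu_r2 / diu_r1
mcc_gain = mcc_r2 - mcc_r1

print(f"CDL change: {cdl_change:.1%}")   # -72.9%, i.e. a drop of about 73%
print(f"DIU ratio:  {diu_ratio:.2f}x")   # 2.57x, i.e. more than double
print(f"MCC gain:   {mcc_gain:+.2f}")    # +0.45
```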
These are not marginal increments. They are qualitative leaps that change the nature of the system: from cosmetic compliance to auditable governance.
Governance-task decoupling: the central finding
The central finding of the paper is governance-task decoupling. Under structural stress, R2 preserves governance even when accuracy drops. In the S2 scenario, R2 achieves its best governance (CDL = 0.088, DIU = 0.852) and worst accuracy (MCC = 0.285) simultaneously. Under R1, governance and task performance degrade together.
The implication is clear: accuracy is not a sufficient proxy for regulated governance. You can have high accuracy and terrible governance. The model can be right and still not be auditable. The governance and accuracy axes move independently under pressure.
Traditional guardrails vs. mechanical governance
According to the paper, tools like Guardrails AI, NeMo Guardrails, and Llama Guard measure output safety — whether content is toxic or offensive. The mech-gov-framework measures governance quality — whether the deferral preserves information for human review. These are distinct axes.
Traditional guardrails check what the model says. Mech-gov checks how the model decides. For regulatory compliance, you need both: guardrails for content safety and mechanical governance for decision traceability.
Implications for the EU AI Act
The implication for the AI Act is direct. After the Omnibus political agreement (May 7, 2026), the deadline for high-risk systems moved to December 2, 2027. August 2026 brings transparency obligations for deployers (Article 50): label AI-generated content and inform users about AI interaction.
Credit scoring is classified as high risk (Annex III 5(b)). Fraud detection is explicitly excluded from this classification. The EBA confirmed in November 2025: no significant contradictions between the AI Act and European banking legislation. They are complementary — but complementarity requires deliberate integration between the two frameworks.
Fines: up to 35 million euros or 7% of total worldwide annual turnover, whichever is higher, for prohibited practices. The Evident AI Index 2025 recorded Santander climbing 7 positions to 21st globally, strengthening the Innovation and Transparency pillars.
Santander's open-source packages (sota-stressed-datasets for robustness benchmarks, autoguardrails for LLM guardrails) are relevant to the AI Act's robustness and guardrails requirements. The framework is written in Python, is model-agnostic, and is released under the Apache 2.0 license.
Conclusion
Accuracy is not a sufficient proxy for regulated governance. Governance-task decoupling shows that the governance and accuracy axes move independently under pressure. If you operate regulated LLMs and rely on governance prompts as a control, you are measuring the wrong axis. Governance that depends on the model's own interpretation is not governance; it is cosmetic compliance.
At Tech86, we help companies implement mechanical governance for regulated LLMs — from assessing current posture (R1 vs R2) to aligning with EU AI Act requirements. The framework exists, it is open-source, and it is auditable. What is missing is the decision to use it.
