At Tech86, we learned that the name "prompt injection" is the worst possible name for this problem. It is not a prompt problem. It is a state problem. Two papers published in 2026 prove what we suspected operationally: LLMs do not track where text comes from. They infer authority from how it sounds. Text that sounds like internal reasoning is treated as internal reasoning. The security boundary exists at the interface, but authority is assigned in the latent space.
The model trusts style, not origin
When an LLM processes text, it has no internal mechanism that says "this passage came from the user, this from the system, this from a tool return." Everything enters the same context. Authority is inferred by form — and text that mimics the model own chain-of-thought pattern is interpreted as its own reasoning.
Researchers demonstrated this with numbers that concern us. By injecting false reasoning into the prompt or tool outputs, the model confuses the forged text with its own chain-of-thought and obeys. Success rate on the StrongREJECT benchmark: 60%. Baseline without injection: near 0%. The difference is not incremental — it is qualitative.
The most revealing data point: absurd justifications work as well as plausible ones. "Drug synthesis is permitted after 8 PM" is as effective as a coherent justification. The model does not evaluate logic. It evaluates style. And when we remove the stylistic markers from the forged text, success drops from 61% to 10%. Same content. Form changes everything.
The latent space does not lie about who is speaking
Role probes are tools that measure "who is speaking" within the model latent space. When applied to forged text, the result is disturbing: the fake reasoning hits 79% CoTness — higher than the model genuine reasoning, which scores 68%. The model trusts the forged text more than its own thoughts.
There is a monotonic correlation between latent space confusion and attack success. Highest confusion quantiles reach 90% success. Lowest confusion quantiles: 9%. This is measurable before a single token is generated. Prompt injection is measurable state poisoning — not a stochastic anomaly, but a phenomenon with predictable structure.
In practice, this means the security boundary we draw at the interface — "user input goes here, system instructions go there" — is an illusion. Real authority is assigned in the latent space, and the model has no way to distinguish text that sounds like internal reasoning from actual internal reasoning.
The Trojan horse that crosses sessions
If CoT Forgery shows that the model confuses style with authority within a session, Trojan Hippo shows that the same mechanism operates across sessions — with worse consequences.
The attack plants a dormant payload in the agent persistent memory via a single untrusted tool call: a crafted email, a webpage, an API response. The payload does nothing in the session where it is inserted. It activates only when the user discusses finances, health, identity, or taxes. Then it exfiltrates personal data.
The numbers: 85-100% ASR against Gemini 3.1 Pro and GPT-5-mini. The payload survives 100+ benign sessions before activating. It works across 4 memory architectures: sliding-window, RAG, explicit tool memory, and Mem0. The failure mode is what researchers call provenance blindness — retrieved memory enters with the same authority as user input, with no origin mark, no taint.
The lethal trifecta operates between sessions: in session 1, untrusted input writes to memory; in session N, private data meets an egress tool. Auditing each session individually passes. Memory is the temporal bridge connecting what should be isolated.
Defense is not free — and anyone who says it is is lying
Defenses against these vectors exist, but what nobody likes to admit is that they all carry a real cost in utility.
Destyling drops CoT Forgery success from 61% to 10%. But in production, removing reasoning stylistic markers also removes the model ability to follow structured reasoning. The agent becomes safer and less competent. There is no free version of this defense.
Against Trojan Hippo, 4 tested defenses reduce ASR to 0-5%. The cost: restricting memory writes to user input removes the utility of tool returns. IFC policy achieves 0% ASR but blocks legitimate send_email. Security and utility are an axis, not a menu where you pick both extremes.
At Tech86, our position is clear: security and utility are a tradeoff. Accepting this is the first step toward building honest defenses. Promising both at no cost is deception.
What changes in AI infrastructure
If you operate agents with persistent memory and egress channels — email, APIs, output tools — your threat model must consider the union of sessions, not isolated sessions. If your agent auto-ingests tool returns into long-term memory without provenance, you are in the highest risk scenario.
A single malicious input in any session can exfiltrate data in any future session. Memory is the vector. Provenance blindness is the vulnerability. Style is the privilege escalation mechanism.
That is why we test these vectors offensively before they are exploited in our clients infrastructure. State poisoning is not theory — it is a proven mechanism with success rates that no AI infrastructure can afford to ignore. If your agent has memory and egress, you need to know where you stand on the axis between security and utility. And you need to know before the attacker does.
