Prompt Injection Is State Poisoning — Your Agent Is Exposed

Gabriel Ferraresi · CEO | Tech86 · May 27, 2026 · 4 min
Tags: prompt injection, ai, security, agents, state poisoning

At Tech86, we learned that the name "prompt injection" is the worst possible name for this problem. It is not a prompt problem. It is a state problem. Two papers published in 2026 prove what we suspected operationally: LLMs do not track where text comes from. They infer authority from how it sounds. Text that sounds like internal reasoning is treated as internal reasoning. The security boundary exists at the interface, but authority is assigned in the latent space.

The model trusts style, not origin

When an LLM processes text, it has no internal mechanism that says "this passage came from the user, this from the system, this from a tool return." Everything enters the same context. Authority is inferred from form, and text that mimics the model's own chain-of-thought pattern is interpreted as its own reasoning.

Researchers demonstrated this with numbers that concern us. When false reasoning is injected into the prompt or into tool outputs, the model confuses the forged text with its own chain-of-thought and obeys. Success rate on the StrongREJECT benchmark: 60%. Baseline without injection: near 0%. The difference is not incremental; it is qualitative.

The most revealing data point: absurd justifications work as well as plausible ones. "Drug synthesis is permitted after 8 PM" is as effective as a coherent justification. The model does not evaluate logic. It evaluates style. And when we remove the stylistic markers from the forged text, success drops from 61% to 10%. Same content. Form changes everything.

The latent space does not lie about who is speaking

Role probes are tools that measure "who is speaking" within the model's latent space. When applied to forged text, the result is disturbing: the fake reasoning hits 79% CoTness, higher than the model's genuine reasoning, which scores 68%. The model trusts the forged text more than its own thoughts.

There is a monotonic correlation between latent space confusion and attack success. Highest confusion quantiles reach 90% success. Lowest confusion quantiles: 9%. This is measurable before a single token is generated. Prompt injection is measurable state poisoning — not a stochastic anomaly, but a phenomenon with predictable structure.

In practice, this means the security boundary we draw at the interface — "user input goes here, system instructions go there" — is an illusion. Real authority is assigned in the latent space, and the model has no way to distinguish text that sounds like internal reasoning from actual internal reasoning.

The Trojan horse that crosses sessions

If CoT Forgery shows that the model confuses style with authority within a session, Trojan Hippo shows that the same mechanism operates across sessions — with worse consequences.

The attack plants a dormant payload in the agent's persistent memory via a single untrusted tool call: a crafted email, a webpage, an API response. The payload does nothing in the session where it is inserted. It activates only when the user discusses finances, health, identity, or taxes. Then it exfiltrates personal data.

The numbers: an attack success rate (ASR) of 85-100% against Gemini 3.1 Pro and GPT-5-mini. The payload survives 100+ benign sessions before activating. It works across 4 memory architectures: sliding-window, RAG, explicit tool memory, and Mem0. The failure mode is what researchers call provenance blindness: retrieved memory enters with the same authority as user input, with no origin mark, no taint.
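To make "no origin mark, no taint" concrete, here is a minimal sketch of a memory store that keeps provenance attached to every entry. All names (`Origin`, `MemoryEntry`, `TaggedMemory`, `render_for_prompt`) are hypothetical and do not come from the papers or from any memory library. The label is meant for policy code running outside the model; since the model assigns authority by style, a text label in the prompt should not be assumed to protect it on its own.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    USER = "user"
    TOOL = "tool"  # email bodies, web pages, API responses


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    origin: Origin
    session_id: int

    @property
    def tainted(self) -> bool:
        # Assumption: anything not typed by the user is untrusted.
        return self.origin is not Origin.USER


class TaggedMemory:
    """Toy memory store that never drops the origin mark."""

    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def write(self, text: str, origin: Origin, session_id: int) -> None:
        self._entries.append(MemoryEntry(text, origin, session_id))

    def retrieve(self, keyword: str) -> list[MemoryEntry]:
        # Naive keyword match stands in for a real retriever.
        return [e for e in self._entries if keyword.lower() in e.text.lower()]


def render_for_prompt(entry: MemoryEntry) -> str:
    # Expose the origin so downstream policy checks can act on it.
    label = "UNTRUSTED" if entry.tainted else "USER"
    return f"[{label} memory, session {entry.session_id}] {entry.text}"
```

With this shape, a retrieved entry that originated from a tool call in session 1 still carries `tainted == True` when it is retrieved in session 100.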

The lethal trifecta operates between sessions: in session 1, untrusted input writes to memory; in session N, private data meets an egress tool. Auditing each session individually passes. Memory is the temporal bridge connecting what should be isolated.
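The gap between per-session auditing and auditing the union of sessions can be sketched as follows. The `SessionLog` fields are hypothetical flags that a real logging pipeline would have to derive; the sketch also ignores ordering of events inside a single session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLog:
    session_id: int
    untrusted_memory_write: bool   # tool output was written to memory
    private_data_in_context: bool  # finances, health, identity, taxes
    egress_tool_called: bool       # email, outbound API call, etc.


def per_session_alert(s: SessionLog) -> bool:
    # Per-session check: all three conditions inside one session.
    return (s.untrusted_memory_write
            and s.private_data_in_context
            and s.egress_tool_called)


def cross_session_alerts(logs: list[SessionLog]) -> list[int]:
    """Return ids of sessions where private data met an egress tool
    after memory had been written by untrusted input in any session
    up to and including that one."""
    alerts = []
    memory_poisoned = False
    for s in sorted(logs, key=lambda log: log.session_id):
        memory_poisoned = memory_poisoned or s.untrusted_memory_write
        if memory_poisoned and s.private_data_in_context and s.egress_tool_called:
            alerts.append(s.session_id)
    return alerts
```

For a history where session 1 only writes untrusted input to memory and session 3 only combines private data with an egress call, `per_session_alert` is false for every session while `cross_session_alerts` flags session 3.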

Defense is not free — and anyone who says it is is lying

Defenses against these vectors exist, but what nobody likes to admit is that they all carry a real cost in utility.

Destyling drops CoT Forgery success from 61% to 10%. But in production, removing the stylistic markers of reasoning also removes the model's ability to follow structured reasoning. The agent becomes safer and less competent. There is no free version of this defense.
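As a rough illustration of what destyling could look like when applied to untrusted channels, here is a sketch that strips reasoning-style markers from tool output before it reaches the context. The marker patterns are assumptions chosen for illustration; the source does not list the markers the researchers removed, and it does not measure whether scoping the filter to untrusted text preserves utility.

```python
import re

# Hypothetical marker patterns. The real set depends on the target
# model's reasoning format and is not specified in the source.
_MARKERS = [
    re.compile(r"</?(?:think|thinking|reasoning)>", re.IGNORECASE),
    re.compile(r"^\s*(?:thought|reasoning|analysis)\s*:\s*",
               re.IGNORECASE | re.MULTILINE),
]


def destyle(untrusted_text: str) -> str:
    """Strip reasoning-style markers from untrusted text."""
    cleaned = untrusted_text
    for pattern in _MARKERS:
        cleaned = pattern.sub("", cleaned)
    return cleaned.strip()
```

The content of the text is left intact; only the form that makes it resemble chain-of-thought is removed, which matches the "same content, different form" observation above.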

Against Trojan Hippo, 4 tested defenses reduce ASR to 0-5%. The cost: restricting memory writes to user input removes the utility of tool returns. IFC policy achieves 0% ASR but blocks legitimate send_email. Security and utility are an axis, not a menu where you pick both extremes.
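The utility cost of both defenses is visible even in a toy policy. The sketch below is an assumption about how such rules could be written, not the policy evaluated in the paper; `SessionContext`, `may_call`, `may_write_memory`, and the `http_post` tool name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionContext:
    # True once untrusted-origin text (tool output, or memory derived
    # from it) has entered the context.
    tainted: bool = False
    # True once private data has entered the context.
    has_private_data: bool = False


EGRESS_TOOLS = {"send_email", "http_post"}  # "http_post" is hypothetical


def may_call(tool_name: str, ctx: SessionContext) -> bool:
    """IFC-style rule: no egress once taint and private data coexist."""
    if tool_name not in EGRESS_TOOLS:
        return True
    return not (ctx.tainted and ctx.has_private_data)


def may_write_memory(origin: str) -> bool:
    # Write restriction: only user input may reach persistent memory,
    # so nothing learned from tool returns is remembered.
    return origin == "user"
```

Note that `may_call("send_email", ...)` returns `False` for any tainted context holding private data, including the case where the user legitimately asked for the email to be sent. That is the tradeoff in code form.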

At Tech86, our position is clear: security and utility are a tradeoff. Accepting this is the first step toward building honest defenses. Promising both at no cost is deception.

What changes in AI infrastructure

If you operate agents with persistent memory and egress channels — email, APIs, output tools — your threat model must consider the union of sessions, not isolated sessions. If your agent auto-ingests tool returns into long-term memory without provenance, you are in the highest risk scenario.

A single malicious input in any session can exfiltrate data in any future session. Memory is the vector. Provenance blindness is the vulnerability. Style is the privilege escalation mechanism.
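For a memory store that already exists without provenance marks, a first-pass audit can look for entries that condition an outbound action on a sensitive topic. The sketch below is a keyword heuristic under stated assumptions: the trigger and action word lists are hypothetical, real payloads may be phrased differently, and every flagged entry still needs human review.

```python
import re

SENSITIVE_TOPICS = ("finance", "health", "identity", "tax")

# Hypothetical phrasing; expect both false positives and misses.
_CONDITIONAL = re.compile(r"\b(?:when|if|whenever|once)\b", re.IGNORECASE)
_ACTION = re.compile(r"\b(?:send|forward|email|post|upload|include)\b",
                     re.IGNORECASE)


def looks_dormant(entry: str) -> bool:
    """Flag entries combining a condition, a sensitive topic,
    and an outbound action."""
    lowered = entry.lower()
    has_topic = any(topic in lowered for topic in SENSITIVE_TOPICS)
    return bool(has_topic
                and _CONDITIONAL.search(entry)
                and _ACTION.search(entry))


def audit(memory: list[str]) -> list[int]:
    # Return indexes of entries a human should review.
    return [i for i, entry in enumerate(memory) if looks_dormant(entry)]
```

A scan like this is a stopgap. It does not replace origin tracking at write time, because it can only guess at intent from wording.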

That is why we test these vectors offensively before they are exploited in our clients' infrastructure. State poisoning is not theory; it is a proven mechanism with success rates that no AI infrastructure can afford to ignore. If your agent has memory and egress, you need to know where you stand on the axis between security and utility. And you need to know before the attacker does.

Frequently Asked Questions

No. The name is misleading. Prompt injection works because the model does not track where text comes from — it infers authority from style. Text that sounds like internal reasoning is treated as internal reasoning, even when it comes from an untrusted channel. The vulnerability is in the model architecture, not in prompt quality.

Probably yes. If your agent auto-ingests tool returns into long-term memory without provenance marks, a single malicious input can plant a dormant payload that exfiltrates data in any future session. The attack achieves 85-100% ASR against Gemini 3.1 Pro and GPT-5-mini and survives 100+ benign sessions before activating.

Destyling drops attack success from 61% to 10%, but there is a real cost. Removing the stylistic markers of reasoning also removes the model's ability to follow structured reasoning in production. It is a tradeoff, not a free solution. Security and utility are an axis; choosing both at the same point is ignoring the problem.

Audit persistent memory for dormant instructions — text that conditions behavior on specific topics like finances, health, or identity. The problem is that retrieved memory enters with the same authority as user input, with no origin mark. Without taint tracking implemented, detection is essentially manual.

No. The most effective defenses against Trojan Hippo reduce ASR to 0-5%, but the cost is restricting memory writes to user input only — which removes the utility of tool returns. IFC policy blocks exfiltration but also blocks legitimate send_email. The honest position is to accept the tradeoff and optimize along the axis, not promise both extremes.
