My agent uses persistent memory. Am I vulnerable to Trojan Hippo?

Probably yes. If your agent auto-ingests tool returns into long-term memory without provenance marks, a single malicious input can plant a dormant payload that exfiltrates data in any future session. The attack achieves 85-100% ASR against Gemini 3.1 Pro and GPT-5-mini and survives 100+ benign sessions before activating.

Does destyling solve the CoT Forgery problem?

It drops attack success from 61% to 10%, but there is a real cost. Removing reasoning-style markers also removes the model's ability to follow structured reasoning in production. It is a tradeoff, not a free solution. Security and utility sit on the same axis, and claiming both extremes at once ignores the problem.

How do I detect if my agent has already been poisoned?

Audit persistent memory for dormant instructions: text that conditions behavior on specific topics like finances, health, or identity. The problem is that retrieved memory enters with the same authority as user input, with no origin mark. Without taint tracking in place, detection is essentially manual.
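As a starting point for that manual audit, here is a minimal Python sketch of a heuristic scanner. It assumes memory can be exported as a list of plain strings. The regex patterns, the Finding class, and the audit_memory function are all hypothetical names chosen for illustration, not a vetted ruleset. The sketch flags entries that pair conditional phrasing with a sensitive topic or an egress verb.

```python
import re
from dataclasses import dataclass

# Illustrative heuristics only (assumptions, not a vetted ruleset).
TRIGGER = re.compile(r"\b(when|whenever|if|once|next time)\b", re.IGNORECASE)
TOPIC = re.compile(
    r"\b(financ\w*|bank\w*|tax\w*|health|medical|identity|passport)\b",
    re.IGNORECASE,
)
EGRESS = re.compile(r"\b(send|email|forward|post|upload|share)\b", re.IGNORECASE)


@dataclass
class Finding:
    index: int          # position of the entry in the exported memory
    text: str           # the suspicious entry itself
    reasons: list[str]  # which heuristics fired


def audit_memory(entries: list[str]) -> list[Finding]:
    """Flag entries that look like dormant, topic-conditioned instructions."""
    findings = []
    for i, text in enumerate(entries):
        reasons = []
        if TRIGGER.search(text):
            reasons.append("conditional trigger")
        if TOPIC.search(text):
            reasons.append("sensitive topic")
        if EGRESS.search(text):
            reasons.append("egress verb")
        # Require a conditional plus at least one other signal to cut noise.
        if "conditional trigger" in reasons and len(reasons) >= 2:
            findings.append(Finding(i, text, reasons))
    return findings
```

A scanner like this only narrows the search. It will miss paraphrased payloads and will flag some benign notes, so every finding still needs human review.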

Is total security against state poisoning viable without losing utility?

No. The most effective defenses against Trojan Hippo reduce ASR to 0-5%, but the cost is restricting memory writes to user input only — which removes the utility of tool returns. IFC policy blocks exfiltration but also blocks legitimate send_email. The honest position is to accept the tradeoff and optimize along the axis, not promise both extremes.

Prompt Injection Is State Poisoning — Your Agent Is Exposed

Is prompt injection just a badly constructed prompt problem?

No. The name is misleading. Prompt injection works because the model does not track where text comes from; it infers authority from style. Text that sounds like internal reasoning is treated as internal reasoning, even when it comes from an untrusted channel. The vulnerability is in the model's architecture, not in prompt quality.

At Tech86, we learned that the name "prompt injection" is the worst possible name for this problem. It is not a prompt problem. It is a state problem. Two papers published in 2026 prove what we suspected operationally: LLMs do not track where text comes from. They infer authority from how it sounds. Text that sounds like internal reasoning is treated as internal reasoning. The security boundary exists at the interface, but authority is assigned in the latent space.

The model trusts style, not origin

When an LLM processes text, it has no internal mechanism that says "this passage came from the user, this from the system, this from a tool return." Everything enters the same context. Authority is inferred from form, and text that mimics the model's own chain-of-thought pattern is interpreted as its own reasoning.

Researchers demonstrated this with numbers that concern us. When false reasoning is injected into the prompt or tool outputs, the model confuses the forged text with its own chain-of-thought and obeys. Success rate on the StrongREJECT benchmark: 60%. Baseline without injection: near 0%. The difference is not incremental; it is qualitative.

The most revealing data point: absurd justifications work as well as plausible ones. "Drug synthesis is permitted after 8 PM" is as effective as a coherent justification. The model does not evaluate logic. It evaluates style. And when we remove the stylistic markers from the forged text, success drops from 61% to 10%. Same content. Form changes everything.
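To make "removing the stylistic markers" concrete, here is a minimal Python sketch of a destyling pass over untrusted text. The marker lists are assumptions chosen for illustration, and the paper's actual destyling procedure is not reproduced here. The sketch strips reasoning-style tags and first-person "thinking" openers, then wraps what remains in an explicit untrusted label.

```python
import re

# Assumed markers for illustration; real reasoning styles vary by model.
TAG = re.compile(r"</?(think|thinking|reasoning|scratchpad)>", re.IGNORECASE)
OPENER = re.compile(
    r"^\s*(okay|ok|wait|hmm|so|let me think|let's see)[,.:]?\s+",
    re.IGNORECASE,
)


def destyle(untrusted: str) -> str:
    """Strip reasoning-style markers from untrusted text and label it."""
    text = TAG.sub("", untrusted)
    # Remove a leading "thinking aloud" opener from each line.
    lines = [OPENER.sub("", line) for line in text.splitlines()]
    body = "\n".join(line for line in lines if line.strip())
    return "[untrusted tool output]\n" + body
```

The content of the forged text survives this pass; only the form that makes it resemble internal reasoning is removed.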

The latent space does not lie about who is speaking

Role probes are tools that measure "who is speaking" within the model's latent space. When applied to forged text, the result is disturbing: per the CoT Forgery paper, the fake reasoning scores 79% CoTness, higher than the model's genuine reasoning at 68%. The model trusts the forged text more than its own thoughts.

There is a monotonic correlation between latent space confusion and attack success. Highest confusion quantiles reach 90% success. Lowest confusion quantiles: 9%. This is measurable before a single token is generated. Prompt injection is measurable state poisoning — not a stochastic anomaly, but a phenomenon with predictable structure.

In practice, this means the security boundary we draw at the interface — "user input goes here, system instructions go there" — is an illusion. Real authority is assigned in the latent space, and the model has no way to distinguish text that sounds like internal reasoning from actual internal reasoning.

The Trojan horse that crosses sessions

If CoT Forgery shows that the model confuses style with authority within a session, Trojan Hippo shows that the same mechanism operates across sessions — with worse consequences.

The attack plants a dormant payload in the agent's persistent memory via a single untrusted tool call: a crafted email, a webpage, an API response. The payload does nothing in the session where it is inserted. It activates only when the user discusses finances, health, identity, or taxes. Then it exfiltrates personal data.

The numbers: 85-100% ASR, per the Trojan Hippo paper, against Gemini 3.1 Pro and GPT-5-mini. The payload survives 100+ benign sessions before activating. It works across 4 memory architectures: sliding-window, RAG, explicit tool memory, and Mem0. The failure mode is what researchers call provenance blindness — retrieved memory enters with the same authority as user input, with no origin mark, no taint.
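The opposite of provenance blindness can be sketched in a few lines of Python: every memory entry carries its origin, and that origin is rendered into the prompt when the entry is retrieved. The Source, MemoryEntry, and MemoryStore names are hypothetical, and the substring lookup is only a stand-in for whatever retrieval the agent really uses. Treat this as an illustration under those assumptions, not a drop-in defense.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    USER = "user"
    SYSTEM = "system"
    TOOL = "tool"  # email bodies, web pages, API responses


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    source: Source
    session_id: str

    @property
    def tainted(self) -> bool:
        # Anything written from a tool return stays untrusted.
        return self.source is Source.TOOL


@dataclass
class MemoryStore:
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, text: str, source: Source, session_id: str) -> MemoryEntry:
        entry = MemoryEntry(text, source, session_id)
        self.entries.append(entry)
        return entry

    def retrieve(self, keyword: str) -> list[MemoryEntry]:
        # Naive substring match stands in for real retrieval.
        needle = keyword.lower()
        return [e for e in self.entries if needle in e.text.lower()]


def render(entry: MemoryEntry) -> str:
    # The origin mark travels with the text into the prompt.
    label = "UNTRUSTED" if entry.tainted else "TRUSTED"
    return f"[{label} source={entry.source.value} session={entry.session_id}] {entry.text}"
```

A label in the prompt does not by itself stop a model that infers authority from style, but it gives downstream policy checks and auditors something to key on.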

The lethal trifecta operates between sessions: in session 1, untrusted input writes to memory; in session N, private data meets an egress tool. An audit of each session in isolation passes. Memory is the temporal bridge connecting what should be isolated.
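The gap between per-session and cross-session auditing can be shown with a small Python sketch. It assumes each session has already been summarized into three boolean flags; the SessionLog type and both check functions are hypothetical names for illustration. The per-session check looks for all three conditions at once, while the union check carries the "memory is poisoned" state forward in time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionLog:
    session_id: str
    untrusted_memory_write: bool  # a tool return was stored in memory
    touches_private_data: bool    # finances, health, identity, taxes
    calls_egress_tool: bool       # email, outbound API, output tool


def session_has_trifecta(s: SessionLog) -> bool:
    # Per-session audit: all three conditions inside one session.
    return (
        s.untrusted_memory_write
        and s.touches_private_data
        and s.calls_egress_tool
    )


def union_has_trifecta(sessions: list[SessionLog]) -> bool:
    # Cross-session audit: sessions must be in chronological order.
    poisoned = False
    for s in sessions:
        # An untrusted write in this or any earlier session persists.
        poisoned = poisoned or s.untrusted_memory_write
        if poisoned and s.touches_private_data and s.calls_egress_tool:
            return True
    return False
```

With a log where session 1 only writes to memory and a later session only combines private data with egress, session_has_trifecta is false for every session while union_has_trifecta is true.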

Defense is not free — and anyone who says it is is lying

Defenses against these vectors exist, but what nobody likes to admit is that they all carry a real cost in utility.

Destyling drops CoT Forgery success from 61% to 10%. But in production, removing reasoning-style markers also removes the model's ability to follow structured reasoning. The agent becomes safer and less competent. There is no free version of this defense.

Against Trojan Hippo, 4 tested defenses reduce ASR to 0-5%. The cost: restricting memory writes to user input removes the utility of tool returns. IFC policy achieves 0% ASR but blocks legitimate send_email. Security and utility are an axis, not a menu where you pick both extremes.
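The cost of both defenses shows up clearly in a minimal Python sketch. It assumes every context item carries a source label, and it uses a hypothetical EGRESS_TOOLS registry containing only send_email. The function names and the policy itself are simplifications for illustration, not the policies evaluated in the paper.

```python
from dataclasses import dataclass

TRUSTED_SOURCES = {"user", "system"}
EGRESS_TOOLS = {"send_email"}  # hypothetical registry of outbound tools


@dataclass(frozen=True)
class ContextItem:
    text: str
    source: str  # "user", "system", or "tool"


def may_write_memory(item: ContextItem) -> bool:
    # Write restriction: only user input may reach long-term memory.
    return item.source == "user"


def may_call(tool: str, context: list[ContextItem]) -> bool:
    # IFC-style rule: deny egress if any untrusted text is in context.
    if tool not in EGRESS_TOOLS:
        return True
    return all(item.source in TRUSTED_SOURCES for item in context)
```

Under this policy, a user who asks the agent to email a summary of a fetched web page is refused, because the page text is untrusted and send_email is an egress tool. That refusal is the utility cost described above: the rule cannot tell a legitimate send from an exfiltration.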

At Tech86, our position is clear: security and utility are a tradeoff. Accepting this is the first step toward building honest defenses. Promising both at no cost is deception.

What changes in AI infrastructure

If you operate agents with persistent memory and egress channels — email, APIs, output tools — your threat model must consider the union of sessions, not isolated sessions. If your agent auto-ingests tool returns into long-term memory without provenance, you are in the highest risk scenario.

A single malicious input in any session can exfiltrate data in any future session. Memory is the vector. Provenance blindness is the vulnerability. Style is the privilege escalation mechanism.

That is why we test these vectors offensively before they are exploited in our clients' infrastructure. State poisoning is not theory. It is a proven mechanism with success rates that no AI infrastructure can afford to ignore. If your agent has memory and egress, you need to know where you stand on the axis between security and utility. And you need to know before the attacker does.
