Don't system prompts protect against memory attacks?

No. When retrieval is active, retrieved memory overrides security instructions in the system prompt. MCFA demonstrated that tool override reaches 100% with retrieval ON — and drops to 0% with retrieval OFF. Memory is stronger than instruction.

Doesn't selective extraction filter out malicious memories?

Saliency-based filtering removes noise, not coherent payloads. MemPoison uses Semantic Relational Bridge to bind trigger and payload into semantically coherent statements, and Entity Masquerading so that rewriting preserves the trigger verbatim. The pipeline cannot separate them without losing context.

Does correcting the agent's behavior after detection fix it?

No. Post-hoc textual corrections fail in 100% of MCFA test cases. The agent relapses into malicious behavior on the next memory retrieval. Memory is a durable backdoor — the fix must target the memory itself, not the instruction.

Which LLMs and frameworks are vulnerable?

Testing covered GPT-5 mini, Claude Sonnet 4.5, and Gemini 2.5 Flash on LangChain and LlamaIndex frameworks. All are vulnerable. The vulnerability lies in the memory design, not in any specific model implementation.

Does dual-channel memory with role segregation solve it?

It reduces but does not eliminate the problem. Even with dual-channel memory and role-based segregation, over 85% of scenarios still show control flow deviations. It is a mitigation, not a solution.

MemPoison + MCFA: The Memory Attack Surface in LLM Agents

An attacker chats with your LLM agent. A few messages later, the agent records a false memory. When a legitimate user asks a related question, the agent retrieves the poisoned memory and executes the attacker's action. Success rate: up to 95%, per the MEMFLOW paper. And it gets worse: that memory hijacks the agent's control flow — forcing tool selection, reordering workflows, expanding scope across tasks. Over 90% of trials are vulnerable. Long-term memory in LLM agents is an attack surface, and the current defenses are insufficient.

MemPoison — poisoning that bypasses filters

LLM agents with long-term memory follow a pipeline: extract relevant information, rewrite for compression, store, and retrieve by embedding similarity. Prior research assumed an attacker would write directly to the memory store. In practice, selective extraction filters low-saliency content. Naive memories get discarded.

MemPoison solves this with three techniques. Semantic Relational Bridge binds trigger and payload into a coherent statement — the pipeline cannot separate them without losing context. Entity Masquerading optimizes the trigger to resemble a named entity; LLMs preserve named entities verbatim during rewriting, so the trigger survives. Joint Embedding Optimization packs poisoned texts into a tight cluster in embedding space, isolated from benign ones. Retrieval pulls the poisoned memory.

The result: ASR up to 0.95 across different domains and memory mechanisms. Perplexity filtering does not detect it — the texts are semantically coherent. Paraphrasing does not remove it — entity masquerading preserves the trigger after rewriting. MemPoison works against the very mechanisms designed to filter.

The technical detail that matters: MemPoison exploits anisotropy in embedding space and redistributes attention patterns. The poisoned cluster creates a high-density region that attracts related queries, diverting retrieval from legitimate memories. This is a structural vulnerability — any memory system based on embedding similarity is potentially vulnerable.

MCFA — when memory hijacks control flow

Memory Control Flow Attacks go beyond polluting RAG. Retrieved memory hijacks the agent's control flow — tool selection and execution. The attacker needs no access to the system prompt, tools, or memory store. Standard interaction suffices.

The MEMFLOW framework documented the numbers. Tool Override: 91.7% to 100%. Memory forces the agent to select tools it should not. With retrieval OFF, override drops to 0% — the deviation is caused by memory. Workflow Reordering: 52.8% to 69.4%. Memory reorders tool invocations, skipping security steps. Cross-Task Scope Expansion: 97.2% to 100%. An injection in one task generalizes to different templates, propagating across domains. Persistence: 100% over long horizons. Post-hoc textual corrections fail in 100% of cases — the agent relapses into malicious behavior on the next retrieval.

Tested on GPT-5 mini, Claude Sonnet 4.5, and Gemini 2.5 Flash, across LangChain and LlamaIndex frameworks. All vulnerable. The vulnerability is in the memory design, not the implementation.

Why current defenses fail

System prompts do not protect. When retrieval is active, memory overrides security instructions. The data is clear: tool override reaches 100% with retrieval ON and drops to 0% with retrieval OFF. Memory is stronger than instruction.

Selective extraction filters noise, not coherent payloads. MemPoison demonstrated that semantically coherent statements pass through the pipeline intact. Saliency-based filtering assumes malicious content has low saliency — that assumption does not hold.

Textual corrections do not work. MCFA showed that the agent relapses on the next retrieval. Memory is a durable backdoor — correcting the instruction does not remove the poisoned memory.

Even production-style mitigations like dual-channel memory with role-based segregation show 85%+ control flow deviations, per the paper. It reduces, not eliminates. The shared memory architecture across tasks is the fundamental problem.

The highest-risk scenario: multi-tenant agents

Agents serving multiple users from the same memory store are the most critical scenario. An attacker poisons memory in one interaction. All subsequent users are affected. MemPoison exploits this directly: the poisoned cluster in embedding space attracts queries from any user asking related questions.

The risk is compounded by persistence. Poisoned memory does not expire. Over long horizons, MCFA documents 100% persistence. Every retrieval reactivates the malicious behavior. There is no natural decay. And because MemPoison's Joint Embedding Optimization creates a dense cluster isolated from benign memories, the poisoned entries dominate retrieval results — legitimate memories get outranked by the attacker's payload.

This is not a theoretical concern. Any organization deploying agents that serve multiple users — customer support, internal tooling, automated workflows — shares this risk profile. The attack requires no special access, no exploit, no privilege escalation. Just a conversation.

What we verify at Tech86

We evaluate AI agent architectures with a focus on memory and retrieval attack surfaces. If your agents use persistent memory without user isolation, you are one conversation away from an attack with a 95% success rate. If they use memory with retrieval and high-risk tools, over 90% of scenarios are vulnerable to control flow hijacking.

The first step is mapping: which agents use persistent memory, what retrieval mechanism, whether the memory store is shared. Then isolate, monitor, and test adversarially. Without adversarial testing, you do not know whether your mitigations work — or merely reduce the problem to 85% deviations instead of 100%.

MemPoison + MCFA: The Memory Attack Surface in LLM Agents

MemPoison — poisoning that bypasses filters

MCFA — when memory hijacks control flow

Why current defenses fail

The highest-risk scenario: multi-tenant agents

What we verify at Tech86

Frequently Asked Questions

Don't system prompts protect against memory attacks?

Doesn't selective extraction filter out malicious memories?

Does correcting the agent's behavior after detection fix it?

Which LLMs and frameworks are vulnerable?

Does dual-channel memory with role segregation solve it?

Blog — Get in Touch

Schedule a Meeting

Email

WhatsApp

Address

Tech86 Specialist

We Value Your Privacy