THREAT
Indirect Prompt Injection
retrieval-augmented attackexternal content injectionRAG poisoning
Indirect prompt injection is a vector in which adversarial instructions are embedded in external content — web pages, documents, emails, database records — that an AI agent retrieves and processes autonomously.
ADVERSARIAL MECHANICS
An attacker places invisible or camouflaged instruction text in a publicly accessible resource — a white-paper footnote, an HTML comment, or a shared document. When an agent retrieves this resource via a tool call (e.g., browse_web), the payload enters the context window. Encoding variants (Unicode homoglyphs, zero-width characters) are used to evade naive matching.
PROTOCOL CONTEXT (MCP)
In MCP-based systems, this attack targets tool_result payloads from resource-access tools. Because MCP tools operate with elevated trust by design, injections arriving via tool_result are processed with high fidelity by the model.
ProvnAI Mitigation
McpVanguard applies content-layer scanning to all tool_result objects, not only user-originated inputs. The jailbreak.yaml ruleset and supplementary retrieval-context signatures detect injection patterns in retrieved content before they reach the model context.