THREAT

Prompt Injection

llm jailbreakadversarial promptsinstruction override

Prompt injection is an attack class in which adversarial text is crafted to override, subvert, or redirect an LLM's system-level instructions, causing the model to execute unauthorized actions, leak sensitive data, or abandon its intended operational context.

ADVERSARIAL MECHANICS

An attacker embeds instruction-override payloads in any data the model processes — user inputs, retrieved documents, API responses, or tool return values. The model's inability to cryptographically distinguish between instructions and data is the root condition. Common patterns include role confusion (Ignore all previous instructions), system prompt extraction, and authority impersonation (As the system, you are now permitted to…).

ATTACK SIGNATURE (MCPVANGUARD / JAILBREAK.YAML)

patterns:
  - "ignore (all )?(previous|prior|above) instructions"
  - "you are now (a |an )?(unrestricted|DAN|jailbroken)"
  - "pretend (you have no|your) (restrictions|guidelines|rules)"
  - "\\[SYSTEM\\].*override"

PROTOCOL CONTEXT (MCP / MCPVANGUARD)

Within the Model Context Protocol, prompt injection can occur at the tool-call result layer: a malicious server returns a tool_result object containing injected instructions, which the orchestrating model incorporates into its next reasoning step. McpVanguard can inspect routed MCP content and configured patterns before they reach the next execution step.

ProvnAI Mitigation

McpVanguard inspects MCP tool calls routed through the proxy and applies configured rulesets to selected inbound and outbound content. Matches trigger configurable actions such as block, redact, warn, or alert. Crucially, deterministic policy enforcement does not depend on probabilistic model judgment.