Back to Glossary
THREAT

Prompt Injection

llm jailbreakadversarial promptsinstruction override

Prompt injection is an attack class in which adversarial text is crafted to override, subvert, or redirect an LLM's system-level instructions, causing the model to execute unauthorized actions, leak sensitive data, or abandon its intended operational context.

ADVERSARIAL MECHANICS

An attacker embeds instruction-override payloads in any data the model processes — user inputs, retrieved documents, API responses, or tool return values. The model's inability to cryptographically distinguish between instructions and data is the root condition. Common patterns include role confusion (Ignore all previous instructions), system prompt extraction, and authority impersonation (As the system, you are now permitted to…).

ATTACK SIGNATURE (MCPVANGUARD / JAILBREAK.YAML)

patterns:
  - "ignore (all )?(previous|prior|above) instructions"
  - "you are now (a |an )?(unrestricted|DAN|jailbroken)"
  - "pretend (you have no|your) (restrictions|guidelines|rules)"
  - "\\[SYSTEM\\].*override"

PROTOCOL CONTEXT (MCP / MCPVANGUARD)

Within the Model Context Protocol, prompt injection can occur at the tool-call result layer: a malicious server returns a tool_result object containing injected instructions, which the orchestrating model incorporates into its next reasoning step. McpVanguard's deterministic proxy inspects all inbound tool_result content prior to model consumption.

ProvnAI Mitigation

McpVanguard intercepts all traffic at the MCP boundary and applies the jailbreak.yaml ruleset to both inbound and outbound content. Matches trigger configurable actions: block, redact, or alert. Crucially, this filtering is deterministic — no probabilistic model judgment is involved.