THREAT

LLM Jailbreak

restriction bypass, DAN, model manipulation

LLM jailbreaking encompasses techniques that use adversarial prompt engineering to induce a language model to violate its training-time behavioral constraints, such as safety guardrails, role restrictions, and content policies.

ADVERSARIAL MECHANICS

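Established vectors include role-play framing (for example DAN, "Do Anything Now"), hypothetical distancing, and token boundary exploitation. Against frontier models, RLHF partially mitigates these vectors, but the mitigation is probabilistic: no model provides a cryptographic guarantee that it will refuse. Enforcement is therefore also required at a layer-zero proxy that sits outside the model.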

PROTOCOL CONTEXT (MCPVANGUARD / JAILBREAK.YAML)

McpVanguard's jailbreak.yaml signature set is updated continuously and applied at the deterministic proxy layer, outside the model's probabilistic reasoning path. Because matching runs outside the model, a prompt that manipulates the model cannot change the detection result.
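To make the idea of deterministic, proxy-side matching concrete, here is a minimal sketch in Python. It is not McpVanguard's implementation, and it does not reproduce the contents or schema of jailbreak.yaml, which are not shown here. The `Signature` type, the two example patterns, and the `match_signatures` function are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """One detection rule: an identifier plus a compiled regular expression."""
    sig_id: str
    pattern: re.Pattern


# Hypothetical signatures for illustration only; they are NOT taken from
# McpVanguard's jailbreak.yaml.
SIGNATURES = [
    Signature(
        "roleplay-dan",
        re.compile(r"\byou are (now )?DAN\b|\bdo anything now\b", re.IGNORECASE),
    ),
    Signature(
        "ignore-instructions",
        re.compile(
            r"\bignore (all |any )?(previous|prior) instructions\b",
            re.IGNORECASE,
        ),
    ),
]


def match_signatures(text: str) -> list[str]:
    """Return the IDs of every signature that matches, in definition order.

    The result depends only on the input text and the signature list, so the
    same input always produces the same verdict regardless of model state.
    """
    return [sig.sig_id for sig in SIGNATURES if sig.pattern.search(text)]
```

A proxy built along these lines would call `match_signatures` on each inbound message before forwarding it and block or flag the request when the returned list is non-empty. Real signature sets are typically broader than two regular expressions; the point of the sketch is only that the check runs as ordinary code rather than as model inference.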

PROVNAI MITIGATION

ProvnAI's defense combines deterministic content filtering with explicit execution controls. Even if a jailbreak partially succeeds at the model layer, any resulting tool call is still subject to McpVanguard's configured policy enforcement.
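The second layer described above, policy enforcement on tool calls, can be sketched in the same spirit. This example assumes a simple allowlist policy; the `ToolCall` type, the tool names, the `/workspace/` prefix, and the `authorize` function are hypothetical and do not describe McpVanguard's actual policy format.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A tool invocation requested by the model (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


# Hypothetical policy values, chosen for illustration only.
ALLOWED_TOOLS = {"read_file", "search_docs"}
ALLOWED_PATH_PREFIX = "/workspace/"


def authorize(call: ToolCall) -> tuple[bool, str]:
    """Decide whether a tool call may execute, independent of how it was produced."""
    if call.name not in ALLOWED_TOOLS:
        return False, f"tool '{call.name}' is not on the allowlist"
    if call.name == "read_file":
        # Normalize first so that '..' segments cannot escape the prefix.
        path = posixpath.normpath(str(call.arguments.get("path", "")))
        if not path.startswith(ALLOWED_PATH_PREFIX):
            return False, "path is outside the permitted workspace"
    return True, "allowed"
```

For instance, `authorize(ToolCall("run_shell", {"cmd": "rm -rf /"}))` is denied because `run_shell` is not on the allowlist, even if a jailbreak persuaded the model to request it. The check looks only at the requested action, not at the prompt that led to it.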