THREAT
LLM Jailbreak
restriction bypass, DAN, model manipulation
LLM jailbreaking encompasses techniques that induce a language model to violate its training-time behavioral constraints — safety guardrails, role restrictions, content policies — through adversarial prompt engineering.
ADVERSARIAL MECHANICS
Established vectors include role-play framing (for example DAN, "Do Anything Now"), hypothetical distancing, and token boundary exploitation. RLHF partially mitigates these in frontier models, but no model can offer a cryptographic guarantee that it will refuse, which is why enforcement is also required at a layer-zero proxy outside the model.
PROTOCOL CONTEXT (MCPVANGUARD / JAILBREAK.YAML)
McpVanguard's jailbreak.yaml signature set is updated continuously and applied at the deterministic proxy layer, outside the model's probabilistic reasoning path. Because matching happens outside the model, a prompt that manipulates the model cannot change the detection result.
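As a rough illustration of deterministic matching at a proxy, the sketch below scans a message against a fixed list of regular expressions before it would reach a model. It is a minimal sketch under stated assumptions: the `Signature` type, the two patterns, and the `scan` and `enforce` functions are all hypothetical and are not taken from jailbreak.yaml or from McpVanguard's actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A named pattern. Hypothetical structure, not the jailbreak.yaml schema."""
    sig_id: str
    pattern: re.Pattern


# Illustrative patterns only; a real signature set would be far larger.
SIGNATURES = (
    Signature(
        "role-play-dan",
        re.compile(r"\bdo anything now\b|\byou are (now )?DAN\b", re.IGNORECASE),
    ),
    Signature(
        "ignore-instructions",
        re.compile(r"\bignore (all |any )?(previous|prior) instructions\b", re.IGNORECASE),
    ),
)


def scan(text: str) -> list[str]:
    """Return the IDs of every signature that matches, in declaration order."""
    return [s.sig_id for s in SIGNATURES if s.pattern.search(text)]


def enforce(text: str) -> bool:
    """Return True if the message may be forwarded to the model."""
    return not scan(text)
```

The same input always yields the same verdict, which is the property the section relies on. A pattern list like this still only catches phrasings it has signatures for.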
PROVNAI MITIGATION
ProvnAI's defense combines deterministic content filtering with identity-bound accountability. Even if a jailbreak partially succeeds at the model layer, any resulting tool call must pass through McpVanguard's authority validation.
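One way to picture identity-bound validation of tool calls is a default-deny check keyed on the caller's identity, paired with an audit record that names that identity. The sketch below assumes a simple in-memory grants table; the `ToolCall` type, the `GRANTS` mapping, the agent and tool names, and both functions are hypothetical and do not describe McpVanguard's real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """A tool invocation attributed to an agent identity (hypothetical shape)."""
    agent_id: str
    tool: str


# Hypothetical grants table: agent identity -> tools it may invoke.
GRANTS = {
    "support-bot": frozenset({"search_docs", "create_ticket"}),
    "billing-bot": frozenset({"lookup_invoice"}),
}


def authorize(call: ToolCall) -> bool:
    """Default-deny: allow only tools explicitly granted to this identity."""
    return call.tool in GRANTS.get(call.agent_id, frozenset())


def audit_record(call: ToolCall) -> dict:
    """Build a log entry that ties the decision to the calling identity."""
    return {
        "agent_id": call.agent_id,
        "tool": call.tool,
        "allowed": authorize(call),
    }
```

Under this sketch, a jailbroken model that emits a call to an ungranted tool is still refused, because the decision depends only on the identity and the grants table, not on anything the model says.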