CanSecWest 2025 _newtype

Presentations

Robert Yuen 4/11/25 Robert Yuen 4/11/25

Role Reversal: Exploiting AI Moderation Rules as Attack Vectors.

The rapid deployment of frontier large language models (LLMs) agents across applications, impacting sectors projected by McKinsey to potentially add $4.4 trillion to the global economy, has mandated the implementation of sophisticated safety protocols and content moderation rules. However, documented attack success rates (ASR) reaching as high as 0.99 against models like ChatGPT and GPT-4 using universal adversarial triggers (Shen et al., 2023) underscore a critical vulnerability: the safety mechanisms themselves. While significant effort is invested in patching vulnerabilities, this presentation argues that the rules, filters, and patched protocols often become primary targets, creating a persistent and evolving threat landscape. This risk is amplified by a lowered barrier to entry for adversarial actors and the emergence of new attack vectors inherent to LLM reasoning capabilities. This presentation focuses on showcasing documented instances where security protocols and moderation rules, specifically designed to counter known LLM vulnerabilities, are paradoxically turned into attack vectors.

CanSecWest 2025 _newtype

Presentations

Role Reversal: Exploiting AI Moderation Rules as Attack Vectors.

dragostech.com inc.

Sponsors