AI Security Researcher on Team Validates Model Capabilities

AI guardrails fundamentally don't work against determined attackers, and this creates a dangerous false sense of security as AI systems gain more agency in the world.

The AI security industry has a critical problem: guardrails and defensive measures simply don't prevent determined attackers from manipulating AI systems. As Sander Schulhof explains, "If someone is determined enough to trick GPT-5, they're gonna deal with that guardrail no problem." The attack surface is effectively infinite—the number of possible attacks against a language model is equivalent to the number of possible prompts, which is "one followed by a million zeros." This means even if a guardrail claims to catch 99% of attacks, there are still effectively infinite attacks that will succeed.

The risk is currently masked by limited adoption and capabilities. The only reason we haven't seen massive attacks yet is how early the adoption is, not because systems are secure. As AI systems gain more agency—the ability to take actions like sending emails, updating databases, or controlling robots—the potential damage from prompt injection attacks increases dramatically. Unlike traditional software bugs that can be patched with near certainty, AI vulnerabilities persist: "You can patch a bug but you can't patch a brain. If you find some bug in your software and you go and patch it, you can be 99.99% sure that bug is solved. Try to do that in your AI system—you can be 99.99% sure that the problem is still there."

The most effective approach combines classical cybersecurity with AI expertise. Rather than investing in guardrails that create false confidence, organizations should focus on proper permissioning and containment. The CAMEL framework from Google represents a promising approach—it analyzes what permissions an AI system actually needs for a specific task and restricts capabilities accordingly. For example, if a user only asks for a summary of emails, the system would receive read-only permissions, preventing it from sending emails even if tricked.

For leaders implementing AI systems, the implications are clear: first, assess whether your AI system can actually cause damage. Simple chatbots that can't take actions present minimal risk beyond reputational harm. Second, ensure proper permissioning—any data an AI has access to can be leaked, and any actions it can take can be triggered by attackers. Third, consider implementing frameworks like CAMEL that limit permissions based on the specific task. Finally, place an AI security researcher on your team who understands both AI capabilities and security principles—this intersection is where the most valuable protection will come from.

The most dangerous misconception is believing that because you've implemented guardrails, your system is secure. This false confidence may lead to deploying systems with dangerous capabilities before proper security measures are in place. As AI systems become more powerful and widespread, the security challenges will only increase, requiring a fundamental rethinking of how we approach AI safety.