Skip to content

Training AI Models Early for Adversarial Robustness

by Sander Schulhof on December 21, 2025

The fundamental problem with AI security is that current guardrail approaches are fundamentally insufficient against determined attackers, with the only reason major attacks haven't occurred being the early stage of AI adoption.

Why AI Security Is Different from Traditional Security

  • "You can patch a bug but you can't patch a brain"

    • In traditional software: When you find and patch a bug, you can be 99.99% sure the bug is solved
    • In AI systems: When you find and "fix" a vulnerability, you can be 99.99% sure the problem is still there
    • The attack surface is effectively infinite - "one followed by a million zeros" possible attack prompts
  • Current AI security measures fail because:

    • Guardrails (AI systems that check inputs/outputs) can always be circumvented by determined attackers
    • Automated red teaming works too well - it will always find vulnerabilities in any transformer-based model
    • Prompt-based defenses (adding instructions like "don't do harmful things") are the weakest defense type

The Real Risk Landscape

  • The risk increases dramatically with AI agents that can take actions:

    • Simple chatbots: Limited damage potential (mostly reputational)
    • AI agents with permissions: Can be tricked into taking harmful actions (sending emails, modifying databases)
    • AI-powered browsers: Can be tricked into exfiltrating user data (as seen in the Comet browser incident)
    • AI-powered robots: Could potentially cause physical harm if compromised
  • Attack vectors include:

    • Direct jailbreaking: User directly tricks the model into harmful outputs
    • Prompt injection: Attacker gets model to ignore developer instructions
    • Indirect prompt injection: External content (emails, websites) containing hidden instructions that trick agents

Practical Security Approaches

  • Focus on classical cybersecurity principles:

    • Proper permissioning - limit what AI systems can access and do
    • Containerization - isolate AI execution environments
    • Monitoring and logging - track all AI inputs and outputs
  • Consider implementing Camel (from Google):

    • Analyzes user requests and restricts agent permissions to only what's needed
    • Example: If user only asks to summarize emails, give read-only permissions (no sending)
    • Works well for single-purpose tasks but less effective for complex multi-permission scenarios
  • Prioritize education and awareness:

    • Ensure teams understand AI vulnerabilities
    • Hire people at the intersection of AI research and cybersecurity
    • Recognize when AI systems don't need advanced security (read-only chatbots)

The Future of AI Security

  • No meaningful progress has been made on solving adversarial robustness in years

  • Potential approaches include:

    • Adversarial training earlier in the model development process
    • New model architectures that are inherently more robust
    • More realistic evaluation methods (adaptive rather than static)
  • Market prediction:

    • A correction in the AI security industry as companies realize guardrails don't work
    • Increasing incidents as more powerful agents are deployed without adequate protection
    • Growing importance of professionals who understand both AI and cybersecurity