Skip to content

Adaptive Evaluation for Adversarial Robustness

by Sander Schulhof on December 21, 2025

The fundamental challenge of AI security is that guardrails and current defensive measures cannot reliably protect against determined attackers who want to manipulate AI systems through prompt injection and jailbreaking.

The Reality of AI Security Vulnerabilities

  • All transformer-based AI models are vulnerable to adversarial attacks

    • Prompt injection: Tricking an AI application to ignore its system instructions
    • Jailbreaking: Directly manipulating a model to produce forbidden outputs
    • The attack surface is effectively infinite - "one followed by a million zeros" possible attack vectors
  • Current security claims are misleading

    • When guardrail providers claim "99% effectiveness," they've only tested against a statistically insignificant number of possible attacks
    • "You can patch a bug but you can't patch a brain" - unlike traditional software vulnerabilities, AI vulnerabilities cannot be definitively fixed
    • The best measurement of security is adaptive evaluation (attackers that learn and improve over time)

Why We Haven't Seen Major Attacks Yet

  • Limited adoption and capabilities

    • "The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure"
    • Current AI systems often lack the capabilities to cause significant damage
    • As agents gain more permissions and abilities, the risk increases dramatically
  • The risk escalation path

    • Chatbots → Agents with permissions → AI-powered browsers → Physical robots
    • Each step increases potential damage from successful attacks
    • ServiceNow example: An attack that manipulated an agent to recruit more powerful agents to perform database operations and send emails

Practical Security Approaches

  • Assess your actual risk exposure

    • Simple chatbots without action capabilities have limited damage potential
    • The real danger is with systems that can take actions on behalf of users
  • Apply classical cybersecurity principles

    • Proper permissioning is critical - limit what AI systems can access and do
    • The intersection of classical cybersecurity and AI security is where the most important work happens
    • Example: Running code in isolated containers rather than on application servers
  • Consider implementing Camel framework

    • Dynamically restrict agent permissions based on the specific user request
    • Only grant the minimum permissions needed for the requested task
    • Example: If a user asks to summarize emails, only grant read permissions, not send permissions
  • Focus on education over technical solutions

    • Ensure your team understands the fundamental limitations of AI security
    • Have AI security researchers on your team who understand both AI capabilities and security principles
    • View your AI system as "a god in a box that is angry and malicious" to properly assess risk

The Future of AI Security

  • Market correction for AI security companies is likely

    • Many guardrail and automated red-teaming companies will struggle as their limitations become apparent
    • Free open-source solutions often perform as well as commercial offerings
  • Frontier labs are making some progress

    • Anthropic's constitutional classifiers make it harder (but not impossible) to extract harmful information
    • Adversarial training earlier in model development shows promise
    • New architectures may eventually provide better solutions
  • Increased risk as capabilities grow

    • As AI systems gain more agency and physical capabilities, the potential for harm increases
    • We're entering an era where AI systems are powerful enough to cause real-world harm when compromised