Customers Push Model Boundaries Through Early Access
by Mike Krieger on June 5, 2025
Anthropic's Early Access Program: Using Customer Testing to Drive Model Improvements
Anthropic has discovered that one of the most effective ways to improve their AI models is through an expanded early access program that lets customers push their models to the breaking point. This approach has become a critical part of their development cycle, creating a virtuous feedback loop between real-world usage and model improvements.
Mike Krieger explains that while benchmarks like SWE-bench and TauBench are valuable, what ultimately matters are customer-specific use cases - what he jokingly calls "CursorBench," "ManusBench," or "HarveyBench" for their different partners. These real-world applications reveal limitations and opportunities that internal testing might miss.
The most valuable early access partners are those willing to "build at the edge of capabilities" - companies that try to use the models for challenging tasks, hit walls with current versions, and then are delighted when new model releases suddenly make previously impossible tasks feasible. As Krieger notes, "Those companies were trying it beforehand and then hitting a wall and being like 'oh the models are like almost good enough'... those are the companies that I think continuously are the ones where I'm like 'yep, they get it.'"
The process involves several key components:
- Pushing models to their limits to identify breaking points
- Creating repeatable evaluation processes to measure improvements
- Capturing traces that can be rerun against new model versions (see the sketch after this list)
- Combining quantitative testing with qualitative "vibes" assessment
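To make the trace-replay idea concrete, here is a minimal sketch of how captured prompts could be rerun against a new model version and scored using the Anthropic Python SDK. The JSONL trace format, the `passes` check, and the specific model names are illustrative assumptions for this example, not Anthropic's actual tooling.

```python
# Minimal sketch: replay captured prompt traces against different model
# versions and compare pass rates. The trace file format and scoring
# function are illustrative assumptions, not Anthropic's internal tooling.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def passes(expected_keyword: str, response_text: str) -> bool:
    # Hypothetical check: a real eval would use task-specific grading,
    # e.g. executing generated code or comparing structured output.
    return expected_keyword.lower() in response_text.lower()

def replay_traces(trace_path: str, model: str) -> float:
    """Rerun each captured prompt against `model` and return the pass rate."""
    total = passed = 0
    with open(trace_path) as f:
        for line in f:
            trace = json.loads(line)  # e.g. {"prompt": ..., "expected_keyword": ...}
            reply = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": trace["prompt"]}],
            )
            text = "".join(b.text for b in reply.content if b.type == "text")
            total += 1
            passed += passes(trace["expected_keyword"], text)
    return passed / total if total else 0.0

if __name__ == "__main__":
    # Compare an older snapshot with a newer release on the same traces.
    for model in ("claude-3-5-sonnet-latest", "claude-opus-4-0"):
        rate = replay_traces("traces.jsonl", model)
        print(f"{model}: {rate:.0%} of captured traces passed")
```

Keeping traces as data rather than one-off scripts is what makes the quantitative half of the loop repeatable across model releases; the qualitative "vibes" check then happens on top of those reruns.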
One particularly memorable moment came when an engineer in their early access program was overheard screaming "What?! This model - I've never seen this before!" when testing Opus 4. This kind of reaction signals a meaningful breakthrough.
The approach has also required Anthropic to rework their own development infrastructure. Their merge queue had to be rebuilt to handle the volume of code now being generated and submitted - over 70% of their pull requests are AI-generated.
For companies looking to maximize the value of AI models, the lesson is clear: don't just use models within their comfort zone. Push them to their limits, establish clear ways to measure success, and maintain a systematic approach to testing each new model version against your specific use cases.