Evals: Translating Product Goals to ML Teams
by Nick Turley on August 9, 2025
Nick Turley, head of ChatGPT at OpenAI, outlines a distinctive approach to building AI products that prioritizes rapid shipping, learning from real-world usage, and iterative improvement over traditional product development cycles.
Core Principles of AI Product Development
Ship Fast, Learn Fast
- "This is a pattern with AI where you won't know what to polish until after you ship"
- "The only way to find out what people like and what's valuable is to bring it into the external world"
- "You're gonna be polishing the wrong things in the space"
- The team shipped ChatGPT just 10 days after deciding to launch, even though many features were not ready
- Prioritize getting real-world feedback over perfecting features in isolation
"Is It Maximally Accelerated?"
- Use this question as a forcing function to understand what's critical path versus what can happen later
- "I just really wanna jump to the punchline of 'why can't we do this now?'"
- Creates a culture where teams constantly question if they're moving as quickly as possible
- Became a team meme with its own Slack emoji to push for faster execution
- Helps cut through blockers and bureaucracy, especially for people who join from larger companies
Empiricism Over Speculation
- "You really have to ship to understand what is even possible and what people want rather than being able to reason about that a priori"
- Real-world usage reveals unexpected use cases that wouldn't emerge in internal testing
- Models improve through exposure to real failure cases, not just benchmark testing
- "The benchmarks are increasingly saturated so really you need real world scenarios"
- Learning happens both in-product and through social channels (TikTok comments and other social media)
Balance Speed with Safety
- Apply different processes for different contexts: "Process is a tool"
- Move quickly on product features, but maintain rigorous processes for safety
- "For frontier models there actually needs to be a rigorous process where you red team, work on the system card, get external input"
- Don't use speed as an excuse to skip critical safety evaluations
Treat the Model as the Product
- "There really is no distinction between the model and the product; the model is the product"
- Iterate on the model like a product by identifying key use cases and systematically improving them
- Focus on both capabilities and "vibes" - the personality and feel of the model
- Understand that model behavior is central to user experience, not just a technical component
Implementation Framework
1. Start with Open-Ended Exploration
- Begin with a broad, flexible interface rather than overly specific use cases
- Allow users to discover their own applications rather than prescribing them
- "ChatGPT feels a little bit like MS DOS; we haven't built Windows yet"
- Let usage patterns emerge naturally before optimizing
2. Use Evals to Communicate Product Goals
- "I started writing evals before I knew what an eval was"
- Evals are simply "articulating success before you do anything else"
- They become the "lingua franca" between product and research teams
- Clearly specify ideal behavior for various use cases
- Use evals to show researchers what the product should be doing
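To make "articulating success before you do anything else" concrete, here is a minimal sketch of what a product-written eval could look like. It is not how OpenAI structures its evals. The names `EvalCase`, `run_evals`, `stub_model`, and the two sample cases are all hypothetical, and the model is a canned stand-in rather than a real API call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One product expectation, written down before any model work starts."""
    name: str
    prompt: str
    passes: Callable[[str], bool]  # grader: True if the response is acceptable


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict:
    """Run every case against the model; report per-case results and pass rate."""
    results = {case.name: case.passes(model(case.prompt)) for case in cases}
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}


# Hypothetical cases: each one states ideal behavior for a use case.
CASES = [
    EvalCase(
        name="answer_includes_unit",
        prompt="How long is a marathon?",
        passes=lambda r: "km" in r.lower() or "mile" in r.lower(),
    ),
    EvalCase(
        name="summary_is_concise",
        prompt="Summarize: the meeting moved to 3pm.",
        passes=lambda r: len(r.split()) <= 20,
    ),
]


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns canned text."""
    canned = {"How long is a marathon?": "A marathon is 42.195 km."}
    default = (
        "The meeting has been rescheduled and will now take place at 3pm "
        "instead of the originally planned time, so please update your "
        "calendars accordingly today."
    )
    return canned.get(prompt, default)


if __name__ == "__main__":
    report = run_evals(stub_model, CASES)
    for name, ok in report["results"].items():
        print(f"{name}: {'PASS' in ('PASS' if ok else '') and 'PASS' or 'FAIL'}")
    print(f"pass rate: {report['pass_rate']:.0%}")
```

In this sketch the product side owns `CASES` and the research side owns whatever replaces `stub_model`, so a failing case such as `summary_is_concise` becomes a shared, specific target rather than a vague request to "make it better".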
3. Build Interdisciplinary Teams
- "The interdisciplinariness of really making sure that you put research and engineering and design and product together rather than treating them as silos"
- Hire for curiosity over specific AI experience
- "If a feature doesn't get 2x better as the model gets 2x smarter, it's probably not a feature we should be shipping"
- Think like a jazz band rather than an orchestra - ideas can come from anywhere
4. Follow a Three-Part Retention Strategy
- Model improvements (1/3): Systematically improve the model on use cases people care about
- New capabilities (1/3): Add research-driven features like search and personalization
- Traditional product work (1/3): Reduce friction, improve UI, and apply standard product practices
5. Polish After Learning
- "Shipping is just one point on the journey towards awesomeness"
- Once you understand what people are doing, there's no excuse not to polish
- Don't use velocity as an excuse for permanent roughness
- Follow through on refinement after initial learning
When to Apply This Approach
- When working with emergent technology where capabilities aren't fully understood
- In situations where user behavior can't be predicted in advance
- When the cost of delay exceeds the cost of imperfection
- When you need to establish product-market fit for a novel capability
- When you're competing in a rapidly evolving space