Evals: Translating Product Goals to ML Teams

Nick Turley, head of ChatGPT at OpenAI, outlines a distinctive approach to building AI products that prioritizes rapid shipping, learning from real-world usage, and iterative improvement over traditional product development cycles.

Core Principles of AI Product Development

Ship Fast, Learn Fast

"This is a pattern with AI where you won't know what to polish until after you ship"
"The only way to find out what people like and what's valuable is to bring it into the external world"
"You're gonna be polishing the wrong things in the space... you won't know what to polish until after you ship"
ChatGPT shipped just 10 days after the decision to launch, even though many features were not ready
Prioritize getting real-world feedback over perfecting features in isolation

"Is It Maximally Accelerated?"

Use this question as a forcing function to understand what's critical path versus what can happen later
"I just really wanna jump to the punchline of 'why can't we do this now?'"
Creates a culture where teams constantly question if they're moving as quickly as possible
Became a team meme with its own Slack emoji to push for faster execution
Helps cut through blockers and bureaucracy, especially for people coming from larger companies

Empiricism Over Speculation

"You really have to ship to understand what is even possible and what people want rather than being able to reason about that a priori"
Real-world usage reveals unexpected use cases that wouldn't emerge in internal testing
Models improve through exposure to real failure cases, not just benchmark testing
"The benchmarks are increasingly saturated so really you need real world scenarios"
Learning happens both in-product and through social channels such as TikTok comments

Balance Speed with Safety

Apply different processes for different contexts: "Process is a tool"
Move quickly on product features, but maintain rigorous processes for safety
"For frontier models there actually needs to be a rigorous process where you red team, work on the system card, get external input"
Don't use speed as an excuse to skip critical safety evaluations

Treat the Model as the Product

"There really is no distinction between the model and the product; the model is the product"
Iterate on the model like a product by identifying key use cases and systematically improving them
Focus on both capabilities and "vibes": the personality and feel of the model
Understand that model behavior is central to user experience, not just a technical component

Implementation Framework

1. Start with Open-Ended Exploration

Begin with a broad, flexible interface rather than overly specific use cases
Allow users to discover their own applications rather than prescribing them
"ChatGPT feels a little bit like MS DOS; we haven't built Windows yet"
Let usage patterns emerge naturally before optimizing

2. Use Evals to Communicate Product Goals

"I started writing evals before I knew what an eval was"
Evals are simply "articulating success before you do anything else"
They become the "lingua franca" between product and research teams
Clearly specify ideal behavior for various use cases
Use evals to communicate what the product should be doing to researchers
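
One way to read "articulating success before you do anything else" is as a short list of prompts, each paired with a check on the ideal behavior. The sketch below is illustrative only and is not something described in the interview: the names EvalCase, run_evals, pass_rate, and toy_model are hypothetical, the two cases are invented, and the string checks stand in for whatever grading a real team would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One product expectation: a prompt plus a check on the response."""
    name: str
    prompt: str
    passes: Callable[[str], bool]  # returns True when the behavior is acceptable


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the model and record pass/fail by case name."""
    return {case.name: case.passes(model(case.prompt)) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed (0.0 when there are no cases)."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical cases a product team might write down before any modeling work.
CASES = [
    EvalCase("cites_a_source", "Who wrote Hamlet? Cite a source.",
             lambda r: "source:" in r.lower()),
    EvalCase("stays_concise", "Summarize photosynthesis in one sentence.",
             lambda r: r.count(".") <= 1),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); returns one canned answer."""
    return "Shakespeare wrote Hamlet. Source: Folger Library."


RESULTS = run_evals(toy_model, CASES)
```

Here the canned answer satisfies the citation case but fails the conciseness case, so the pass rate is 0.5. The point of the shape is that a per-case pass/fail table is something both product and research can read, which fits the "lingua franca" framing above.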

3. Build Interdisciplinary Teams

"The interdisciplinariness of really making sure that you put research and engineering and design and product together rather than treating them as silos"
Hire for curiosity over specific AI experience
"If a feature doesn't get 2x better as the model gets 2x smarter, it's probably not a feature we should be shipping"
Think like a jazz band rather than an orchestra: ideas can come from anywhere

4. Follow a Three-Part Retention Strategy

Model improvements (1/3): Systematically improve the model on use cases people care about
New capabilities (1/3): Add research-driven features like search and personalization
Traditional product work (1/3): Reduce friction, improve UI, and apply standard product practices

5. Polish After Learning

"Shipping is just one point on the journey towards awesomeness"
Once you understand what people are doing, there's no excuse not to polish
Don't use velocity as an excuse for permanent roughness
Follow through on refinement after initial learning

When to Apply This Approach

When working with emergent technology where capabilities aren't fully understood
In situations where user behavior can't be predicted in advance
When the cost of delay exceeds the cost of imperfection
When you need to establish product-market fit for a novel capability
When you're competing in a rapidly evolving space