Evals: Translating Product Goals to ML Teams

by Nick Turley on August 9, 2025

Nick Turley, head of ChatGPT at OpenAI, outlines a distinctive approach to building AI products that prioritizes rapid shipping, learning from real-world usage, and iterative improvement over traditional product development cycles.

Core Principles of AI Product Development

Ship Fast, Learn Fast

  • "This is a pattern with AI where you won't know what to polish until after you ship"
  • "The only way to find out what people like and what's valuable is to bring it into the external world"
  • "You're gonna be polishing the wrong things in the space"
  • Shipped ChatGPT in just 10 days from decision to launch, despite many features not being ready
  • Prioritize getting real-world feedback over perfecting features in isolation

"Is It Maximally Accelerated?"

  • Use this question as a forcing function to understand what's critical path versus what can happen later
  • "I just really wanna jump to the punchline of 'why can't we do this now?'"
  • Creates a culture where teams constantly question if they're moving as quickly as possible
  • Became a team meme with its own Slack emoji to push for faster execution
  • Helps cut through blockers and bureaucracy, especially for people who come from larger companies

Empiricism Over Speculation

  • "You really have to ship to understand what is even possible and what people want rather than being able to reason about that a priori"
  • Real-world usage reveals unexpected use cases that wouldn't emerge in internal testing
  • Models improve through exposure to real failure cases, not just benchmark testing
  • "The benchmarks are increasingly saturated so really you need real world scenarios"
  • Learning happens both in-product and through social channels such as TikTok comments

Balance Speed with Safety

  • Apply different processes for different contexts: "Process is a tool"
  • Move quickly on product features, but maintain rigorous processes for safety
  • "For frontier models there actually needs to be a rigorous process where you red team, work on the system card, get external input"
  • Don't use speed as an excuse to skip critical safety evaluations

Treat the Model as the Product

  • "There really is no distinction between the model and the product; the model is the product"
  • Iterate on the model like a product by identifying key use cases and systematically improving them
  • Focus on both capabilities and "vibes": the personality and feel of the model
  • Understand that model behavior is central to user experience, not just a technical component

Implementation Framework

1. Start with Open-Ended Exploration

  • Begin with a broad, flexible interface rather than overly specific use cases
  • Allow users to discover their own applications rather than prescribing them
  • "ChatGPT feels a little bit like MS-DOS; we haven't built Windows yet"
  • Let usage patterns emerge naturally before optimizing

2. Use Evals to Communicate Product Goals

  • "I started writing evals before I knew what an eval was"
  • Evals are simply "articulating success before you do anything else"
  • They become the "lingua franca" between product and research teams
  • Clearly specify ideal behavior for various use cases
  • Use evals to show researchers what the product should be doing
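
As a rough illustration of "articulating success before you do anything else", the Python sketch below writes down ideal behavior for a few use cases as checkable cases and reports a pass rate per use case. Everything in it is a hypothetical assumption made for illustration: the EvalCase structure, the run_evals helper, the example prompts and checks, and the toy_model stand-in. It is not OpenAI's eval tooling, and real evals often rely on human or model graders rather than simple string checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One product expectation, written down before any model work starts."""
    use_case: str                 # product area the case belongs to (hypothetical labels)
    prompt: str                   # what the user asks
    ideal: str                    # plain-language description of the ideal behavior
    check: Callable[[str], bool]  # crude automated proxy for that ideal


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Return the pass rate per use case for the given model."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for case in cases:
        total[case.use_case] = total.get(case.use_case, 0) + 1
        if case.check(model(case.prompt)):
            passed[case.use_case] = passed.get(case.use_case, 0) + 1
    return {name: passed.get(name, 0) / count for name, count in total.items()}


# Hypothetical cases a product team might hand to a research team.
CASES = [
    EvalCase(
        use_case="coding",
        prompt="Write a Python function that reverses a string.",
        ideal="Responds with runnable code, not only a description.",
        check=lambda out: "def " in out,
    ),
    EvalCase(
        use_case="coding",
        prompt="Explain what a list comprehension is.",
        ideal="Includes a concrete list comprehension example.",
        check=lambda out: "[" in out and "for" in out,
    ),
    EvalCase(
        use_case="writing",
        prompt="Draft a one-sentence thank-you note to a colleague.",
        ideal="Actually thanks the colleague.",
        check=lambda out: "thank" in out.lower(),
    ),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it always answers with the same snippet."""
    return "def reverse(s):\n    return s[::-1]"


if __name__ == "__main__":
    for use_case, rate in run_evals(toy_model, CASES).items():
        print(f"{use_case}: {rate:.0%} of cases passed")
```

The per-use-case pass rate is the kind of shared number a product team and a research team could both point at when deciding which behavior to improve next. The "ideal" field keeps the human-readable intent next to the automated check, so the check can be replaced later without losing the goal it was meant to approximate.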

3. Build Interdisciplinary Teams

  • "The interdisciplinariness of really making sure that you put research and engineering and design and product together rather than treating them as silos"
  • Hire for curiosity over specific AI experience
  • "If a feature doesn't get 2x better as the model gets 2x smarter, it's probably not a feature we should be shipping"
  • Think like a jazz band rather than an orchestra: ideas can come from anywhere

4. Follow a Three-Part Retention Strategy

  • Model improvements (1/3): Systematically improve the model on use cases people care about
  • New capabilities (1/3): Add research-driven features like search and personalization
  • Traditional product work (1/3): Reduce friction, improve UI, and apply standard product practices

5. Polish After Learning

  • "Shipping is just one point on the journey towards awesomeness"
  • Once you understand what people are doing, there's no excuse not to polish
  • Don't use velocity as an excuse for permanent roughness
  • Follow through on refinement after initial learning

When to Apply This Approach

  • When working with emergent technology where capabilities aren't fully understood
  • In situations where user behavior can't be predicted in advance
  • When the cost of delay exceeds the cost of imperfection
  • When you need to establish product-market fit for a novel capability
  • When you're competing in a rapidly evolving space