Model Advances Require New Data Types
by Garrett Lord on August 24, 2025
The shift from generalist to expert data labeling in AI model training
As AI models have evolved, the focus has shifted from pre-training on existing internet data to post-training with specialized expert knowledge. This transition represents a significant opportunity for companies with access to expert networks.
The evolution of AI model training data
- Pre-training phase (past focus):
  - Ingesting "the entire corpus of written human knowledge"
  - Absorbing content from books, videos, and websites
  - Hitting diminishing returns once models had consumed most available internet data
  - "They had essentially sucked up all of the knowledge on the internet"
- Post-training phase (current focus):
  - Augmenting models with high-quality specialized data
  - Targeting specific capability areas (coding, mathematics, law, finance)
  - Requiring domain experts rather than generalists
  - "The models have gotten so good that the generalists are no longer needed"
Types of valuable post-training data
- Reinforcement learning from human feedback (RLHF)
  - Preference ranking data (which response is better, A or B?)
  - Helps models understand quality differences between outputs
- Supervised fine-tuning (SFT)
  - Prompt-response pairs created by experts
  - Step-by-step reasoning for complex problems
  - Detailed explanations showing correct thought processes
- Trajectory data
  - Complete workflows showing how experts solve problems
  - Screen recordings with mouse movements and tool interactions
  - Narration of decision-making processes
  - "People narrating over their step-by-step tool use"
- Domain-specific knowledge
  - Targeting areas where models consistently fail
  - Providing ground truth answers in specialized fields
  - Identifying reasoning errors in complex domains
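The first three data types listed above can be pictured as simple record shapes. The sketch below is illustrative only: the class names, field names, and toy values are assumptions made for this example and do not describe any particular labeling platform or training pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PreferencePair:
    """Hypothetical RLHF-style record: an expert picks the better of two responses."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # expected to be "A" or "B"

    def chosen_and_rejected(self) -> Tuple[str, str]:
        """Return (chosen, rejected) according to the expert's label."""
        if self.preferred == "A":
            return self.response_a, self.response_b
        if self.preferred == "B":
            return self.response_b, self.response_a
        raise ValueError(f"preferred must be 'A' or 'B', got {self.preferred!r}")


@dataclass
class SFTExample:
    """Hypothetical SFT record: an expert-written prompt-response pair with reasoning."""
    prompt: str
    reasoning_steps: List[str]
    final_answer: str

    def to_training_text(self) -> str:
        """Flatten the prompt, numbered reasoning steps, and answer into one string."""
        steps = "\n".join(
            f"{i}. {step}" for i, step in enumerate(self.reasoning_steps, start=1)
        )
        return f"{self.prompt}\n{steps}\nAnswer: {self.final_answer}"


@dataclass
class TrajectoryStep:
    """One narrated action in an expert workflow (fields are assumptions)."""
    tool: str
    action: str
    narration: str


@dataclass
class Trajectory:
    """Hypothetical trajectory record: an ordered list of narrated tool actions."""
    task: str
    steps: List[TrajectoryStep] = field(default_factory=list)


# Toy values, invented purely for illustration.
pair = PreferencePair(
    prompt="What is 7 * 8?", response_a="54", response_b="56", preferred="B"
)
chosen, rejected = pair.chosen_and_rejected()

sft = SFTExample(
    prompt="Solve 2x + 3 = 11.",
    reasoning_steps=[
        "Subtract 3 from both sides: 2x = 8.",
        "Divide both sides by 2: x = 4.",
    ],
    final_answer="x = 4",
)

trajectory = Trajectory(
    task="Sum a column in a spreadsheet",
    steps=[
        TrajectoryStep("spreadsheet", "select cell B11", "I pick the cell below the data."),
        TrajectoryStep("spreadsheet", "enter =SUM(B1:B10)", "I total the ten rows above."),
    ],
)
```

Keeping the preference label separate from the two responses, rather than storing only the winner, preserves the rejected response, which is what makes the comparison usable as a quality signal.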
The strategic advantage of expert networks
- Access to specialized talent is the primary moat
  - "The only moat in human data is access to an audience"
  - Traditional providers spend millions on recruitment and advertising
  - Established networks have dramatically lower acquisition costs
- Expert quality matters more than quantity
  - PhDs and specialists can identify model weaknesses others can't
  - "If you're a PhD in physics, you can go in and prove where the model's actually breaking"
  - Domain experts can create higher-value training data
- Community and experience drive retention
  - Experts expect different treatment than general laborers
  - Training and community building improve data quality
  - Higher retention rates lead to better lifetime value
Future of AI data collection
- Data types will continue to evolve
  - "CAD files, scientific tool use data, esoteric operating systems"
  - Multimodal data combining text, audio, and video
  - Trajectory data showing complete expert workflows
- Human expertise will remain critical
  - "For as long as models are improving, humans will be needed in this process"
  - The scientific iteration process requires human judgment
  - Domain experts will continue finding edge cases and improvements
- The market will increasingly value specialized knowledge
  - Academic expertise in STEM fields
  - Professional domain knowledge in regulated industries
  - Practical experience with complex workflows