Technology thesis · Artificial Intelligence
high conviction growth
Synthetic data generation
Synthetic data is now structural to frontier post-training, robotics sim2real and regulated-industry analytics. The open question is what margins remain for specialist vendors as frontier labs generate their own.
Position maintained continuously · last reviewed Jun 24, 2026
The thesis
Core thesis
Synthetic data addresses the three biggest AI training problems: data scarcity (rare events, edge cases), privacy (healthcare, finance), and bias (under-represented populations). It is now structural in frontier post-training, where RLAIF, distillation and self-generated reasoning traces are mainstream. The independent specialist tier has consolidated into platforms: NVIDIA acquired Gretel (March 2025), SAS absorbed Hazy's software (late 2024), and Syntho took the MOSTLY AI brand (June 2026); Tonic.ai remains independent. The risk: synthetic data can amplify biases in the seed data, and the open commercial question is margins for specialist vendors as frontier labs increasingly generate their own.
State of the art (2026)
Synthetic data is now load-bearing in three distinct markets, not one. In frontier post-training it is mainstream: RLAIF, distillation and self-generated reasoning traces dominate alignment work at Anthropic, OpenAI, Google DeepMind and DeepSeek, and the feared model collapse has not materialised in practice when synthetic and real data are mixed. In physical AI, NVIDIA released Cosmos 3 in June 2026 as an open world-foundation model generating physics-aware training data for robotics and autonomous fleets. In regulated tabular data, the MOSTLY AI, Gretel, Tonic.ai and Hazy product lines sell privacy-preserving generation into finance and healthcare. The labelling layer consolidated in June 2025, when Meta took a 49% stake in Scale AI (valuing it at $29bn) and hired Alexandr Wang.