Technology thesis · Artificial Intelligence
Medium conviction · mature
Reinforcement learning
RL-from-verifiable-rewards is now the dominant post-training paradigm for frontier reasoning; the open bet is whether it generalises past verifiable domains into fuzzy real-world reward.
Position maintained continuously · last reviewed Jun 24, 2026
The thesis
Core thesis
RL powers AlphaGo, robotics control, and the reasoning capabilities of o1-class models. RLHF (reinforcement learning from human feedback) is a key technique for aligning LLMs. As AI moves toward agents that plan and act, RL becomes more central. The challenge: RL is data-hungry, unstable to train, and hard to debug, which limits its application to environments with well-defined rewards.
State of the art (2026)
RL has shifted from a niche control technique to the dominant post-training paradigm for frontier reasoning. RL-from-verifiable-rewards (RLVR) underpins OpenAI o3, DeepSeek-R1 and V3.2, Gemini Deep Think and Claude reasoning; DeepSeek-V3.2 (December 2025) trained across more than 1,800 agentic environments, and its Speciale variant took IMO and IOI gold. The frontier now runs on outcome-based RL plus test-time compute rather than process reward models. In robotics the picture is more contested: imitation learning has overtaken RL as the primary on-ramp for real-world manipulation, with RL reserved for locomotion and sim-to-real fine-tuning. The open question is whether RLVR generalises beyond verifiable domains (maths, code, tool use) into fuzzy real-world reward.
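To make "verifiable reward" concrete, here is a minimal Python sketch of an outcome-based reward. It is an illustration under stated assumptions, not any lab's actual pipeline: the `Answer:` output convention, the function names `outcome_reward` and `centered_advantages`, and the sample completions are all hypothetical. The reward looks only at the final answer and ignores the reasoning that led to it, which is what separates outcome-based RL from process reward models.

```python
import re
from statistics import mean


def outcome_reward(completion: str, reference: str) -> float:
    """Score 1.0 if the last 'Answer: <value>' in the completion equals the reference.

    Assumption: the model is prompted to end with 'Answer: <value>'.
    Only the final outcome is checked; intermediate reasoning is not scored.
    """
    matches = re.findall(r"Answer:\s*(\S+)", completion)
    if not matches:
        return 0.0  # no parseable final answer, so no reward
    predicted = matches[-1].rstrip(".")  # drop a trailing full stop
    return 1.0 if predicted == reference.strip() else 0.0


def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean so above-average samples get a positive signal.

    This is a simplified stand-in for the group-relative baselines used in
    policy-gradient post-training; real systems add normalisation and clipping.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Four hypothetical completions sampled for the prompt "What is 17 * 3?"
samples = [
    "17 * 3 = 51. Answer: 51",
    "17 * 3 is about 50. Answer: 50",
    "I think it is fifty-one.",
    "Answer: 51.",
]
rewards = [outcome_reward(s, "51") for s in samples]
advantages = centered_advantages(rewards)
print(rewards)
print(advantages)
```

The check works because arithmetic has a single reference answer. A fuzzy real-world objective has no equivalent of `reference`, which is where the open question above bites.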