Technology thesis · Artificial Intelligence

medium conviction growth

Speech AI and voice synthesis

Voice is now production infrastructure, not a demo: sub-100ms speech-to-speech agents are displacing tier-1 BPO work; deepfake fraud and unresolved music-IP rulings are the live limits on scale.

Position maintained continuously · last reviewed Jun 24, 2026

The thesis

Core thesis

Voice synthesis has crossed the uncanny valley: ElevenLabs and Play.ht produce voices indistinguishable from real humans. Applications include audiobook production (where costs drop by 90%), real-time translation and accessibility. The same technology also powers voice deepfake attacks, including a $25M wire-transfer fraud carried out with a cloned CFO voice. It enables and threatens in equal measure. The market is growing at more than 15% annually.

Core thesis

Voice synthesis crossed the uncanny valley years ago; the live question in 2026 is deployment, not naturalness. Real-time speech-to-speech is the default architecture – OpenAI's gpt-realtime reached GA in August 2025, Cartesia's Sonic-3.5 ships sub-100ms synthesis, and ElevenLabs raised a $500M Series D at an $11B valuation in February 2026. Voice agents now handle tier-1 contact-centre work end to end. The same capability cuts both ways: deepfake voice fraud (a CFO-clone authorised a $25M wire transfer on a live call) is driving authentication mandates, and AI music generation faces unresolved IP rulings. The structural growth is real; regulation and fraud liability are the live limits on scale.

State of the art (2026)

Real-time speech-to-speech is now the default architecture, not a research demo. OpenAI's gpt-realtime reached general availability in August 2025, and specialists ship sub-100ms synthesis – Cartesia's Sonic-3.5 and ElevenLabs' multilingual stack lead, with ElevenLabs raising a $500M Series D in February 2026 at an $11B valuation. The frontier has moved from naturalness to deployable voice agents handling tier-1 contact-centre work end to end. Two constraints now bind commercial scale: deepfake voice fraud (the $25M CFO-clone wire transfer set the template) driving authentication mandates, and music-IP exposure – Universal settled with Udio in October 2025 and Warner with Suno in November 2025, while a fair-use ruling in Sony's case against both is expected in summer 2026.

The rest of the file

Everything below is live inside CanaryIQ

The full analysis behind the verdict — the structure is real; the content unlocks when you log in.

Signal stack

Evidence stacked leading → lagging

9 signals
talent
research
patent
expert
operational
regulatory
market

Technology-native KPIs

Metrics that predict trajectory, tracked over time

4 tracked
Speech and voice recognition market size
TTS API calls per day
Voice clone detection accuracy
Languages supported by leading TTS

Landscape map

Who builds what — and who depends on whom

93 players · 6 layers

Catalyst calendar

Dated events that will move the position

6 ahead

Technology roadmap

Milestones on the path to maturity

8 milestones

Watchlists

Companies, people and papers — each with a remove-by condition

Companies · 20
People · 18

Decision frameworks

The same call, framed for your desk

Public Equity
PE / VC
Corporate Leader

Thesis changelog

When our view changed, and why

4 updates

Change our mind

3 disconfirming conditions
