Best Open-Source TTS Models 2026: Free ElevenLabs Alternatives

Jul 20, 2025

Try these models without setup

My free tool runs Kokoro with 50+ voices, plus Qwen3 for voice cloning. Paste text or markdown and get a podcast episode.


Open-source text-to-speech caught up to the paid tools this year, and most "best TTS" roundups haven't noticed. Here's the honest version: the models topping the TTS Arena leaderboard are mostly ones you can't actually buy. CastleFlow and Vocu V3.0 sit at the top with no real Western API. What's left splits into two groups that both undercut ElevenLabs: cheap hosted APIs (Inworld, MiniMax, Fish Audio) that cost up to ~7x less at list price, and open-weights models (Kokoro, Chatterbox, Qwen3-TTS) you run yourself for free. Kokoro is 82M parameters and runs on a CPU. Chatterbox clones a voice from ten seconds of audio. Below: what's worth using in mid-2026, and how to pick without reading twelve model cards.

TTS Rankings: Arena ELO, Price, and Latency (June 2026)

ELO scores come from the TTS Arena V2 blind-preference leaderboard and move week to week. The top six sit within ~13 ELO, so treat them as a tie, not a ranking. Prices are list API rates per 1M characters; latency is vendor-reported time to first audio.
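
To see why a ~13-point gap reads as a tie, here is a minimal sketch that converts rating gaps into head-to-head preference rates. It assumes the standard Elo logistic curve with a 400-point scale; the arena's exact rating method may differ, so treat the output as a rough guide rather than the leaderboard's own math. The function name `expected_win_rate` is made up for this example.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the standard Elo logistic model.

    Assumption: a 400-point scale, the common Elo convention. The arena may
    compute its ratings differently.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


# Ratings copied from the top six rows of the rankings table.
top_six = {
    "CastleFlow v1.0": 1574,
    "Vocu V3.0": 1573,
    "Inworld TTS 1.5 Max": 1572,
    "Inworld TTS 1.5 Mini": 1565,
    "Hume Octave 2": 1561,
    "Papla P1": 1561,
}

leader = max(top_six.values())
for name, elo in top_six.items():
    # How often the #1 model would be expected to beat each of the others.
    print(f"#1 vs {name}: {expected_win_rate(leader, elo):.1%}")
```

Under that assumption, the leader would be preferred over the sixth-place model only about 52% of the time, which is barely better than a coin flip.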

Rank	Model	ELO	$/1M	Latency	Notes
#1	CastleFlow v1.0	1574	-	-	Proprietary, no public API
#2	Vocu V3.0	1573	-	-	China-market, limited Western access
#3	Inworld TTS 1.5 Max	1572	$25	under 250ms	15 langs, voice cloning from 15s. Best accessible quality
#4	Inworld TTS 1.5 Mini	1565	$15	under 130ms	15 langs, cheapest low-latency option
#5	Hume Octave 2	1561	~$120	~100ms	Best emotional expressiveness (64% win rate)
#6	Papla P1	1561	-	-	API + voice cloning
#7	MiniMax Speech 2.8 Turbo	1542	~$30	under 250ms	40+ langs (arena still lists the older "02")
#8	ElevenLabs Turbo v2.5	1539	$50	~75ms	30+ langs, real-time
#9	MiniMax Speech 2.8 HD	1535	~$50	-	40+ langs, higher fidelity
#10	ElevenLabs Flash v2.5	1532	$50	~75ms	Fastest ElevenLabs tier
#11	ElevenLabs Multilingual v2	1528	$100	-	29 langs, studio polish
#12	Chatterbox	1518	Free	-	Open-source (MIT), best cloning on a small GPU
#13	Cartesia Sonic 2	1513	~$35	~90ms	Low-latency specialist (Sonic 3 shipped but isn't ranked yet)
n/r	Fish Audio S2	n/r	$15*	~200ms	80+ langs, 10s cloning, open-weights (research license). Not on the arena
n/r	Kokoro	n/r	Free	-	Open-source (Apache 2.0), 82M params, runs on CPU. See below

* Fish prices per 1M UTF-8 bytes, which is roughly per-character for English but about 3x that for Chinese or Japanese text (3 bytes per character) and 4x for most emoji.
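
Byte-based billing is easy to check before you commit. The sketch below counts UTF-8 bytes for a few sample strings and prices them at the ~$15 per 1M bytes list rate from the table. The helper names (`billable_bytes`, `cost_usd`) are made up for this example and are not part of any Fish Audio SDK.

```python
def billable_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, the unit a per-byte price applies to."""
    return len(text.encode("utf-8"))


def cost_usd(text: str, price_per_million_bytes: float = 15.0) -> float:
    """Estimated cost at a per-1M-bytes rate (default: the ~$15 list price above)."""
    return billable_bytes(text) / 1_000_000 * price_per_million_bytes


samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
    "Japanese": "こんにちは",
    "Emoji": "🔥🎧",
}

for label, text in samples.items():
    chars = len(text)
    size = billable_bytes(text)
    # bytes/char shows how much more a character costs than plain ASCII.
    print(f"{label}: {chars} chars, {size} bytes, {size / chars:.1f} bytes/char")
```

Real-world text mixes in ASCII punctuation, digits, and spaces, so the effective multiplier for a Chinese or Japanese script usually lands somewhat below the per-character maximum.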

Two things the leaderboard won't tell you. ElevenLabs shipped v3 in March 2026, its most expressive model yet, but ElevenLabs themselves say it isn't for real-time use, and it isn't on the arena yet. And leaderboard rank no longer maps to what you can buy: ElevenLabs remains the default people reach for despite ranking 8th to 11th, because the top entries either have no Western API or no track record.

How to Choose

Pick the one constraint that actually binds you. Most people optimize the wrong axis.

Cost. Free and local: Kokoro or Chatterbox. Cheap hosted with voice cloning: Fish Audio S2 (~$15/1M, 80+ langs) or Inworld Mini ($15/1M). The two hosted options cost about 15% of ElevenLabs Multilingual v2's $100/1M.
Latency. Sub-100ms: ElevenLabs Flash or Turbo (~75ms), Cartesia Sonic 2 (~90ms), or Qwen3-TTS (97ms streaming). Sub-150ms: Hume Octave 2 (~100ms) or Inworld Mini (under 130ms).
Quality you can actually access. Inworld TTS 1.5 Max leads the accessible field at #3 on the arena. For emotion and delivery, Hume Octave 2 wins more blind votes than anything else in the top tier.
Voice cloning. Covered in its own section below, because the "best" depends entirely on your license and your GPU.
Runs on my hardware. Kokoro on a CPU or a 3GB card. Chatterbox-Turbo or Qwen3-TTS if you have a gaming GPU.
Many languages. MiniMax 2.8 and Fish Audio S2 both clear 40+ (Fish claims 80+, though only Japanese, English, and Chinese hit top-tier quality). Local: Qwen3-TTS or CosyVoice 3.
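
If cost is the binding constraint, the comparison is simple arithmetic. Here is a rough sketch using the list prices from the rankings table and a hypothetical workload of 5M characters a month; swap in your own volume. Fish Audio S2 is left out because it bills per UTF-8 byte rather than per character, and the "~" prices in the table are treated as exact for simplicity.

```python
# List prices in USD per 1M characters, copied from the rankings table.
# Entries marked "~" in the table are approximate.
PRICE_PER_MILLION = {
    "Inworld TTS 1.5 Mini": 15,
    "Inworld TTS 1.5 Max": 25,
    "MiniMax Speech 2.8 Turbo": 30,
    "Cartesia Sonic 2": 35,
    "ElevenLabs Turbo v2.5": 50,
    "ElevenLabs Multilingual v2": 100,
    "Hume Octave 2": 120,
}


def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """Cost in USD for a month's worth of characters at a per-1M-characters rate."""
    return chars_per_month / 1_000_000 * price_per_million


chars = 5_000_000  # hypothetical workload, not a figure from the article

for name, price in sorted(PRICE_PER_MILLION.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${monthly_cost(chars, price):,.2f}/month")
```

List prices ignore subscription tiers, free quotas, and volume discounts, so use this for a first-pass ranking only.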

Best Open-Source TTS Models (Run Locally)

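Most models here run on a gaming laptop (RTX 3060 or better), and Kokoro and Piper run without a GPU at all. The exceptions are Fish Audio S2 (12-24GB) and Chatterbox Multilingual (8-16GB), which want a bigger card. Apple Silicon (M1-M4) handles Kokoro and Chatterbox through MPS.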

Model	Params	VRAM	Speed	Voice cloning	License
Kokoro	82M	2-3GB	~200x real-time (4090), runs on CPU	No (54 preset voices)	Apache 2.0
Chatterbox-Turbo	350M	~6GB	~2x real-time, ~470ms first chunk	Yes (7-10s)	MIT
Chatterbox Multilingual	0.5B	8-16GB	~real-time on GPU	Yes (7-10s), 23 langs	MIT
Qwen3-TTS	0.6B / 1.7B	4-8GB	97ms streaming	Yes (3s)	Apache 2.0
CosyVoice 3.0	0.5B	~4GB	150ms streaming	Yes (zero-shot)	Apache 2.0
Fish Audio S2	4B	12-24GB	RTF 0.195 (datacenter GPU)	Yes (10-30s)	Research license (commercial = paid)
Voxtral TTS (Mistral)	-	-	streaming	Yes (3s)	Open-weight
F5-TTS	~0.3B	~4GB	7x real-time (33x Fast)	Yes	MIT code (pretrained weights CC-BY-NC)
VibeVoice (Microsoft)	1.5B	~8GB	long-form podcast generation	Multi-speaker	MIT (research only)
Piper	tiny	CPU / RPi	edge real-time	No	MIT
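
The table can be turned into a quick shortlist filter. This is a sketch, not a benchmark: it takes the upper end of each VRAM range as a conservative requirement, treats Kokoro and Piper as needing no GPU, and leaves out Voxtral and VibeVoice because the table gives no VRAM figure or no clear yes/no on cloning for them. The `LocalModel` class and `shortlist` function are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalModel:
    name: str
    vram_gb: float  # upper end of the table's range; 0 means it runs on CPU
    clones: bool    # whether the table lists voice cloning


# Values transcribed from the table above (conservative upper bounds).
MODELS = [
    LocalModel("Kokoro", 0, False),
    LocalModel("Chatterbox-Turbo", 6, True),
    LocalModel("Chatterbox Multilingual", 16, True),
    LocalModel("Qwen3-TTS", 8, True),
    LocalModel("CosyVoice 3.0", 4, True),
    LocalModel("Fish Audio S2", 24, True),
    LocalModel("F5-TTS", 4, True),
    LocalModel("Piper", 0, False),
]


def shortlist(vram_budget_gb: float, need_cloning: bool) -> list[str]:
    """Names of models that fit the VRAM budget and meet the cloning requirement."""
    return [
        m.name
        for m in MODELS
        if m.vram_gb <= vram_budget_gb and (m.clones or not need_cloning)
    ]


print(shortlist(vram_budget_gb=6, need_cloning=True))
print(shortlist(vram_budget_gb=0, need_cloning=False))
```

Because the filter uses upper bounds, it errs on the strict side: the smaller Qwen3-TTS variant may fit on a 6GB card even though the filter excludes it. License terms are not modeled here, so check them separately before shipping.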

Kokoro is still the efficiency champion: 82M parameters, 8 languages, 54 baked-in voices, and it'll do 36x real-time on a free Colab T4 or 5x on a 32-core CPU. One catch the old guides get wrong - it can't clone a voice. It ships fixed voicepacks and nothing else. If you need your own voice, you need a different model.
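
The Speed column mixes two conventions: "Nx real-time" (higher is faster) and real-time factor, or RTF (lower is faster). They are reciprocals of each other. The sketch below puts the table's figures on one scale by estimating how long an hour of audio takes to generate. The helper names are made up for this example, and the figures are the article's, measured on different hardware, so they are not directly comparable across models.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Convert real-time factor (compute seconds per audio second) to 'Nx real-time'."""
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate the given length of audio."""
    return audio_seconds / speedup


ONE_HOUR = 3600

# Speed figures quoted in this section, each on its own hardware.
speedups = {
    "Kokoro, RTX 4090 (~200x)": 200,
    "Kokoro, Colab T4 (36x)": 36,
    "Kokoro, 32-core CPU (5x)": 5,
    "F5-TTS (7x)": 7,
    "Fish Audio S2 (RTF 0.195)": speedup_from_rtf(0.195),
}

for label, speedup in speedups.items():
    seconds = synthesis_seconds(ONE_HOUR, speedup)
    print(f"{label}: {speedup:.1f}x real-time, ~{seconds / 60:.1f} min per hour of audio")
```

On these numbers, Fish Audio S2's RTF of 0.195 works out to roughly 5x real-time on a datacenter GPU, about the same throughput Kokoro gets from a 32-core CPU.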

Best Open-Source Voice Cloning (and the "vs ElevenLabs" question)

The cloning crown depends on what you're willing to license and run:

Best on a gaming GPU: Chatterbox-Turbo. Resemble AI's blind test had listeners prefer it over ElevenLabs Turbo 65% of the time. That's a vendor-run study, so weight it accordingly, but the model genuinely clones from ten seconds of reference audio and fits in ~6GB. For "Chatterbox vs Kokoro," it isn't close - Kokoro doesn't clone, so Chatterbox wins by default.
Best permissive license: Qwen3-TTS. Clones from a 3-second sample, Apache 2.0, no commercial strings. Alibaba's suite (0.6B and 1.7B variants) posts some of the lowest word-error rates among open models. If you're shipping a product, this is the safe pick - and there's a full Qwen3-TTS voice-cloning walkthrough if you want the setup steps.
Best raw accuracy: Fish Audio S2. It posts the lowest word-error rate on Seed-TTS-Eval per Fish's own technical report. The trade-offs are real: it's a 4B model that wants a 12GB+ card, and the weights are free for research only - commercial use requires a paid license from Fish. Their hosted API sidesteps the self-hosting at ~$15/1M.

About that "X% preferred over ElevenLabs" number you'll see everywhere: every one of them (Chatterbox's 65%, Voxtral's 63%, Fish's 66%) comes from a benchmark the model's own maker ran. Fish Audio's March 2026 blind test, for instance, ranked S2 first at 65.7%, but the listeners were sampled from Fish's own platform. Useful signal, not independent proof.
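
One way to sanity-check a vendor number is to set it against the arena ratings from the rankings table. The sketch below assumes the standard Elo logistic model with a 400-point scale, which may not be exactly how the arena computes ratings. The arena's "Chatterbox" entry may also not be the same build as Chatterbox-Turbo, so this is a rough cross-check, not a refutation. Function names are made up for this example.

```python
import math


def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def elo_gap_from_win_rate(p: float, scale: float = 400.0) -> float:
    """Rating gap that would produce win rate p under the same model (0 < p < 1)."""
    return scale * math.log10(p / (1.0 - p))


# Arena ratings from the rankings table.
chatterbox_elo = 1518
elevenlabs_turbo_elo = 1539

# Vendor-reported preference rate from Resemble AI's own blind test.
vendor_claim = 0.65

arena_rate = expected_win_rate(chatterbox_elo, elevenlabs_turbo_elo)
implied_gap = elo_gap_from_win_rate(vendor_claim)

print(f"Arena-implied Chatterbox win rate: {arena_rate:.1%}")
print(f"Elo lead a 65% win rate would imply: {implied_gap:.0f} points")
```

Under these assumptions the arena ratings imply Chatterbox wins a bit under half the time against ElevenLabs Turbo v2.5, while a 65% preference rate would correspond to a lead of more than 100 points. Different model versions, prompts, and listener pools can all explain part of that gap.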

Best TTS for Each Use Case

Use case	Hosted pick	Local pick
Audiobooks / long narration	ElevenLabs Multilingual v2	Chatterbox
Real-time voice agent	Inworld Mini ($15) or Cartesia Sonic	Kokoro or Qwen3-TTS
Voice cloning project	Fish Audio S2 ($15, 10s) or Inworld	Chatterbox-Turbo or Qwen3-TTS
Multilingual (40+ langs)	MiniMax 2.8 or Fish Audio S2 (80+)	Qwen3-TTS or CosyVoice 3
CPU / laptop tinkering	-	Kokoro (82M, no GPU needed)
Most expressive / emotional	Hume Octave 2	Chatterbox

Hear these models before you install anything

My free tool runs Kokoro (50+ voices) and Qwen3 voice cloning right in the browser - no Python, no GPU. Paste an article or drop in an ePub, pick a voice, and download a podcast episode.


Bottom Line

Best free + local: Kokoro for speed and CPU deployment, Chatterbox-Turbo when you need cloning on a gaming GPU.
Best permissive cloning: Qwen3-TTS (Apache 2.0, clones from 3 seconds).
Best cheap hosted with cloning: Fish Audio S2 (~$15/1M, 80+ langs) or Inworld Mini.
ElevenLabs: still the polish-and-ecosystem default, and v3 is the one to use for studio work, but it no longer leads on price or blind-test quality.
