Best Open-Source TTS Models 2026: Free ElevenLabs Alternatives

Jul 20, 2025
Try these models without setup

My free tool runs Kokoro with 50+ voices, plus Qwen3 for voice cloning. Paste text or markdown and get a podcast episode.

Open-source text-to-speech caught up to the paid tools this year, and most "best TTS" roundups haven't noticed. Here's the honest version: the models topping the TTS Arena leaderboard are mostly ones you can't actually buy. CastleFlow and Vocu V3.0 sit at the top with no real Western API. What's left splits into two groups that both undercut ElevenLabs: cheap hosted APIs (Inworld, MiniMax, Fish Audio) that run up to roughly 7x cheaper at the list prices below, and open-weights models (Kokoro, Chatterbox, Qwen3-TTS) you run yourself for free. Kokoro is 82M parameters and runs on a CPU. Chatterbox clones a voice from ten seconds of audio. Below: what's worth using in mid-2026, and how to pick without reading twelve model cards.

TTS Rankings: Arena ELO, Price, and Latency (June 2026)

ELO scores come from the TTS Arena V2 blind-preference leaderboard and move week to week. The top six sit within ~13 ELO, so treat them as a tie, not a ranking. Prices are list API rates per 1M characters; latency is vendor-reported time to first audio.

| Rank | Model | ELO | $/1M | Latency | Notes |
|---|---|---|---|---|---|
| #1 | CastleFlow v1.0 | 1574 | - | - | Proprietary, no public API |
| #2 | Vocu V3.0 | 1573 | - | - | China-market, limited Western access |
| #3 | Inworld TTS 1.5 Max | 1572 | $25 | under 250ms | 15 langs, voice cloning from 15s. Best accessible quality |
| #4 | Inworld TTS 1.5 Mini | 1565 | $15 | under 130ms | 15 langs, cheapest low-latency option |
| #5 | Hume Octave 2 | 1561 | ~$120 | ~100ms | Best emotional expressiveness (64% win rate) |
| #6 | Papla P1 | 1561 | - | - | API + voice cloning |
| #7 | MiniMax Speech 2.8 Turbo | 1542 | ~$30 | under 250ms | 40+ langs (arena still lists the older "02") |
| #8 | ElevenLabs Turbo v2.5 | 1539 | $50 | ~75ms | 30+ langs, real-time |
| #9 | MiniMax Speech 2.8 HD | 1535 | ~$50 | - | 40+ langs, higher fidelity |
| #10 | ElevenLabs Flash v2.5 | 1532 | $50 | ~75ms | Fastest ElevenLabs tier |
| #11 | ElevenLabs Multilingual v2 | 1528 | $100 | - | 29 langs, studio polish |
| #12 | Chatterbox | 1518 | Free | - | Open-source (MIT), best cloning on a small GPU |
| #13 | Cartesia Sonic 2 | 1513 | ~$35 | ~90ms | Lowest latency (Sonic 3 shipped but isn't ranked yet) |
| n/r | Fish Audio S2 | n/r | $15* | ~200ms | 80+ langs, 10s cloning, open-weights (research license). Not on the arena |
| n/r | Kokoro | n/r | Free | - | Open-source (Apache 2.0), 82M params, runs on CPU. See below |

* Fish prices per 1M UTF-8 bytes, which is roughly per-character for English but 2-3x that for Chinese, Japanese, or emoji-heavy text.
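To see what byte-based billing does to a bill, here is a minimal sketch that counts UTF-8 bytes for a few sample strings. The helper names `billed_bytes` and `cost_usd` are made up for illustration, and the $15 default is just the list rate from the table; check Fish's own pricing page before budgeting on it.

```python
def billed_bytes(text: str) -> int:
    """UTF-8 byte count, the unit the footnote says Fish bills on."""
    return len(text.encode("utf-8"))


def cost_usd(text: str, usd_per_million_bytes: float = 15.0) -> float:
    """Estimated cost at a per-1M-bytes rate (default taken from the table, an assumption)."""
    return billed_bytes(text) / 1_000_000 * usd_per_million_bytes


samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",
    "Japanese": "こんにちは",
    "Emoji": "👍🎉",
}

for label, text in samples.items():
    chars = len(text)  # Python counts Unicode code points here
    size = billed_bytes(text)
    print(f"{label}: {chars} chars, {size} bytes, {size / chars:.1f} bytes per char")
```

The English sample comes out at one byte per character, the Chinese and Japanese samples at three, and the two emoji at four each, so the same character count can cost several times more depending on the script.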

Two things the leaderboard won't tell you. ElevenLabs shipped v3 in March 2026, its most expressive model yet, but ElevenLabs themselves say it isn't for real-time use, and it isn't on the arena yet. And leaderboard rank no longer maps to what you can buy: ElevenLabs remains the default people reach for despite ranking 8th to 11th, because the top entries either have no Western API or no track record.
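If you want a feel for why a ~13 ELO spread across the top six counts as a tie, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes the standard Elo logistic curve with a 400-point scale; I haven't confirmed that TTS Arena uses exactly this formula, so treat the output as a rough guide.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Expected share of blind comparisons that A wins over B under standard Elo (assumed)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Ratings copied from the table: #1 CastleFlow v1.0 vs #6 Papla P1.
top, sixth = 1574, 1561
print(f"{expected_win_rate(top, sixth):.1%}")
```

Under that assumption a 13-point gap works out to roughly a 52/48 split, which is close enough to a coin flip that weekly movement can reorder the top six.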

How to Choose

Pick the one constraint that actually binds you. Most people optimize the wrong axis.
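One way to make that concrete: write down your hard limits and filter the rankings table by them. This is a minimal sketch, not a recommendation engine. The `HostedModel` class and `shortlist` function are hypothetical names, the rows are copied from the table above (only the ones that list both a price and a latency), and "under 250ms" is entered as 250.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HostedModel:
    name: str
    elo: Optional[int]   # None = not ranked on the arena
    usd_per_1m: float    # list price per 1M characters ("~" values are approximate)
    latency_ms: int      # vendor-reported time to first audio, upper bound where given


MODELS = [
    HostedModel("Inworld TTS 1.5 Max", 1572, 25, 250),
    HostedModel("Inworld TTS 1.5 Mini", 1565, 15, 130),
    HostedModel("Hume Octave 2", 1561, 120, 100),
    HostedModel("MiniMax Speech 2.8 Turbo", 1542, 30, 250),
    HostedModel("ElevenLabs Turbo v2.5", 1539, 50, 75),
    HostedModel("ElevenLabs Flash v2.5", 1532, 50, 75),
    HostedModel("Cartesia Sonic 2", 1513, 35, 90),
    HostedModel("Fish Audio S2", None, 15, 200),  # priced per 1M UTF-8 bytes, not characters
]


def shortlist(max_usd_per_1m: Optional[float] = None,
              max_latency_ms: Optional[int] = None) -> List[HostedModel]:
    """Keep models inside both limits, highest ELO first, unranked last."""
    picks = [
        m for m in MODELS
        if (max_usd_per_1m is None or m.usd_per_1m <= max_usd_per_1m)
        and (max_latency_ms is None or m.latency_ms <= max_latency_ms)
    ]
    return sorted(picks, key=lambda m: (-(m.elo or 0), m.usd_per_1m))


# Example: a voice agent on a tight budget.
for m in shortlist(max_usd_per_1m=20, max_latency_ms=150):
    print(m.name)
```

With a $20 ceiling and a 150ms latency limit, only Inworld TTS 1.5 Mini survives. Loosen either limit and the list grows, which is a quick way to see which constraint is really doing the work.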

Best Open-Source TTS Models (Run Locally)

Every model here runs on a gaming laptop (RTX 3060 or better), and Kokoro runs without a GPU at all. Apple Silicon (M1-M4) handles Kokoro and Chatterbox through MPS.

| Model | Params | VRAM | Speed | Voice cloning | License |
|---|---|---|---|---|---|
| Kokoro | 82M | 2-3GB | ~200x real-time (4090), runs on CPU | No (54 preset voices) | Apache 2.0 |
| Chatterbox-Turbo | 350M | ~6GB | ~2x real-time, ~470ms first chunk | Yes (7-10s) | MIT |
| Chatterbox Multilingual | 0.5B | 8-16GB | ~real-time on GPU | Yes (7-10s), 23 langs | MIT |
| Qwen3-TTS | 0.6B / 1.7B | 4-8GB | 97ms streaming | Yes (3s) | Apache 2.0 |
| CosyVoice 3.0 | 0.5B | ~4GB | 150ms streaming | Yes (zero-shot) | Apache 2.0 |
| Fish Audio S2 | 4B | 12-24GB | RTF 0.195 (datacenter GPU) | Yes (10-30s) | Research license (commercial = paid) |
| Voxtral TTS (Mistral) | - | - | streaming | Yes (3s) | Open-weight |
| F5-TTS | ~0.3B | ~4GB | 7x real-time (33x Fast) | Yes | MIT |
| VibeVoice (Microsoft) | 1.5B | ~8GB | long-form podcast generation | Multi-speaker | MIT (research only) |
| Piper | tiny | CPU / RPi | edge real-time | No | MIT |

Kokoro is still the efficiency champion: 82M parameters, 8 languages, 54 baked-in voices, and it'll do 36x real-time on a free Colab T4 or 5x on a 32-core CPU. One catch the old guides get wrong - it can't clone a voice. It ships fixed voicepacks and nothing else. If you need your own voice, you need a different model.
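The "Nx real-time" figures are easier to judge as wall-clock time. The sketch below turns the speed multiples quoted above into how long one hour of finished audio would take to generate, and converts Fish's RTF into the same unit. It assumes RTF means processing time divided by audio duration (the usual convention, though I haven't checked how Fish measured it), and the function names are illustrative.

```python
def synthesis_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock seconds to generate audio at a given 'Nx real-time' speed."""
    return audio_seconds / speed_multiple


def rtf_to_speed_multiple(rtf: float) -> float:
    """Convert a real-time factor (compute time / audio time, assumed) to 'Nx real-time'."""
    return 1.0 / rtf


ONE_HOUR = 3600.0
setups = [
    ("Kokoro on an RTX 4090 (~200x)", 200.0),
    ("Kokoro on a Colab T4 (36x)", 36.0),
    ("Kokoro on a 32-core CPU (5x)", 5.0),
    ("Fish Audio S2 at RTF 0.195", rtf_to_speed_multiple(0.195)),
]

for label, speed in setups:
    seconds = synthesis_seconds(ONE_HOUR, speed)
    print(f"{label}: {seconds:.0f}s for one hour of audio")
```

By this arithmetic an hour of narration takes Kokoro about 18 seconds on a 4090, 100 seconds on a T4, and 12 minutes on the CPU. These are throughput numbers only; they say nothing about time to first audio, which is what matters for a voice agent.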

Best Open-Source Voice Cloning (and the "vs ElevenLabs" question)

The cloning crown depends on what you're willing to license and run:

About that "X% preferred over ElevenLabs" number you'll see everywhere: every one of them (Chatterbox's 65%, Voxtral's 63%, Fish's 66%) comes from a benchmark the model's own maker ran. Fish Audio's March 2026 blind test, for instance, ranked S2 first at 65.7%, but the listeners were sampled from Fish's own platform. Useful signal, not independent proof.
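A quick sanity check on those vendor numbers: convert a claimed preference rate into the rating gap it would imply, then compare with what the arena shows. This is a back-of-envelope sketch that assumes the standard Elo logistic curve with a 400-point scale. The vendor benchmarks don't all say which ElevenLabs model they compared against, so picking ElevenLabs Turbo v2.5 here is my assumption.

```python
import math


def implied_elo_gap(win_rate: float) -> float:
    """Rating gap that would produce this head-to-head win rate under standard Elo (assumed)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Vendor-reported: Chatterbox preferred 65% of the time over ElevenLabs.
vendor_gap = implied_elo_gap(0.65)

# Arena ratings from the rankings table: Chatterbox 1518, ElevenLabs Turbo v2.5 1539.
arena_gap = 1518 - 1539

print(f"vendor claim implies {vendor_gap:+.0f} ELO, arena shows {arena_gap:+d}")
```

A 65% preference would correspond to a lead of roughly 108 points, while the arena has Chatterbox about 21 points behind. Different listeners, prompts, and model versions can explain a lot of that, which is exactly why self-run benchmarks deserve the discount.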

Best TTS for Each Use Case

| Use case | Hosted pick | Local pick |
|---|---|---|
| Audiobooks / long narration | ElevenLabs Multilingual v2 | Chatterbox |
| Real-time voice agent | Inworld Mini ($15) or Cartesia Sonic | Kokoro or Qwen3-TTS |
| Voice cloning project | Fish Audio S2 ($15, 10s) or Inworld | Chatterbox-Turbo or Qwen3-TTS |
| Multilingual (40+ langs) | MiniMax 2.8 or Fish Audio S2 (80+) | Qwen3-TTS or CosyVoice 3 |
| CPU / laptop tinkering | - | Kokoro (82M, no GPU needed) |
| Most expressive / emotional | Hume Octave 2 | Chatterbox |

Hear these models before you install anything

My free tool runs Kokoro (50+ voices) and Qwen3 voice cloning right in the browser - no Python, no GPU. Paste an article or drop in an ePub, pick a voice, and download a podcast episode.

Bottom Line