My free tool runs Kokoro with 50+ voices, plus Qwen3 for voice cloning. Paste text or markdown and get a podcast episode.
Open-source text-to-speech caught up to the paid tools this year, and most "best TTS" roundups haven't noticed. Here's the honest version: the models topping the TTS Arena leaderboard are mostly ones you can't actually buy. CastleFlow and Vocu V3.0 sit at the top with no real Western API. What's left splits into two groups that both undercut ElevenLabs: cheap hosted APIs (Inworld, MiniMax, Fish Audio) that start at $15 per 1M characters against ElevenLabs' $50-$100 list rates, and open-weights models (Kokoro, Chatterbox, Qwen3-TTS) you run yourself for free. Kokoro is 82M parameters and runs on a CPU. Chatterbox clones a voice from ten seconds of audio. Below: what's worth using in mid-2026, and how to pick without reading twelve model cards.
ELO scores come from the TTS Arena V2 blind-preference leaderboard and move week to week. The top six sit within ~13 ELO, so treat them as a tie, not a ranking. Prices are list API rates per 1M characters; latency is vendor-reported time to first audio.
| Rank | Model | ELO | $/1M chars | Latency (first audio) | Notes |
|---|---|---|---|---|---|
| #1 | CastleFlow v1.0 | 1574 | - | - | Proprietary, no public API |
| #2 | Vocu V3.0 | 1573 | - | - | China-market, limited Western access |
| #3 | Inworld TTS 1.5 Max | 1572 | $25 | under 250ms | 15 langs, voice cloning from 15s. Best accessible quality |
| #4 | Inworld TTS 1.5 Mini | 1565 | $15 | under 130ms | 15 langs, cheapest low-latency option |
| #5 | Hume Octave 2 | 1561 | ~$120 | ~100ms | Best emotional expressiveness (64% win rate) |
| #6 | Papla P1 | 1561 | - | - | API + voice cloning |
| #7 | MiniMax Speech 2.8 Turbo | 1542 | ~$30 | under 250ms | 40+ langs (arena still lists the older "02") |
| #8 | ElevenLabs Turbo v2.5 | 1539 | $50 | ~75ms | 30+ langs, real-time |
| #9 | MiniMax Speech 2.8 HD | 1535 | ~$50 | - | 40+ langs, higher fidelity |
| #10 | ElevenLabs Flash v2.5 | 1532 | $50 | ~75ms | Fastest ElevenLabs tier |
| #11 | ElevenLabs Multilingual v2 | 1528 | $100 | - | 29 langs, studio polish |
| #12 | Chatterbox | 1518 | Free | - | Open-source (MIT), best cloning on a small GPU |
| #13 | Cartesia Sonic 2 | 1513 | ~$35 | ~90ms | Lowest latency (Sonic 3 shipped but isn't ranked yet) |
| n/r | Fish Audio S2 | n/r | $15* | ~200ms | 80+ langs, 10s cloning, open-weights (research license). Not on the arena |
| n/r | Kokoro | n/r | Free | - | Open-source (Apache 2.0), 82M params, runs on CPU. See below |
* Fish prices per 1M UTF-8 bytes, not characters. For plain English that's roughly the same thing, but Chinese and Japanese characters take 3 bytes each and most emoji take 4, so the same character count costs about 3-4x more.
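To see what byte-based billing does to a bill, here's a rough sketch in Python. It assumes the provider counts raw UTF-8 bytes the way `str.encode` produces them and that a "character" means a Unicode code point; `billable_units` and `cost_usd` are names made up for this example, not part of any vendor SDK.

```python
def billable_units(text: str) -> dict:
    """Count a text both ways: Unicode code points and UTF-8 bytes."""
    chars = len(text)                       # code points, not grapheme clusters
    utf8_bytes = len(text.encode("utf-8"))  # what a per-byte price would meter
    ratio = utf8_bytes / chars if chars else 0.0
    return {"chars": chars, "bytes": utf8_bytes, "ratio": ratio}


def cost_usd(units: int, rate_per_million: float) -> float:
    """Cost of `units` characters (or bytes) at a per-1M list rate."""
    return units / 1_000_000 * rate_per_million


if __name__ == "__main__":
    samples = {
        "english": "Hello, world",
        "chinese": "你好，世界",
        "japanese": "こんにちは世界",
        "emoji": "🎧🔊",
    }
    for name, text in samples.items():
        u = billable_units(text)
        print(f"{name:9s} chars={u['chars']:3d} bytes={u['bytes']:3d} "
              f"bytes/char={u['ratio']:.1f}")
```

Run it on a sample of your own script before comparing a per-byte price against a per-character one. Mixed text (English with a few CJK names, say) lands somewhere between 1x and 3x.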
Two things the leaderboard won't tell you. ElevenLabs shipped v3 in March 2026, its most expressive model yet, but ElevenLabs themselves say it isn't for real-time use, and it isn't on the arena yet. And leaderboard rank no longer maps to what you can buy: ElevenLabs remains the default people reach for despite ranking 8th to 11th, because the top entries either have no Western API or no track record.
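On the "treat the top six as a tie" point: a small ELO gap translates into a head-to-head win rate barely above a coin flip. The sketch below uses the conventional Elo expected-score formula with a 400-point scale. That the arena's ratings follow exactly this curve is an assumption on my part, and `expected_win_rate` is just an illustrative helper.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    # Scores taken from the table above.
    first, sixth = 1574, 1561          # CastleFlow v1.0 vs Papla P1
    elevenlabs_turbo = 1539            # ElevenLabs Turbo v2.5

    print(f"#1 vs #6:  {expected_win_rate(first, sixth):.3f}")
    print(f"#1 vs #8:  {expected_win_rate(first, elevenlabs_turbo):.3f}")
```

Under that assumption, a 13-point gap works out to roughly a 52% preference rate and a 35-point gap to about 55%, which is why small week-to-week moves shouldn't drive a purchasing decision.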
Pick the one constraint that actually binds you. Most people optimize the wrong axis.
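Most models here run on a gaming laptop (RTX 3060 or better), and Kokoro runs without a GPU at all. The exceptions are the heavier checkpoints: Fish Audio S2 wants 12-24GB of VRAM and Chatterbox Multilingual 8-16GB. Apple Silicon (M1-M4) handles Kokoro and Chatterbox through MPS.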
| Model | Params | VRAM | Speed | Voice cloning | License |
|---|---|---|---|---|---|
| Kokoro | 82M | 2-3GB | ~200x real-time (4090), runs on CPU | No (54 preset voices) | Apache 2.0 |
| Chatterbox-Turbo | 350M | ~6GB | ~2x real-time, ~470ms first chunk | Yes (7-10s) | MIT |
| Chatterbox Multilingual | 0.5B | 8-16GB | ~real-time on GPU | Yes (7-10s), 23 langs | MIT |
| Qwen3-TTS | 0.6B / 1.7B | 4-8GB | 97ms streaming | Yes (3s) | Apache 2.0 |
| CosyVoice 3.0 | 0.5B | ~4GB | 150ms streaming | Yes (zero-shot) | Apache 2.0 |
| Fish Audio S2 | 4B | 12-24GB | RTF 0.195 (datacenter GPU) | Yes (10-30s) | Research license (commercial = paid) |
| Voxtral TTS (Mistral) | - | - | streaming | Yes (3s) | Open-weight |
| F5-TTS | ~0.3B | ~4GB | 7x real-time (33x Fast) | Yes | MIT |
| VibeVoice (Microsoft) | 1.5B | ~8GB | long-form podcast generation | Multi-speaker | MIT (research only) |
| Piper | tiny | CPU / RPi | edge real-time | No | MIT |
Kokoro is still the efficiency champion: 82M parameters, 8 languages, 54 baked-in voices, and it'll do 36x real-time on a free Colab T4 or 5x on a 32-core CPU. One catch the old guides get wrong - it can't clone a voice. It ships fixed voicepacks and nothing else. If you need your own voice, you need a different model.
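The speed column mixes two notations: "Nx real-time" and real-time factor (RTF). A minimal sketch for converting between them and turning either into wall-clock time, assuming RTF is defined the usual way (processing time divided by audio duration). The helper names are hypothetical, and real throughput will vary with batch size, text length, and hardware.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Convert real-time factor (compute time / audio time) to 'Nx real-time'."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    audiobook = 10 * 3600  # a 10-hour audiobook, in seconds

    # Figures quoted in this article; treat them as vendor/community numbers.
    scenarios = {
        "Kokoro on Colab T4 (36x)": 36.0,
        "Kokoro on 32-core CPU (5x)": 5.0,
        "Fish Audio S2 (RTF 0.195)": speedup_from_rtf(0.195),
    }
    for name, speedup in scenarios.items():
        minutes = synthesis_seconds(audiobook, speedup) / 60
        print(f"{name:28s} {speedup:5.1f}x  ->  {minutes:6.1f} min")
```

Anything above 1x keeps up with playback. For batch jobs like audiobooks, the difference between 5x and 36x is the difference between a two-hour run and a coffee break.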
The cloning crown depends on what you're willing to license and run:
About that "X% preferred over ElevenLabs" number you'll see everywhere: every one of them (Chatterbox's 65%, Voxtral's 63%, Fish's 66%) comes from a benchmark the model's own maker ran. Fish Audio's March 2026 blind test, for instance, ranked S2 first at 65.7%, but the listeners were sampled from Fish's own platform. Useful signal, not independent proof.
| Use case | Hosted pick | Local pick |
|---|---|---|
| Audiobooks / long narration | ElevenLabs Multilingual v2 | Chatterbox |
| Real-time voice agent | Inworld Mini ($15) or Cartesia Sonic | Kokoro or Qwen3-TTS |
| Voice cloning project | Fish Audio S2 ($15, 10s) or Inworld | Chatterbox-Turbo or Qwen3-TTS |
| Multilingual (40+ langs) | MiniMax 2.8 or Fish Audio S2 (80+) | Qwen3-TTS or CosyVoice 3 |
| CPU / laptop tinkering | - | Kokoro (82M, no GPU needed) |
| Most expressive / emotional | Hume Octave 2 | Chatterbox |
My free tool runs Kokoro (50+ voices) and Qwen3 voice cloning right in the browser - no Python, no GPU. Paste an article or drop in an ePub, pick a voice, and download a podcast episode.