What is Voicebox?
Voicebox is a local-first voice cloning studio -- a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio or pick from 50+ preset voices, generate speech in 23 languages across 7 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.
- Complete privacy -- models and voice data stay on your machine
- 7 TTS engines -- Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
- Cloning and preset voices -- zero-shot cloning from a reference sample, or curated preset voices via Kokoro (50 voices) and Qwen CustomVoice (9 voices)
- 23 languages -- from English to Arabic, Japanese, Hindi, Swahili, and more
- Post-processing effects -- pitch shift, reverb, delay, chorus, compression, and filters
- Expressive speech -- paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
- Unlimited length -- auto-chunking with crossfade for scripts, articles, and chapters
- Stories editor -- multi-track timeline for conversations, podcasts, and narratives
- API-first -- REST API for integrating voice synthesis into your own projects
- Native performance -- built with Tauri (Rust), not Electron
- Runs everywhere -- macOS (MLX/Metal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker
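The "Unlimited length" feature above describes auto-chunking with crossfade. The sketch below shows one way that idea can work; it is not Voicebox's actual implementation. The function names (`split_into_chunks`, `crossfade`, `stitch`), the sentence-based splitting rule, and the linear fade curve are all assumptions made for illustration, and audio is modeled as plain lists of float samples.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of up to max_chars.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Join two clips, linearly blending the end of a into the start of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    start = len(a) - overlap
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of clip b, rising toward 1
        blended.append(a[start + i] * (1 - w) + b[i] * w)
    return a[:start] + blended + b[overlap:]


def stitch(clips: list[list[float]], overlap: int) -> list[float]:
    """Crossfade a sequence of per-chunk clips into one continuous clip."""
    out: list[float] = []
    for clip in clips:
        out = crossfade(out, clip, overlap) if out else list(clip)
    return out
```

In this sketch each text chunk would be synthesized separately and the resulting clips passed to `stitch`; the overlap shortens the joined audio by `overlap` samples per seam.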
TTS Engines
Seven engines with different strengths, switchable per generation:
| Engine | Profile Type | Languages | Strengths |
| --- | --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | Cloned | 10 | High-quality multilingual cloning |
| Qwen CustomVoice (0.6B / 1.7B) | Preset (9 voices) | 10 | Natural-language delivery control (tone, emotion, pace) |
| LuxTTS | Cloned | English | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU |
| Chatterbox Multilingual | Cloned | 23 | Broadest language coverage |
| Chatterbox Turbo | Cloned | English | Fast 350M model with paralinguistic emotion/sound tags |
| TADA (1B / 3B) | Cloned | 10 | HumeAI speech-language model -- 700s+ coherent audio |
| Kokoro | Preset (50 voices) | 9 | 82M parameters, CPU realtime, lowest VRAM of any engine |
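Because engines are switchable per generation, it can help to treat the table above as data. The following is a rough sketch, not part of Voicebox: the `Engine` class, the `ENGINES` list, and the `candidates` helper are hypothetical names, and the only facts encoded are the profile types and language counts from the table (English-only engines are stored as 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    profile_type: str  # "Cloned" or "Preset", as listed in the table
    languages: int     # language count from the table; 1 means English only


# Values copied from the engine table; nothing here is read from Voicebox itself.
ENGINES = [
    Engine("Qwen3-TTS", "Cloned", 10),
    Engine("Qwen CustomVoice", "Preset", 10),
    Engine("LuxTTS", "Cloned", 1),
    Engine("Chatterbox Multilingual", "Cloned", 23),
    Engine("Chatterbox Turbo", "Cloned", 1),
    Engine("TADA", "Cloned", 10),
    Engine("Kokoro", "Preset", 9),
]


def candidates(profile_type: str, min_languages: int = 1) -> list[str]:
    """Return engine names matching a profile type and minimum language count."""
    return [
        e.name
        for e in ENGINES
        if e.profile_type == profile_type and e.languages >= min_languages
    ]
```

For example, `candidates("Preset")` narrows the choice to the two preset-voice engines, and `candidates("Cloned", min_languages=20)` leaves only Chatterbox Multilingual.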
GPU Support
| Platform | Backend | Notes |
| --- | --- | --- |
| macOS (Apple Silicon) | MLX (Metal) | 4-5x faster via Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |
| Linux (AMD) | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION |
| Windows (any GPU) | DirectML | Universal Windows GPU support |
| Intel Arc | IPEX/XPU | Intel discrete GPU acceleration |
| Any | CPU | Works everywhere, just slower |
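The table above reads naturally as a fallback chain ending at CPU. As a minimal sketch of that reading, the hypothetical `pick_backend` function below maps a platform description to a backend name from the table. The function, its parameters, and the priority order (in particular, preferring IPEX/XPU over DirectML for Intel Arc on Windows) are assumptions for illustration; how Voicebox actually detects hardware is not described here.

```python
from typing import Optional


def pick_backend(
    os_name: str,
    gpu: Optional[str] = None,
    apple_silicon: bool = False,
) -> str:
    """Map a platform description to a backend name from the GPU table.

    os_name: "macOS", "Windows", or "Linux"
    gpu: "nvidia", "amd", "intel_arc", "other", or None when no GPU is usable
    """
    if os_name == "macOS" and apple_silicon:
        return "MLX (Metal)"
    if gpu == "nvidia" and os_name in ("Windows", "Linux"):
        return "PyTorch (CUDA)"
    if gpu == "amd" and os_name == "Linux":
        return "PyTorch (ROCm)"
    if gpu == "intel_arc":
        return "IPEX/XPU"
    if gpu is not None and os_name == "Windows":
        # Any remaining Windows GPU falls under the DirectML row.
        return "DirectML"
    return "CPU"  # the "Any" row: always available, just slower
```

Under these assumptions an AMD card on Windows resolves to DirectML, while the same card on Linux resolves to PyTorch (ROCm).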
Use Cases
- Game development -- generate dynamic dialogue for characters
- Content creation -- produce podcasts and video voiceovers
- Accessibility -- build text-to-speech tools for users who need them
- Voice assistants -- create custom voice interfaces
- Production pipelines -- automate voiceover workflows via the REST API
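For the production-pipeline use case, a script would typically POST JSON to the local REST API. This section does not specify the API, so everything in the sketch below is a placeholder: the base URL and port, the `/generate` route, and the payload field names are hypothetical and should be replaced with the values from the Voicebox API documentation. The code only builds the request; it does not send it.

```python
import json
from urllib import request

# Hypothetical address; the real host and port come from your Voicebox setup.
BASE_URL = "http://localhost:8000"


def build_generate_request(
    text: str, voice_id: str, engine: str, language: str = "en"
) -> request.Request:
    """Build (but do not send) a JSON POST request for speech generation."""
    # Hypothetical field names, chosen for illustration only.
    payload = {
        "text": text,
        "voice_id": voice_id,
        "engine": engine,
        "language": language,
    }
    return request.Request(
        f"{BASE_URL}/generate",  # hypothetical route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the real route and fields are filled in, the request can be sent with `request.urlopen(req)` and the response handled according to the documented response format.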
Tech Stack
| Layer | Technology |
| --- | --- |
| Desktop App | Tauri (Rust) |
| Frontend | React, TypeScript, Tailwind CSS |
| State | Zustand, React Query |
| Backend | FastAPI (Python) |
| TTS Engines | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro |
| Effects | Pedalboard (Spotify) |
| Transcription | Whisper / Whisper Turbo (PyTorch or MLX) |
| Inference | MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU) |
| Database | SQLite |
| Audio | WaveSurfer.js, librosa |