Introduction | Voicebox

What is Voicebox?

Voicebox is a local-first voice cloning studio -- a free and open-source alternative to ElevenLabs. Clone voices from a few seconds of audio or pick from 50+ preset voices, generate speech in 23 languages across 7 TTS engines, apply post-processing effects, and compose multi-voice projects with a timeline editor.

Complete privacy -- models and voice data stay on your machine
7 TTS engines -- Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro
Cloning and preset voices -- zero-shot cloning from a reference sample, or curated preset voices via Kokoro (50 voices) and Qwen CustomVoice (9 voices)
23 languages -- from English to Arabic, Japanese, Hindi, Swahili, and more
Post-processing effects -- pitch shift, reverb, delay, chorus, compression, and filters
Expressive speech -- paralinguistic tags like [laugh], [sigh], [gasp] via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice
Unlimited length -- auto-chunking with crossfade for scripts, articles, and chapters
Stories editor -- multi-track timeline for conversations, podcasts, and narratives
API-first -- REST API for integrating voice synthesis into your own projects
Native performance -- built with Tauri (Rust), not Electron
Runs everywhere -- macOS (MLX/Metal), Windows and Linux (NVIDIA CUDA), AMD ROCm, Intel Arc, and Docker
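
To make the "auto-chunking with crossfade" idea concrete, here is a minimal sketch in plain Python. It is not Voicebox's actual implementation: the function names, the sentence-based splitting rule, the `max_chars` limit, and the linear fade curve are all assumptions chosen for illustration. Audio is modelled as a simple list of float samples.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: a single sentence longer than max_chars is kept
    whole as its own (oversized) chunk rather than being cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def crossfade(a, b, overlap):
    """Join two sample lists, linearly blending the tail of a into the head of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    blended = []
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 to near 1
        blended.append(a[len(a) - overlap + i] * (1 - weight) + b[i] * weight)
    return a[: len(a) - overlap] + blended + b[overlap:]
```

In this sketch, each text chunk would be synthesized separately and the resulting audio segments joined pairwise with `crossfade`, so the seams between chunks are blended rather than cut abruptly.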

TTS Engines

Seven engines with different strengths, switchable per generation:

Engine	Profile Type	Languages	Strengths
Qwen3-TTS (0.6B / 1.7B)	Cloned	10	High-quality multilingual cloning
Qwen CustomVoice (0.6B / 1.7B)	Preset (9 voices)	10	Natural-language delivery control (tone, emotion, pace)
LuxTTS	Cloned	English	Lightweight (~1 GB VRAM), 48 kHz output, 150x realtime on CPU
Chatterbox Multilingual	Cloned	23	Broadest language coverage
Chatterbox Turbo	Cloned	English	Fast 350M model with paralinguistic emotion/sound tags
TADA (1B / 3B)	Cloned	10	HumeAI speech-language model -- 700+ seconds of coherent audio
Kokoro	Preset (50 voices)	9	82M parameters, CPU realtime, lowest VRAM of any engine

GPU Support

Platform	Backend	Notes
macOS (Apple Silicon)	MLX (Metal)	4-5x faster inference via Metal GPU acceleration
Windows / Linux (NVIDIA)	PyTorch (CUDA)	Auto-downloads CUDA binary from within the app
Linux (AMD)	PyTorch (ROCm)	Auto-configures HSA_OVERRIDE_GFX_VERSION
Windows (any GPU)	DirectML	Universal Windows GPU support
Intel Arc	IPEX/XPU	Intel discrete GPU acceleration
Any	CPU	Works everywhere, just slower
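
The table above can be read as a fallback chain. The sketch below encodes it as a small lookup function. It is illustrative only: the function name, its parameters, the vendor strings, and the precedence between rows (for example, preferring IPEX/XPU over DirectML for Intel Arc on Windows) are assumptions, not Voicebox's actual detection logic.

```python
def pick_backend(os_name, gpu_vendor=None, apple_silicon=False):
    """Map a platform description to a backend name from the table.

    Hypothetical helper: os_name is "macos", "windows", or "linux";
    gpu_vendor is "nvidia", "amd", "intel_arc", another vendor string,
    or None when no GPU is available.
    """
    os_name = os_name.lower()
    vendor = (gpu_vendor or "").lower()

    if os_name == "macos" and apple_silicon:
        return "MLX (Metal)"
    if vendor == "nvidia" and os_name in ("windows", "linux"):
        return "PyTorch (CUDA)"
    if vendor == "amd" and os_name == "linux":
        return "PyTorch (ROCm)"
    if vendor == "intel_arc":
        return "IPEX/XPU"
    if os_name == "windows" and vendor:
        # Any other Windows GPU falls back to DirectML (assumed precedence).
        return "DirectML"
    return "CPU"
```

For example, `pick_backend("windows", "amd")` falls through to DirectML in this sketch, because the table lists ROCm only for Linux.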

Use Cases

Game development -- generate dynamic dialogue for characters
Content creation -- produce podcasts and video voiceovers
Accessibility -- build text-to-speech tools for users who need them
Voice assistants -- create custom voice interfaces
Production pipelines -- automate voiceover workflows via the REST API
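
As a rough illustration of driving a voiceover pipeline over REST, the sketch below builds an HTTP request with Python's standard library. The endpoint path (`/generate`), the JSON field names, the port, and the default values are all hypothetical placeholders; consult the actual API reference for the real routes and schema.

```python
import json
import urllib.request


def build_generation_request(base_url, text, profile_id, engine="kokoro", language="en"):
    """Build (but do not send) a POST request for a speech generation job.

    Every route and field name here is an assumption for illustration,
    not the documented Voicebox API.
    """
    payload = {
        "text": text,
        "profile_id": profile_id,  # hypothetical voice profile identifier
        "engine": engine,          # hypothetical engine selector
        "language": language,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",  # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local address; the real host and port may differ.
request = build_generation_request("http://localhost:8000/", "Hello, world.", "narrator-1")
```

Once the route and payload match the real API, the request could be sent with `urllib.request.urlopen(request)` and the response body written to an audio file.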

Tech Stack

Layer	Technology
Desktop App	Tauri (Rust)
Frontend	React, TypeScript, Tailwind CSS
State	Zustand, React Query
Backend	FastAPI (Python)
TTS Engines	Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, TADA, Kokoro
Effects	Pedalboard (Spotify)
Transcription	Whisper / Whisper Turbo (PyTorch or MLX)
Inference	MLX (Apple Silicon) / PyTorch (CUDA/ROCm/XPU/CPU)
Database	SQLite
Audio	WaveSurfer.js, librosa