Voice Cloning

Overview

Voicebox can replicate a specific person's voice from a short audio sample — known as zero-shot voice cloning. You provide 10-30 seconds of clear speech, the model extracts a voice embedding, and from then on you can generate any text in that voice.

Five engines in 0.4 support cloning:

| Engine | Languages | Strengths |
| --- | --- | --- |
| Qwen3-TTS (0.6B / 1.7B) | 10 | High-quality multilingual; supports delivery instructions on the same kwarg |
| Chatterbox Multilingual | 23 | Broadest language coverage: Arabic, Hindi, Swahili, Hebrew, and more |
| Chatterbox Turbo | English | Fast 350M model with paralinguistic emotion tags ([laugh], [sigh]) |
| LuxTTS | English | Lightweight (~1 GB VRAM), 48 kHz output, 150x realtime on CPU |
| TADA (1B / 3B) | 10 | Speech-language model with 700s+ coherent long-form generation |

Don't want to record audio? Use a curated voice from Kokoro or Qwen CustomVoice instead — see [Preset Voices](/overview/preset-voices).

How It Works

1. Upload or Record Sample: provide 10-30 seconds of clear speech from the target voice.

2. Engine Analysis: the selected engine analyzes vocal characteristics, tone, and speaking patterns.

3. Voice Profile Created: a voice embedding is generated and stored with your profile.

4. Generate Speech: use the profile to generate any text in the cloned voice.
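The four steps above can be sketched end to end. This is an illustrative sketch, not the Voicebox API: the `VoiceProfile` class, the log-energy "embedding", and the `generate` placeholder are simplified stand-ins for what a real engine does internally.

```python
from dataclasses import dataclass

import numpy as np


def extract_embedding(audio: np.ndarray, frame: int = 1024) -> np.ndarray:
    """Toy stand-in for an engine's voice encoder: summarize the clip
    as per-frame log-energy statistics instead of a learned embedding."""
    n = max(len(audio) // frame, 1)
    frames = audio[: n * frame].reshape(n, -1)
    log_energy = np.log1p((frames ** 2).mean(axis=1))
    return np.array([log_energy.mean(), log_energy.std()])


@dataclass
class VoiceProfile:
    """Step 3: the stored result of analyzing a sample."""
    name: str
    embedding: np.ndarray


def create_profile(name: str, audio: np.ndarray) -> VoiceProfile:
    """Steps 1-3: analyze the uploaded sample and store an embedding."""
    return VoiceProfile(name, extract_embedding(audio))


def generate(profile: VoiceProfile, text: str) -> str:
    """Step 4 placeholder: a real engine would synthesize audio here."""
    return f"<audio of {profile.name!r} saying {text!r}>"


rng = np.random.default_rng(0)
sample = rng.normal(scale=0.1, size=15 * 24_000)  # fake 15 s clip at 24 kHz
profile = create_profile("narrator", sample)
print(generate(profile, "Hello from a cloned voice."))
```

The key property mirrored here is that the embedding is computed once per profile, then reused for every generation.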

Choosing an Engine for Cloning

Different engines suit different use cases. The profile grid greys out unsupported engines so you can switch easily.

| If you want… | Pick |
| --- | --- |
| Best overall quality on a few common languages | Qwen3-TTS 1.7B |
| Faster generation, slightly lower quality | Qwen3-TTS 0.6B |
| Languages outside Qwen's 10 (Arabic, Hindi, etc.) | Chatterbox Multilingual |
| Expressive English with [laugh] [sigh] tags | Chatterbox Turbo |
| CPU-only or GPU-light setup, English | LuxTTS |
| Long-form generation (audiobooks, full chapters) | TADA 3B |
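The table above can be folded into a small selection helper. The engine names come from the table; the function itself, its parameters, and its priority order are illustrative, not part of Voicebox.

```python
def pick_engine(language: str = "en", *, emotion_tags: bool = False,
                long_form: bool = False, cpu_only: bool = False,
                prefer_speed: bool = False) -> str:
    """Illustrative engine picker following the decision table above."""
    # Qwen3-TTS's ten supported languages (ISO 639-1 codes).
    qwen_langs = {"en", "zh", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"}
    if language == "en":
        if emotion_tags:
            return "Chatterbox Turbo"   # [laugh] / [sigh] tags, English only
        if cpu_only:
            return "LuxTTS"             # lightweight, CPU-friendly
    if long_form:
        return "TADA 3B"                # audiobooks, full chapters
    if language in qwen_langs:
        return "Qwen3-TTS 0.6B" if prefer_speed else "Qwen3-TTS 1.7B"
    return "Chatterbox Multilingual"    # broadest language coverage

print(pick_engine("hi"))                     # Chatterbox Multilingual
print(pick_engine("en", emotion_tags=True))  # Chatterbox Turbo
```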

Best Practices

Sample Quality

Do
  • Use 10-30 seconds of audio
  • Clear, consistent speaking
  • Minimal background noise
  • Natural speaking pace
Don't
  • Very short clips (< 5 seconds)
  • Heavy background noise
  • Music or overlapping voices
  • Heavily processed audio
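The Do/Don't guidance can be checked programmatically before uploading. The sketch below is not the Voicebox validator: it flags clips outside the 10-30 second range and estimates background noise with a crude silence-floor heuristic, and the thresholds are assumptions.

```python
import numpy as np


def check_sample(audio: np.ndarray, sr: int) -> list[str]:
    """Return warnings for a mono float audio sample before upload."""
    warnings = []
    duration = len(audio) / sr
    if duration < 10:
        warnings.append(f"too short ({duration:.1f}s); aim for 10-30s")
    elif duration > 30:
        warnings.append(f"longer than needed ({duration:.1f}s); 10-30s is enough")
    # Crude noise-floor estimate: RMS of the quietest 10% of 20 ms frames.
    # If the "silence" is nearly as loud as the speech, the room is noisy.
    frame = sr // 50
    n = max(len(audio) // frame, 1)
    rms = np.sqrt((audio[: n * frame].reshape(n, -1) ** 2).mean(axis=1))
    floor, peak = np.percentile(rms, 10), rms.max() + 1e-12
    if floor / peak > 0.3:  # assumed threshold
        warnings.append("high background noise; record somewhere quieter")
    return warnings
```

Running a clean 15-second clip through `check_sample` returns an empty list; a 4-second clip returns a "too short" warning.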

Multiple Samples

Adding multiple samples from the same speaker can improve quality:

  • Different speaking styles (casual, formal)
  • Different emotions (happy, serious)
  • Different recording conditions

The model learns a more robust representation from diverse samples. This is especially helpful for distinctive voices that the model might otherwise smooth over.
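One common way to combine multiple samples is to average their per-sample embeddings into a single profile vector; varied recordings cancel per-clip quirks while keeping what the samples share. The toy vectors below illustrate the idea only; this is not necessarily how Voicebox engines combine samples internally.

```python
import numpy as np


def combine_embeddings(embeddings: list[np.ndarray]) -> np.ndarray:
    """Average per-sample voice embeddings into one profile vector."""
    return np.stack(embeddings).mean(axis=0)


# Hypothetical embeddings from two recordings of the same speaker.
casual = np.array([0.9, 0.1, 0.4])
formal = np.array([1.1, -0.1, 0.6])
profile_vec = combine_embeddings([casual, formal])
print(profile_vec)  # [1.  0.  0.5]
```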

Supported Languages by Engine

  • Qwen3-TTS — English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian (10)
  • Chatterbox Multilingual — Arabic, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Turkish (23)
  • Chatterbox Turbo — English
  • LuxTTS — English
  • TADA 3B — 10 languages; TADA 1B — English only
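The per-engine lists above can be folded into a lookup table for pre-flight checks. The dictionary mirrors the lists exactly (TADA is omitted because its ten languages are not enumerated here); the helper function is illustrative.

```python
SUPPORTED = {
    "Qwen3-TTS": {"English", "Chinese", "Japanese", "Korean", "German",
                  "French", "Russian", "Portuguese", "Spanish", "Italian"},
    "Chatterbox Multilingual": {
        "Arabic", "Chinese", "Danish", "Dutch", "English", "Finnish",
        "French", "German", "Greek", "Hebrew", "Hindi", "Italian",
        "Japanese", "Korean", "Malay", "Norwegian", "Polish",
        "Portuguese", "Russian", "Spanish", "Swahili", "Swedish", "Turkish",
    },
    "Chatterbox Turbo": {"English"},
    "LuxTTS": {"English"},
}


def engines_for(language: str) -> list[str]:
    """Engines from the lists above that can clone in `language`."""
    return sorted(e for e, langs in SUPPORTED.items() if language in langs)


print(engines_for("Hindi"))  # ['Chatterbox Multilingual']
```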

For complete language tables and engine-specific notes, see the TTS Engines developer guide.

Limitations

Voice cloning should only be used with consent. Ensure you have permission to clone someone's voice. See the project's [SECURITY.md](https://github.com/jamiepine/voicebox/blob/main/SECURITY.md) and your local laws on synthetic voice content.
  • Quality depends on sample clarity — noisy samples produce noisy clones
  • Works best with consistent speaking tone within a sample
  • May struggle with extreme accents or speech impediments
  • Background noise reduces quality and can introduce artifacts

Next Steps