## Overview

Voicebox manages two categories of models:

- **TTS Models** — Seven engines covering zero-shot cloning and preset voices. Each engine may have one or more size variants.
- **ASR Models** — Whisper for transcription. Five sizes, plus MLX-Whisper on Apple Silicon for ~8× faster transcription.

Every model is described by a `ModelConfig` entry in `backend/backends/__init__.py`. Models are downloaded from HuggingFace Hub on first use and cached in the platform-standard HF cache.
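As a point of reference, a `ModelConfig` entry can be pictured roughly as the dataclass below. This is a sketch with field names inferred from the examples on this page; the real class in `backend/backends/__init__.py` may have more fields or different defaults.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelConfig:
    # Field names inferred from the examples in this page; the real class
    # in backend/backends/__init__.py may differ.
    model_name: str       # registry key, e.g. "qwen-tts-1.7B"
    display_name: str     # human-readable name shown in the frontend
    engine: str           # engine identifier, e.g. "qwen", "kokoro"
    hf_repo_id: str       # HuggingFace repo to download from
    model_size: str = ""  # size variant label, e.g. "1.7B"
    size_mb: int = 0      # approximate download size
    languages: List[str] = field(default_factory=list)
```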
## Available TTS Models

| Model | Engine | HuggingFace Repo | Size | VRAM | Languages |
|---|---|---|---|---|---|
| Qwen TTS 1.7B | `qwen` | `Qwen/Qwen3-TTS-12Hz-1.7B-Base` | 3.5 GB | ~6 GB | 10 |
| Qwen TTS 0.6B | `qwen` | `Qwen/Qwen3-TTS-12Hz-0.6B-Base` | 1.2 GB | ~2 GB | 10 |
| Qwen CustomVoice 1.7B | `qwen_custom_voice` | `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice` | 3.5 GB | ~6 GB | 10 |
| Qwen CustomVoice 0.6B | `qwen_custom_voice` | `Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice` | 1.2 GB | ~2 GB | 10 |
| LuxTTS | `luxtts` | `YatharthS/LuxTTS` | 300 MB | ~1 GB | English |
| Chatterbox Multilingual | `chatterbox` | `ResembleAI/chatterbox` | 3.2 GB | ~3 GB | 23 |
| Chatterbox Turbo | `chatterbox_turbo` | `ResembleAI/chatterbox-turbo` | 1.5 GB | ~1.5 GB | English |
| TADA 1B | `tada` | `HumeAI/tada-1b` | 4 GB | ~4 GB | English |
| TADA 3B Multilingual | `tada` | `HumeAI/tada-3b-ml` | 8 GB | ~8 GB | 10 |
| Kokoro 82M | `kokoro` | `hexgrad/Kokoro-82M` | 350 MB | ~150 MB | 8 |
On Apple Silicon, Qwen TTS uses MLX-optimized repos from `mlx-community` instead of the PyTorch repos. The backend picks automatically via `get_backend_type()`.
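The platform check behind that choice can be sketched as follows. This is a hypothetical stand-in for the real `get_backend_type()`, which lives in the backend and may consider more than the OS and CPU architecture:

```python
import platform

def get_backend_type() -> str:
    # Sketch: prefer the MLX backend on Apple Silicon macs, otherwise
    # fall back to the PyTorch backend. The real function may also
    # inspect installed packages or user configuration.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"
    return "pytorch"
```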
## Available Whisper Models

| Model | HuggingFace Repo | Size |
|---|---|---|
| Whisper Base | `openai/whisper-base` | ~300 MB |
| Whisper Small | `openai/whisper-small` | ~500 MB |
| Whisper Medium | `openai/whisper-medium` | ~1.5 GB |
| Whisper Large | `openai/whisper-large-v3` | ~3 GB |
| Whisper Turbo | `openai/whisper-large-v3-turbo` | ~1.5 GB |
On Apple Silicon, MLX-Whisper is preferred automatically — see Transcription.
## Model Storage

Models live in the platform HuggingFace cache:

| Platform | Path |
|---|---|
| macOS | `~/.cache/huggingface/hub/` |
| Linux | `~/.cache/huggingface/hub/` |
| Windows | `%USERPROFILE%\.cache\huggingface\hub\` |
| Docker | `/home/voicebox/.cache/huggingface/hub` (volume-mounted) |

Set `VOICEBOX_MODELS_DIR` to override.
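A sketch of how that resolution might work, assuming `VOICEBOX_MODELS_DIR` simply takes precedence over the default HF hub cache path (the actual lookup in the backend may differ):

```python
import os
from pathlib import Path

def resolve_models_dir() -> Path:
    # Hypothetical sketch: an explicit VOICEBOX_MODELS_DIR wins;
    # otherwise fall back to the standard HuggingFace hub cache.
    override = os.environ.get("VOICEBOX_MODELS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "huggingface" / "hub"
```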
## Progress Tracking

Downloads stream progress to the frontend via Server-Sent Events. The progress pipeline has three pieces:

- **ProgressManager** (`backend/utils/progress.py`) — in-memory map of `model_name` → `{current, total, filename, status}`.
- **HFProgressTracker** — context manager that intercepts HuggingFace Hub downloads to emit byte-level progress. Needed because `huggingface_hub` silently disables tqdm in frozen PyInstaller builds.
- **SSE endpoint** — `GET /models/progress/{model_name}` streams updates until `status` is `complete` or `error`.
```javascript
// Frontend
const eventSource = new EventSource(`/models/progress/${modelName}`);
eventSource.onmessage = (event) => {
  const { current, total, status } = JSON.parse(event.data);
  updateProgressBar(current / total);
  if (status === "complete" || status === "error") eventSource.close();
};
```
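On the backend side, the ProgressManager piece can be pictured as a small thread-safe dictionary. This is a minimal sketch, not the actual `backend/utils/progress.py` implementation:

```python
import threading
from typing import Dict, Optional

class ProgressManager:
    """Minimal sketch of an in-memory progress map: model_name ->
    {current, total, filename, status}. Hypothetical, for illustration."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._progress: Dict[str, dict] = {}

    def update(self, model_name: str, current: int, total: int,
               filename: str = "", status: str = "downloading") -> None:
        # Called by the download tracker as bytes arrive.
        with self._lock:
            self._progress[model_name] = {
                "current": current, "total": total,
                "filename": filename, "status": status,
            }

    def get(self, model_name: str) -> Optional[dict]:
        # Polled by the SSE endpoint to emit the latest snapshot.
        with self._lock:
            return self._progress.get(model_name)
```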
## Model Status

`GET /models/status` returns every registered model's current state:

```json
{
  "models": [
    {
      "model_name": "qwen-tts-1.7B",
      "display_name": "Qwen TTS 1.7B",
      "engine": "qwen",
      "downloaded": true,
      "size_mb": 3500,
      "loaded": true
    },
    ...
  ]
}
```
The handler iterates `get_all_model_configs()` and calls `check_model_loaded(config)` for each entry, so new engines appear automatically once they're registered in `ModelConfig`.
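That handler shape can be sketched like this, with stubs standing in for the real registry helpers (names taken from the description above; the real implementations live in the backend):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    model_name: str
    display_name: str
    engine: str
    size_mb: int

def get_all_model_configs():
    # Stub standing in for the real registry; returns one example entry.
    return [ModelConfig("kokoro", "Kokoro 82M", "kokoro", 350)]

def check_model_loaded(config):
    # Stub: the real version asks the engine's backend whether the
    # model is resident in memory.
    return False

def models_status():
    # Sketch of the /models/status handler: one dict per registered config.
    return {"models": [
        {"model_name": c.model_name, "display_name": c.display_name,
         "engine": c.engine, "size_mb": c.size_mb,
         "loaded": check_model_loaded(c)}
        for c in get_all_model_configs()
    ]}
```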
## Manual Model Operations

| Method | Endpoint | Description |
|---|---|---|
| GET | `/models/status` | Status of every registered model |
| POST | `/models/load` | Load a TTS model into memory |
| POST | `/models/unload` | Unload a TTS model from memory |
| POST | `/models/download` | Trigger a background download |
| GET | `/models/progress/{name}` | Stream download progress (SSE) |
| DELETE | `/models/{name}` | Delete a downloaded model from cache |
### Load

`POST /models/load`

```json
{
  "model_name": "qwen-tts-1.7B"
}
```

The route looks up the config, dispatches to `get_model_load_func(config)`, and returns once the model is ready.
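The lookup-then-dispatch step can be sketched as below, with stub registries in place of the real config and loader tables (only `get_model_load_func` is named in this page; the rest is illustrative):

```python
# Stub registries standing in for the real backend tables.
CONFIGS = {"qwen-tts-1.7B": {"engine": "qwen"}}
LOADERS = {"qwen": lambda config: f"loaded {config['engine']} model"}

def get_model_config(model_name):
    # Hypothetical lookup helper.
    return CONFIGS[model_name]

def get_model_load_func(config):
    # Named in this page: routes a config to its engine's loader.
    return LOADERS[config["engine"]]

def load_model(model_name):
    # Sketch of the /models/load route body.
    config = get_model_config(model_name)
    return get_model_load_func(config)(config)
```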
### Unload

`POST /models/unload`

```json
{
  "model_name": "chatterbox-tts"
}
```

Calls `unload_model_by_config(config)`, which routes to the right backend's `unload_model()` and frees GPU memory.
### Download

`POST /models/download`

```json
{
  "model_name": "kokoro"
}
```

Fires off an async download task. Progress is available via the SSE endpoint. Download is triggered automatically on first generation, so this is only needed for pre-warming.
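A non-browser client that follows the progress stream needs to parse SSE `data:` lines itself. A minimal helper, assuming each event's JSON payload fits on a single `data:` line (as in the frontend example above):

```python
import json
from typing import Optional

def parse_progress_event(line: str) -> Optional[dict]:
    # Parse one "data: {...}" line from the SSE progress stream.
    # Non-data lines (comments, keep-alives) return None.
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None
```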
## Preset Voice Seeding

For engines that use preset voices (Kokoro, Qwen CustomVoice), the backend auto-creates a voice profile per preset voice after the model is downloaded. This is driven by `seed_preset_profiles(engine)` in `backend/services/profiles.py`, called from the models route once download completes.

Preset profiles have:

- `voice_type = "preset"`
- `preset_engine` = engine name (`"kokoro"`, `"qwen_custom_voice"`)
- `preset_voice_id` = engine-specific voice ID (`"am_adam"`, `"f000001"`, etc.)
- No `profile_samples` rows — no audio to store

See Voice Profiles for the schema.
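The seeding loop can be sketched as below. The `preset_voice_ids` and `create_profile` parameters are illustrative stand-ins; the real `seed_preset_profiles(engine)` takes only the engine name and talks to the profile service directly:

```python
def seed_preset_profiles(engine, preset_voice_ids, create_profile):
    # Sketch: one preset profile per voice, with no audio samples.
    # create_profile stands in for the real profile-creation service.
    return [
        create_profile(
            voice_type="preset",
            preset_engine=engine,
            preset_voice_id=voice_id,
        )
        for voice_id in preset_voice_ids
    ]
```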
## Adding a New Model

To add a new size variant of an existing engine, just add another `ModelConfig`:

```python
ModelConfig(
    model_name="qwen-tts-3B",
    display_name="Qwen TTS 3B",
    engine="qwen",
    hf_repo_id="Qwen/Qwen3-TTS-12Hz-3B-Base",
    model_size="3B",
    size_mb=7000,
    languages=["zh", "en", ...],
),
```

The frontend picks it up via `/models/status`; the download/load flow works without further changes.
Adding a whole new engine is a bigger lift — see TTS Engines for the full phased workflow.
## Error Handling

| Error | Cause | Fix |
|---|---|---|
| Download failed | Network / HF rate limit | Retry |
| OOM on load | Not enough VRAM | Use a smaller variant, unload other engines |
| Model not found | Corrupt cache | Re-download via `/models/download` |
| Stuck progress bar in frozen build | `huggingface_hub` tqdm silenced | `HFProgressTracker` force-enables the internal counter |
| GPU architecture unsupported | PyTorch wheel doesn't target your GPU | See GPU Acceleration |
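The "use a smaller variant" fix for OOM can be automated on the client side. A hedged sketch, not a feature of the API itself; the `load` callable and variant list are caller-supplied:

```python
def load_with_fallback(load, variants):
    # Sketch: try variants largest-first, falling back on out-of-memory.
    # Real code would catch the framework's OOM error type
    # (e.g. torch.cuda.OutOfMemoryError) rather than MemoryError.
    last_error = None
    for name in variants:
        try:
            return load(name)
        except MemoryError as exc:
            last_error = exc
    raise RuntimeError("no variant fits in available VRAM") from last_error
```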
## Next Steps

- How generation flows through the registry
- Add a new engine end-to-end
- Whisper and MLX-Whisper integration