Overview
Voicebox uses OpenAI's Whisper for automatic speech recognition (ASR). Transcription powers two flows:
- Reference-text auto-fill — when a user records or uploads a voice sample, the backend transcribes it and populates the reference_text field so cloning can use it.
- On-demand transcription — a user-facing /transcribe endpoint for arbitrary audio.
On Apple Silicon, the transcription path runs through MLX-Whisper (from mlx-audio) for ~8× faster inference than PyTorch. Everywhere else it runs through the transformers Whisper implementation on PyTorch.
Architecture
Transcription is wired through the same backend abstraction as TTS. The STTBackend protocol lives in backend/backends/__init__.py:
from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class STTBackend(Protocol):
    async def load_model(self, model_size: str) -> None: ...

    async def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        model_size: Optional[str] = None,
    ) -> str: ...

    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
Two implementations ship today:
- MLXSTTBackend (backends/mlx_backend.py) — uses mlx_audio.stt.load(). Default on Apple Silicon.
- PyTorchSTTBackend (backends/pytorch_backend.py) — uses transformers.WhisperForConditionalGeneration. Default everywhere else.
get_stt_backend() picks the right one based on get_backend_type(). backend/services/transcribe.py is a thin wrapper that delegates to the backend.
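A minimal sketch of that dispatch, assuming get_backend_type() returns "mlx" on Apple Silicon (the exact return values are an assumption, not confirmed by the source):

_stt_backend = None  # module-level singleton; see Memory Management below

def get_stt_backend() -> STTBackend:
    global _stt_backend
    if _stt_backend is None:
        # "mlx" as the Apple Silicon backend type is an assumption
        if get_backend_type() == "mlx":
            _stt_backend = MLXSTTBackend()
        else:
            _stt_backend = PyTorchSTTBackend()
    return _stt_backend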
Model Sizes
Five Whisper variants are registered in ModelConfig:
| Model | HuggingFace Repo | Size | Notes |
|---|---|---|---|
| Base | openai/whisper-base | ~300 MB | Default; fast, decent quality |
| Small | openai/whisper-small | ~500 MB | Better quality, still fast |
| Medium | openai/whisper-medium | ~1.5 GB | High quality |
| Large | openai/whisper-large-v3 | ~3 GB | Best quality, slow on CPU |
| Turbo | openai/whisper-large-v3-turbo | ~1.5 GB | Large-tier quality, ~5× faster than Large |
The tiny model is not exposed — its quality gap relative to Base wasn't worth the smaller download.
Turbo + MLX-Whisper on Apple Silicon dropped user-facing transcription latency from ~20s to ~2-3s in v0.1.10.
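In code, the size-to-repo mapping amounts to something like the following (the WHISPER_MODELS name and dict shape are assumptions; the real registry lives in ModelConfig):

WHISPER_MODELS = {
    "base": "openai/whisper-base",
    "small": "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large": "openai/whisper-large-v3",
    "turbo": "openai/whisper-large-v3-turbo",
}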
Language Hints
Whisper can auto-detect language, but providing a hint improves accuracy on short clips:
text = await backend.transcribe(audio_path, language="en")
Accepted language codes are the standard Whisper set (99+ languages). The frontend typically passes the profile's language if available, or lets Whisper detect otherwise.
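That fallback might look like this at the call site (profile here is a hypothetical object; only the transcribe() call comes from the source):

language = profile.language or None  # None lets Whisper auto-detect
text = await backend.transcribe(audio_path, language=language)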
Model Loading
Both backends are lazy: the model is loaded on first use and cached in memory. Switching sizes unloads the previous model.
On MLX, the model is loaded via mlx_audio.stt.load(hf_repo). On PyTorch, via:
processor = WhisperProcessor.from_pretrained(hf_repo)
model = WhisperForConditionalGeneration.from_pretrained(hf_repo).to(device)
Both load paths use model_load_progress() from backends/base.py so the frontend sees live download progress on the first use.
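A condensed sketch of the lazy-load pattern on the PyTorch side. The attribute names are assumptions, and the model_load_progress() wiring is omitted for brevity; WHISPER_MODELS refers to the registry sketched under Model Sizes:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

class PyTorchSTTBackend:
    def __init__(self) -> None:
        self.model = None
        self.processor = None
        self.model_size = None

    async def load_model(self, model_size: str) -> None:
        if self.model is not None and self.model_size == model_size:
            return  # already cached in memory
        self.model = None  # switching sizes drops the previous model first
        hf_repo = WHISPER_MODELS[model_size]
        self.processor = WhisperProcessor.from_pretrained(hf_repo)
        self.model = WhisperForConditionalGeneration.from_pretrained(hf_repo).to(device)
        self.model_size = model_size

    def is_loaded(self) -> bool:
        return self.model is not None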
Audio Preprocessing
Whisper expects mono 16 kHz audio. The audio utility in backend/utils/audio.py handles resampling and format conversion transparently:
- Formats: WAV, MP3, FLAC, OGG, M4A (via soundfile / librosa)
- Target: mono, 16 kHz, float32
Files longer than Whisper's 30-second window are handled by the underlying library's chunking logic — no explicit splitting in Voicebox code.
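The conversion itself is a one-liner with librosa; the load_audio_16k name is hypothetical, and backend/utils/audio.py may differ in detail:

import librosa

def load_audio_16k(path: str):
    # librosa downmixes to mono, resamples to 16 kHz, and returns float32
    audio, _sr = librosa.load(path, sr=16000, mono=True)
    return audio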
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /transcribe | Transcribe an uploaded audio file |
Request
Multipart form data:
POST /transcribe
Content-Type: multipart/form-data
file: <audio_file>
language: en # optional
model_size: base # optional (default: "base")
Response
{
"text": "Hello, this is a test transcription.",
"duration": 3.5
}
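For example, calling the endpoint from Python with requests (the host and port are assumptions):

import requests

with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/transcribe",  # host/port are an assumption
        files={"file": f},
        data={"language": "en", "model_size": "base"},
    )
resp.raise_for_status()
print(resp.json()["text"])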
Use Cases
Reference Text for Voice Cloning
Adding a voice sample triggers transcription automatically:
- User uploads or records audio.
- The backend writes the audio file and calls /transcribe internally (or the frontend calls it separately).
- The returned text becomes reference_text on the new profile_samples row.
- Cloning engines that need reference text (Chatterbox, TADA, etc.) read it from there.
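In sketch form, the auto-fill step amounts to the following; save_sample is a hypothetical stand-in for the real profile_samples insert:

async def add_voice_sample(profile_id: str, audio_path: str) -> None:
    backend = get_stt_backend()
    text = await backend.transcribe(audio_path)
    # save_sample is hypothetical; the real persistence code differs
    save_sample(profile_id, audio_path, reference_text=text)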
Quality Tips
- Provide a language hint for short clips (under 5 seconds) — auto-detection is unreliable with so little audio.
- Use Turbo or Large for noisy audio — Base can hallucinate on hard inputs.
- Prefer clean audio; transcription errors become reference-text errors, which become cloning errors.
Memory Management
unload_model() drops the model reference and clears the CUDA cache if applicable. /models/unload wires this up for manual control.
A singleton per backend is returned by get_stt_backend() — multiple callers share one Whisper instance.
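Continuing the PyTorchSTTBackend sketch from Model Loading, the unload path might look like this (again, attribute names are assumptions):

import torch

class PyTorchSTTBackend:
    ...

    def unload_model(self) -> None:
        self.model = None
        self.processor = None
        self.model_size = None
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU memory back to the driver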
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Model not found | First run + network failure | Retry; check connectivity |
| OOM on load | Large model on low-VRAM GPU | Switch to Small or Turbo |
| Empty result | No speech in audio | Confirm input has voice; check trim |
| Wrong language | Auto-detect misfired | Pass language hint |
Next Steps
- Download / load / unload any model
- How reference text is stored alongside samples
- Platform-specific acceleration including MLX-Whisper